Insights: ggml-org/llama.cpp
Overview
66 Releases published by 1 person
- May 8, 2025: b5308, b5309, b5310, b5311, b5313, b5315, b5317, b5318
- May 9, 2025: b5320, b5321, b5322, b5323, b5324, b5325, b5326, b5327, b5328, b5329, b5330, b5331, b5332
- May 10, 2025: b5333, b5334, b5335, b5336, b5338, b5340, b5341, b5342
- May 11, 2025: b5344, b5345, b5346, b5347, b5349, b5350
- May 12, 2025: b5351, b5352, b5353, b5354, b5355, b5356, b5357, b5358, b5359, b5360, b5361
- May 13, 2025: b5363, b5365, b5366, b5367, b5368, b5369, b5370, b5371
- May 14, 2025: b5372, b5377, b5378, b5379, b5380, b5381, b5382, b5384, b5385, b5387, b5388
- May 15, 2025: b5390
84 Pull requests merged by 30 people
- bench : handle decode errors (#13548, merged May 15, 2025)
- server : inject date_string in llama 3.x template + fix date for firefunction v2 (#12802, merged May 15, 2025)
- kv-cache : fix out-of-bounds view during reserve graph (#13547, merged May 14, 2025)
- arm64: optimize q6_k_q8_k kernel with i8mm (#13519, merged May 14, 2025)
- common : add partial regex support (#12808, merged May 14, 2025)
- editorconfig : fix trailing whitespace from #13542 (#13546, merged May 14, 2025)
- fix: crash when calling `llama_state_get_size` on a context without a KV cache (#13542, merged May 14, 2025)
- CUDA: fix crash on large batch size for quant. MoE (#13537, merged May 14, 2025)
- llama : fix quantize with dl backends (#13539, merged May 14, 2025)
- CUDA: faster Deepseek FA, add Turing support (#13435, merged May 14, 2025)
- Granite MoE NoPE fix (#13538, merged May 14, 2025)
- server : passthrough the /models endpoint during loading (#13535, merged May 14, 2025)
- server : fix cache_tokens bug with no cache_prompt (#13533, merged May 14, 2025)
- cmake: simplify vulkan shader test logic (#13263, merged May 14, 2025)
- vulkan: KHR_coopmat flash attention (#13506, merged May 14, 2025)
- webui : use fflate for more deterministic gzip compress (#13525, merged May 14, 2025)
- webui: Allow pasting file from clipboard (#13526, merged May 14, 2025)
- docs: Update link to ggml-org in multimodal.md (#13513, merged May 14, 2025)
- scripts : fix compare-llama-bench.py show parameter (#13514, merged May 14, 2025)
- vulkan: workaround FA compile failures on macos (#13517, merged May 14, 2025)
- quantize: improve pattern matching for allowed tensors (#13033, merged May 13, 2025)
- clip : clip.h becomes private API (⚠️ breaking change) (#13510, merged May 13, 2025)
- metal : use FA-vec kernel up to batch size 20 (#13496, merged May 13, 2025)
- metal : optimize multi-sequence FA vec kernel (#13493, merged May 13, 2025)
- ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509, merged May 13, 2025)
- batched-bench : fix pp batch contents (#13492, merged May 13, 2025)
- mtmd : remove libllava, remove clip-quantize-cli (⚠️ breaking change) (#13460, merged May 13, 2025)
- scripts : support arbitrary input file formats in compare-llama-bench.py (#13455, merged May 13, 2025)
- Model: Granite MoE shared (#13269, merged May 13, 2025)
- sync : ggml (#13502, merged May 13, 2025)
- llama-bench : add defrag-thold, check for invalid ranges (#13487, merged May 12, 2025)
- opencl: remove unnecessary assert for `add` (#13257, merged May 12, 2025)
- clip : cap max image size 1024 for qwen vl model (#13478, merged May 12, 2025)
- llama/ggml: add LLM training support (#10544, merged May 12, 2025)
- context : fix state io for memory-less contexts (#13470, merged May 12, 2025)
- Allow content null for tool call (#13477, merged May 12, 2025)
- llama-bench : accept ranges for integer parameters (#13410, merged May 12, 2025)
- ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053, merged May 12, 2025)
- CUDA: fix misaligned synchronization in FA (#13469, merged May 12, 2025)
- ggml : add mrope kernel for metal (#13457, merged May 12, 2025)
- sycl: enable dpcpp nightly builds with oneMKL and oneDNN (#13406, merged May 12, 2025)
- mtmd : use RMS norm for InternVL 3 38B and 78B mmproj (#13459, merged May 11, 2025)
- tools : fix invalid free() (#13436, merged May 11, 2025)
- scripts : exit compare-llama-bench.py gracefully when there's nothing to compare (#13451, merged May 11, 2025)
- CUDA: fix crash with partial offloading of MoE (#13439, merged May 11, 2025)
- Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386, merged May 11, 2025)
- mtmd : support InternVL 3 38B and 78B mmproj (#13443, merged May 11, 2025)
- mtmd : move helpers to dedicated file (#13442, merged May 11, 2025)
- readme: Fix typo in InternVL model name (#13440, merged May 10, 2025)
- CUDA: fix race conditions in FlashAttention kernels (#13438, merged May 10, 2025)
- vocab : add ByteDance-Seed/Seed-Coder (#13423, merged May 10, 2025)
- mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl (#13434, merged May 10, 2025)
- server : update docs (#13432, merged May 10, 2025)
- llguidance : init tokenizer slices (#13424, merged May 10, 2025)
- ci: `free_disk_space` flag enabled for intel variant (#13426, merged May 10, 2025)
- mtmd : support InternVL 2.5 and 3 (#13422, merged May 10, 2025)
- CUDA: fix FlashAttention on Turing (#13415, merged May 10, 2025)
- arg : add env var to control mmproj (#13416, merged May 10, 2025)
- vulkan: scalar flash attention implementation (#13324, merged May 10, 2025)
- Use tagged version of llguidance that does not break the build (#13413, merged May 9, 2025)
- server : vision support via libmtmd (#12898, merged May 9, 2025)
- sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858, merged May 9, 2025)
- metal : optimize MoE for large batches (#13388, merged May 9, 2025)
- CUDA: FA support for Deepseek (Ampere or newer) (#13306, merged May 9, 2025)
- llama : do not crash if there is no CPU backend (#13395, merged May 9, 2025)
- CUDA: fix crash on large batch size for MoE models (#13384, merged May 9, 2025)
- Add --parse-special for enabling parsing of special tokens in imatrix calculation (#13389, merged May 9, 2025)
- llama-run: add support for downloading models from ModelScope (#13370, merged May 9, 2025)
- mtmd : fix batch_view for m-rope (#13397, merged May 9, 2025)
- llama : one-off chat template fix for Mistral-Small-2503 (#13398, merged May 9, 2025)
- rpc : add rpc_msg_set_tensor_hash_req (#13353, merged May 9, 2025)
- vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326, merged May 9, 2025)
- server : (webui) rename has_multimodal --> modalities (#13393, merged May 9, 2025)
- ci : limit write permission to only the release step + fixes (#13392, merged May 8, 2025)
- mtmd: Expose helper_decode_image_chunk (#13366, merged May 8, 2025)
- server : (webui) fix a very small misalignment (#13387, merged May 8, 2025)
- server : (webui) revamp the input area, plus many small UI improvements (#13365, merged May 8, 2025)
- convert : support rope_scaling type and rope_type (#13349, merged May 8, 2025)
- mtmd: Fix the calculation of n_tokens for smolvlm (#13381, merged May 8, 2025)
- context : allow cache-less context for embeddings (#13108, merged May 8, 2025)
- context : remove logits_all flag (#13284, merged May 8, 2025)
- ci : move release workflow to a separate file (#13362, merged May 8, 2025)
- llama : print size and type of overridden tensors (#13364, merged May 8, 2025)
- sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343, merged May 8, 2025)
25 Pull requests opened by 21 people
- gguf-py: Optimize `GGUFReader` read-only mode performance (#13378, opened May 8, 2025)
- musa: restore MUSA graph settings in CMakeLists.txt (#13382, opened May 8, 2025)
- sycl: simplify bin_bcast_kernel (#13383, opened May 8, 2025)
- arg : add model catalog (#13385, opened May 8, 2025)
- grammar: handle misplaced special regex chars [*+?] (#13391, opened May 8, 2025)
- server : PoC implementation of "interim" server (#13400, opened May 9, 2025)
- Update README.md for using llama.cpp in Microsoft Word locally (#13401, opened May 9, 2025)
- Break down main function in llama-server (#13425, opened May 10, 2025)
- Webui dynamic config (#13429, opened May 10, 2025)
- llama: Add configuration presets for chat and reranking servers (#13462, opened May 12, 2025)
- Support Seed-Coder chat template (#13472, opened May 12, 2025)
- docker : enable RPC for docker images (#13474, opened May 12, 2025)
- [SYCL] Overcoming workaround for mmap() allocation on Windows (#13482, opened May 12, 2025)
- feat(server): Add tool call support to WebUI (LLama Server) (#13501, opened May 13, 2025)
- convert: Swap GLM4 EOS / EOT token (#13505, opened May 13, 2025)
- webui: Add editing assistant messages (#11849) (#13522, opened May 14, 2025)
- cuda: set cuda compiler path (#13527) (#13528, opened May 14, 2025)
- MLA + FA now only uses K-cache - 47% saving on KV-cache size (only for use with #13435 for now) (#13529, opened May 14, 2025)
- ci : upgraded oneAPI version in SYCL workflows and dockerfile (#13532, opened May 14, 2025)
- sycl: disable reorder for sycl mulmat (#13536, opened May 14, 2025)
- fix: proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540, opened May 14, 2025)
- Fix build on OpenBSD (#13541, opened May 14, 2025)
- sycl : reviewing the backend documentation (#13544, opened May 14, 2025)
- Granite Four (#13550, opened May 14, 2025)
- webui : improve accessibility for visually impaired people (#13551, opened May 14, 2025)
66 Issues closed by 20 people
- Eval bug: Jinja not replacing `date_string` (#12729, closed May 15, 2025)
- Misc. bug: Llama-Quantize.exe broken on win11 since b5298, but works on b5215 and earlier (#13518, closed May 14, 2025)
- Eval bug: Segmentation fault when using llama-quantize (#13380, closed May 14, 2025)
- server: Describing pictures with multi models seems to crash the model (#13480, closed May 14, 2025)
- Question regarding the quantization dimension of the weight such as Q4_K format (#13377, closed May 14, 2025)
- Eval bug: Qwen3 30B adds spaces to end of each line (#13508, closed May 14, 2025)
- Compile bug: compile cuda backend error (#13527, closed May 14, 2025)
- Compile bug: cuda backend compile error (#12893, closed May 14, 2025)
- Misc. bug: Compute pipeline creation failed when using Flash Attention on macOS/Vulkan (#13450, closed May 14, 2025)
- csm : implement Sesame-based conversation example (#12392, closed May 14, 2025)
- Eval bug: llama-qwen2vl-cli --log-disable disables the response rather than the log (#12407, closed May 14, 2025)
- Misc. bug: Gibberish output on AMD Ryzen 9 8945HS w/ Radeon 780M Graphics since commit 3d82dbcbce2c (#12657, closed May 14, 2025)
- Misc. bug: since b4800 llama-cli does not prompt and llama-bench shows no results (#13452, closed May 13, 2025)
- What is the partial sum in `block_q8_1_mmq`, is it for reducing the quantization error during MMA? (#13504, closed May 13, 2025)
- Misc. bug: can't convert finetuned gemma3 model (#13490, closed May 13, 2025)
- Eval bug: Phi-4 mini in iOS with xcframework (#12232, closed May 13, 2025)
- Feature Request: convert_hf_to_gguf.py to support model type Qwen2_5_VLForConditionalGeneration (#12642, closed May 13, 2025)
- GGML_ASSERT(cur_p->size > 0) failed, or gibberish on DeepSeek V3 0324 (Q2_K_XL), CUDA + CPU (#13461, closed May 12, 2025)
- Compile bug: nvcc fatal : Unsupported gpu architecture 'compute_120' (#13271, closed May 12, 2025)
- Eval bug: Qwen2.5-vl crashes during image recognition on an AMD GPU (resolution 1242*881) (#13445, closed May 12, 2025)
- Segfault when submitting image to ggml-org/Qwen2.5-VL-7B-Instruct-GGUF (#13467, closed May 12, 2025)
- Misc. bug: crashes when calling `llama_state_get_size` on a reranking model (#13463, closed May 12, 2025)
- Tool call errors with `Expected 'content' to be a string or an array` (#13471, closed May 12, 2025)
- Misc. bug: rpc-server crash without cache (#13185, closed May 12, 2025)
- Compile bug: SYCL backend build fail on debug config (#12602, closed May 12, 2025)
- Misc. bug: (#12623, closed May 12, 2025)
- Eval bug: mmvq.cu:519: GGML_ASSERT(!src0->view_src) failed (#13437, closed May 11, 2025)
- Feature Request: Allow disabling `offload_op` for backends by user (#13241, closed May 11, 2025)
- Compile bug: MinGW32_64 Vulkan Shader (#13419, closed May 11, 2025)
- Eval bug: run failed when running a LoRA adapter (not merged) on Android (#12592, closed May 11, 2025)
- [New Bitnet Model Support Request] Deepgrove model Bonsai 0.5B - Add Channel Scales (#12598, closed May 11, 2025)
- Misc. bug: Data check in examples/gguf (#12617, closed May 11, 2025)
- Eval bug: b5335 breaks flash attention on 4070 (#13430, closed May 10, 2025)
- ByteDance-Seed/Seed-Coder unsupported? (#13421, closed May 10, 2025)
- Eval bug: mtmd in server mode crashes on too big an image (#13414, closed May 10, 2025)
- Update server documentation with new mmproj configuration options (#13431, closed May 10, 2025)
- Misc. bug: Intel container images keep getting `No space left on device` during CI Build (#13052, closed May 10, 2025)
- Misc. bug: [SYCL] Unexpected "setvars.sh has already been run" warning (#13333, closed May 10, 2025)
- Eval bug: the swiftui example keeps saying the same thing (#12558, closed May 10, 2025)
- Misc. bug: performance drop with 2x SYCL GPUs (#12575, closed May 10, 2025)
- -ngl to load the last n layers to GPU (#12577, closed May 10, 2025)
- Compile bug: vulkan-shaders-gen hangs when built with address sanitizers (#12581, closed May 10, 2025)
- Qwen2.5-vl support and conversion? (#12584, closed May 10, 2025)
- Eval bug: allocating 114296.55 MiB on device 0: cudaMalloc failed: out of memory (#12586, closed May 10, 2025)
- server: Bring back multimodal support (#8010, closed May 9, 2025)
- server : add support for file upload to the Web UI (#11611, closed May 9, 2025)
- Compile bug: Build breaks with llguidance (#13412, closed May 9, 2025)
- `CUDA error: invalid configuration argument` for MoEs - `--ubatch-size 8192` exceeds `INT_MAX` (#13376, closed May 9, 2025)
- Eval bug: mtmd Qwen2.5VL 7B not seeing an image as expected (#13394, closed May 9, 2025)
- Feature Request: Prefix assistant answer (#11536, closed May 9, 2025)
- Misc. bug: auto scroll doesn't work in WebUI (#12362, closed May 9, 2025)
- Eval bug: inference of 32B eats too much memory on ROCm HIP (5x AMD Radeon Instinct Mi50 (gfx906)) (#12369, closed May 9, 2025)
- Feature Request: allow mmap to take advantage of the hugepage feature, which has a 10x speedup (#12444, closed May 9, 2025)
- Misc. bug: Flash attention on Vulkan (#12526, closed May 9, 2025)
- Eval bug: cannot convert Qwen2.5-VL-7B-Instruct (#12534, closed May 9, 2025)
- Eval bug: crash when pooling_type == LLAMA_POOLING_TYPE_MEAN (#12543, closed May 9, 2025)
- Misc. bug: vulkan: performance regression after fd123cfead49eb32e386e26b8ef7a6d41554dda5 (#12553, closed May 9, 2025)
- Eval bug: Using llama-llava-clip-quantize-cli under the CUDA backend crashes (#12564, closed May 9, 2025)
- Misc. bug: The following tests FAILED: 23 - test-arg-parser (Subprocess aborted) main (#13371, closed May 8, 2025)
38 Issues opened by 32 people
- Misc. bug: missing messages in JSON export via llama-server web UI (#13552, opened May 14, 2025)
- Misc. bug: Potential out of bounds in rerank (#13549, opened May 14, 2025)
- Misc. bug: -sm row results in gibberish output on HIP (ROCm 6.3.3) (#13545, opened May 14, 2025)
- Eval bug: nomic-embed-text-v2-moe GGML_ASSERT(pc_type == ...) failed (#13534, opened May 14, 2025)
- webui: Make the Web UI more accessible for blind users (#13531, opened May 14, 2025)
- tutorials : list for llama.cpp (#13523, opened May 14, 2025)
- Research: How to integrate VITA 1.5 for multi-modal GGUF deployment? (#13520, opened May 14, 2025)
- Eval bug: bizarre Jinja bug when trying to fix Qwen3 tool calling (#13516, opened May 13, 2025)
- Feature Request: Apple just released Fast-VLM, a very promising set of multimodal language models (#13512, opened May 13, 2025)
- Misc. bug: llama-cli stopped starting in release b4191 (c9b00a7) (#13498, opened May 13, 2025)
- kv-cache : improve defrag logic (#13497, opened May 13, 2025)
- Eval bug: BGE-M3 embedding model is not accessible (#13494, opened May 13, 2025)
- Misc. bug: On Windows, llama-bench does not recognize the -ot or --override-tensors parameter (#13491, opened May 13, 2025)
- Eval bug: gpt2 model finetuned with LoRA and saved to GGUF does not work properly (#13489, opened May 12, 2025)
- Partial offload support for training (#13486, opened May 12, 2025)
- LoRA training example (#13485, opened May 12, 2025)
- web UI either doesn't scroll or jumps to the wrong element (#13479, opened May 12, 2025)
- Eval bug: cannot run llama 405b on CPU (#13475, opened May 12, 2025)
- Why is mul_mat in ggml slower than in llama.cpp? (#13473, opened May 12, 2025)
- How to start a gemma3 multimodal model service using llama-server (#13465, opened May 12, 2025)
- Phi-4-mini reasoning crash (Vulkan) (#13464, opened May 12, 2025)
- Feature Request: add draft model in llama-bench and more (#13456, opened May 11, 2025)
- Eval bug: llama-mtmd-cli doesn't support system prompts (#13454, opened May 11, 2025)
- Misc. bug: Illegal CUDA memory access in ggml_backend_cuda_cpy_tensor_async (#13449, opened May 11, 2025)
- Drop support for sentencepiece (#13448, opened May 11, 2025)
- Compile bug: ld returned 1 exit status (file bigger than 2 GB) (#13446, opened May 11, 2025)
- Eval bug: llama-speculative core dump with Qwen3, GGML_ASSERT(batch.n_tokens > 0) failed (#13433, opened May 10, 2025)
- Misc. bug: The web UI of llama-server is not displaying correctly (#13428, opened May 10, 2025)
- Eval bug: Qwen3-30B-A3B-Q4_K_M slows down when using the \no_think mode (#13427, opened May 10, 2025)
- Differential mode for llama-bench + plotting code (#13408, opened May 9, 2025)
- Misc. bug: llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed (#13405, opened May 9, 2025)
- Eval bug: llama-cli, Qwen3 jinja template will break CLI multiturn conversation (#13404, opened May 9, 2025)
- Eval bug: llama-cli, spurious token added to assistant response (#13402, opened May 9, 2025)
- Misc. bug: Model not loaded on Android with NDK (#13399, opened May 9, 2025)
- Misc. bug: invalid regex grammar causes segmentation violation (#13390, opened May 8, 2025)
- Compile bug: ninja: build stopped: subcommand failed (#13375, opened May 8, 2025)
- CI: editorconfig-checker appears to have made a false positive judgment on "Trailing whitespace" (#13374, opened May 8, 2025)
- Token Generation Speed Decline with GGUF Models on M3 Ultra (#13373, opened May 8, 2025)
90 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- kv-cache : add SWA support (#13194, commented on May 14, 2025 • 13 new comments)
- sycl: use oneDNN for matrices multiplication (#12972, commented on May 14, 2025 • 10 new comments)
- sycl : Implemented reorder Q4_K mmvq (#13109, commented on May 15, 2025 • 6 new comments)
- cuda: refactored ssm_scan and use CUB (#13291, commented on May 11, 2025 • 5 new comments)
- [CANN] Support OP MUL_MAT_ID (#13042, commented on May 14, 2025 • 4 new comments)
- feat: First pass at llama_kv_cache_hybrid (#13276, commented on May 14, 2025 • 4 new comments)
- Introduce New Lookup-Table(LUT)-Based Matrix Multiplication Method (TMAC) (#13206, commented on May 14, 2025 • 3 new comments)
- [CANN] Simplify the environment variable setting for GGML_CANN_MEM_POOL and GGML_CANN_ASYNC_MODE (#13104, commented on May 14, 2025 • 3 new comments)
- Fix Vulkan glslc invocation command lines (#13289, commented on May 8, 2025 • 2 new comments)
- llama : try loading tensors with pre-computed hashes (#13106, commented on May 12, 2025 • 2 new comments)
- common: add default reranker presets (#13352, commented on May 9, 2025 • 1 new comment)
- Compile bug: ggml-cuda/opt-step-adamw.cu error: identifier "__Poly8x8_t" is undefined on Jetson Orin AGX (#12826, commented on May 15, 2025 • 0 new comments)
- Eval bug: got exception: {"code":500,"message":"Unsupported param: echo","type":"server_error"} (#12591, commented on May 15, 2025 • 0 new comments)
- Misc. bug: ALL gguf models fail to run (no log, docker exit code 139) (#12205, commented on May 15, 2025 • 0 new comments)
- Feature Request: resize an existing context (#11577, commented on May 15, 2025 • 0 new comments)
- llama : initial Mamba-2 support (#9126, commented on May 14, 2025 • 0 new comments)
- [Draft] Tensor Parallel support to llama.cpp (#9648, commented on May 14, 2025 • 0 new comments)
- Allow user to compile with any cuda version using github actions (#10928, commented on May 12, 2025 • 0 new comments)
- tool-call: add support for tool-calls using Model Context Protocol (#11556, commented on May 13, 2025 • 0 new comments)
- [WIP] backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs (#12063, commented on May 13, 2025 • 0 new comments)
- CUDA: implementation of mul_mat_id (#12859, commented on May 15, 2025 • 0 new comments)
- what *tool/framework* to use if testing performance of .gguf models (#12901, commented on May 15, 2025 • 0 new comments)
- Misc. bug: llama-bench --tensor-split handling is broken (#12917, commented on May 15, 2025 • 0 new comments)
- Compile bug: macro "DECL_FATTN_MMA_F16_CASE" requires 3 arguments, but only 2 given (#12921, commented on May 15, 2025 • 0 new comments)
- Misc. bug: llama-server "terminate called after throwing an instance of 'std::runtime_error'" (#12939, commented on May 15, 2025 • 0 new comments)
- Model conversion issue (#12941, commented on May 15, 2025 • 0 new comments)
- Feature Request: Granite 4 Support (#13275, commented on May 14, 2025 • 0 new comments)
- Eval bug: Qwen3, failed to parse chat template (jinja) (#13178, commented on May 14, 2025 • 0 new comments)
- CUDA: update build CTK version to 12.8 (#13360, commented on May 14, 2025 • 0 new comments)
- SYCL: Fix test-backend-ops crashes with SYCL-Graph (#13357, commented on May 12, 2025 • 0 new comments)
- [Perf] [CPU] eliminate redundant memory access in group query attention (#13319, commented on May 12, 2025 • 0 new comments)
- Added dynamic context size, useful for servers running llama models as a service (#13295, commented on May 11, 2025 • 0 new comments)
- Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196, commented on May 12, 2025 • 0 new comments)
- [CANN] Update CANN model support status (#13162, commented on May 14, 2025 • 0 new comments)
- musa: add support for muBLAS and MMA (#13149, commented on May 8, 2025 • 0 new comments)
- quantize: Handle user-defined pruning of whole layers (blocks) (#13037, commented on May 11, 2025 • 0 new comments)
- convert : write tensors in parallel (#12837, commented on May 8, 2025 • 0 new comments)
- opencl: fix a couple of crashes (#12795, commented on May 14, 2025 • 0 new comments)
- Update llama-quant.cpp llama_tensor_get_type with DeepSeek-friendly modifications (#12727, commented on May 8, 2025 • 0 new comments)
- imatrix: add option to display importance score statistics for a given imatrix file (#12718, commented on May 12, 2025 • 0 new comments)
- tts : implement sesame CSM + Mimi decoder (#12648, commented on May 12, 2025 • 0 new comments)
- opencl: Add support for multiple devices (#12622, commented on May 14, 2025 • 0 new comments)
- `server`: streaming of tool calls and thoughts when `--jinja` is on (#12379, commented on May 14, 2025 • 0 new comments)
- PR: Refine ggml-hexagon backend (Qualcomm Hexagon NPU backend) for latest ggml, whisper.cpp, llama.cpp (#12326, commented on May 11, 2025 • 0 new comments)
- vulkan: optimization proposals for coopmat1 mul_mm (#12260, commented on May 10, 2025 • 0 new comments)
- Misc. bug: llama-quantize clobbers input file + crashes when output file matches (#12753, commented on May 14, 2025 • 0 new comments)
- Compile bug: llama.cpp-master/ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp:80:54: error: '_mm256_set_m128i' was not declared in this scope (#11385, commented on May 11, 2025 • 0 new comments)
- Prompt eval is 5x slower than in Ollama and maxes out the CPU (#12237, commented on May 11, 2025 • 0 new comments)
- Feature Request: Slim Attention (lossless 2x reduction in KV cache size) (#12359, commented on May 11, 2025 • 0 new comments)
- Eval bug: Accuracy drops when converting Qwen2_VL_7B_Instruct to gguf (#12538, commented on May 11, 2025 • 0 new comments)
- Misc. bug: convert_hf_to_gguf.py fails to convert the model of architecture T5ForConditionalGeneration (#12862, commented on May 11, 2025 • 0 new comments)
- Eval bug: Assertion _LIBCPP_ASSERT_VALID_ELEMENT_ACCESS while using a particular model (#12877, commented on May 11, 2025 • 0 new comments)
- Eval bug: add support for https://huggingface.co/ (#12884, commented on May 11, 2025 • 0 new comments)
- Eval bug: moonshotai/Moonlight-16B-A3B-Instruct (#12880, commented on May 11, 2025 • 0 new comments)
- Misc. bug: llama-server webui overriding command line parameters (#13277, commented on May 10, 2025 • 0 new comments)
- Eval bug: Regex (#13347, commented on May 10, 2025 • 0 new comments)
- Compile bug: Build failure for Intel oneMKL on Windows (#12478, commented on May 10, 2025 • 0 new comments)
- Add support for gemma 3 in the server? (#12762, commented on May 10, 2025 • 0 new comments)
- CUDA performance bug when two cards are visible and only one is used (#12838, commented on May 10, 2025 • 0 new comments)
- Eval bug: llama-server can only load 27 layers into Vulkan, but llama-run can load 33 layers for no apparent reason (#12840, commented on May 10, 2025 • 0 new comments)
- Eval bug: llama_model_load: error loading model hyperparameters: key not found in model: llama.context_length (#12857, commented on May 10, 2025 • 0 new comments)
- Compile bug: compiling llama.cpp for HIP (elementaryOS 8/ubuntu 24.04, rocm 6.4.0, gfx1100) using the installation guide fails (#13340, commented on May 9, 2025 • 0 new comments)
- Feature Request: add jina embeddings model, available to convert to gguf (#12327, commented on May 9, 2025 • 0 new comments)
- OpenCL: Performance comparison depending on gpu_offloads (#12810, commented on May 9, 2025 • 0 new comments)
- Llama 4 convert_hf_to_gguf.py tokenizer error (#12819, commented on May 9, 2025 • 0 new comments)
- Misc. bug: Qwen 3.0 "enable_thinking" parameter not working (#13160, commented on May 8, 2025 • 0 new comments)
- Eval bug: Qwen3 Q4_0 not working with SYCL (#13163, commented on May 8, 2025 • 0 new comments)
- changelog : `libllama` API (#9289, commented on May 8, 2025 • 0 new comments)
- Misc. bug: Inconsistent Vulkan segfault (#10528, commented on May 14, 2025 • 0 new comments)
- (Discussion) Improve usability of llama-server (#13367, commented on May 14, 2025 • 0 new comments)
- Feature Request: Qwen2.5-Omni (#12673, commented on May 14, 2025 • 0 new comments)
- Eval bug: ggml_vulkan: Device memory allocation of size N failed with ub > 4096 and c > 4096 and b > 4096 (#12817, commented on May 14, 2025 • 0 new comments)
- Eval bug: ROCm error: CUBLAS_STATUS_INTERNAL_ERROR (#12878, commented on May 14, 2025 • 0 new comments)
- Misc. bug: gguf-my-repo doesn't work - [Errno 2] No such file or directory: './llama.cpp/llama-quantize' (#12925, commented on May 14, 2025 • 0 new comments)
- Misc. bug: llama-server does not read the "--keep" param that the user passes on the CLI (#12927, commented on May 14, 2025 • 0 new comments)
- Eval bug: Can't run Qwen3-32B Q4_K_XL (#13298, commented on May 13, 2025 • 0 new comments)
- Move gguf fuzzers to the llama.cpp repository (#11514, commented on May 13, 2025 • 0 new comments)
- Feature Request: moondream2 vlm support in mtmd (#13332, commented on May 13, 2025 • 0 new comments)
- Feature Request: Add support of convert.py for model Qwen2.5-Omni-7B (#12641, commented on May 13, 2025 • 0 new comments)
- Feature Request: XiaomiMiMo/MiMo-7B-RL (#13218, commented on May 13, 2025 • 0 new comments)
- Qwen3-8B and other models generate garbage output / repeat tokens (GGGGGG...) in llama.cpp via LM Studio Vulkan backend (#13310, commented on May 13, 2025 • 0 new comments)
- Feature Request: Free up VRAM when llama-server not in use (#11703, commented on May 13, 2025 • 0 new comments)
- Feature Request: Qwen 2.5 VL (#11483, commented on May 12, 2025 • 0 new comments)
- Feature Request: NUMA-aware MoE Expert Allocation for Improved Performance (#11333, commented on May 12, 2025 • 0 new comments)
- Eval bug: Crash in trim method (#12710, commented on May 12, 2025 • 0 new comments)
- multiple_choice_score : task 17 does not fit in the context window (#12905, commented on May 12, 2025 • 0 new comments)
- How to use *chat_template* with .gguf models? (tokenizer_name not implemented) (#12897, commented on May 12, 2025 • 0 new comments)
- Misc. bug: Completions hang after CUDA error, but health endpoint reports all OK (#13281, commented on May 11, 2025 • 0 new comments)
- changelog : `llama-server` REST API (#9291, commented on May 11, 2025 • 0 new comments)
- Feature Request: Support for Qwen2-VL (#9246, commented on May 11, 2025 • 0 new comments)