Insights: ggml-org/llama.cpp
Overview
66 Releases published by 1 person
- May 8, 2025: b5308, b5309, b5310, b5311, b5313, b5315, b5317, b5318
- May 9, 2025: b5320, b5321, b5322, b5323, b5324, b5325, b5326, b5327, b5328, b5329, b5330, b5331, b5332
- May 10, 2025: b5333, b5334, b5335, b5336, b5338, b5340, b5341, b5342
- May 11, 2025: b5344, b5345, b5346, b5347, b5349, b5350
- May 12, 2025: b5351, b5352, b5353, b5354, b5355, b5356, b5357, b5358, b5359, b5360, b5361
- May 13, 2025: b5363, b5365, b5366, b5367, b5368, b5369, b5370, b5371
- May 14, 2025: b5372, b5377, b5378, b5379, b5380, b5381, b5382, b5384, b5385, b5387, b5388
- May 15, 2025: b5390
84 Pull requests merged by 30 people
- bench : handle decode errors (#13548, merged May 15, 2025)
- server : inject date_string in llama 3.x template + fix date for firefunction v2 (#12802, merged May 15, 2025)
- kv-cache : fix out-of-bounds view during reserve graph (#13547, merged May 14, 2025)
- arm64: optimize q6_k_q8_k kernel with i8mm (#13519, merged May 14, 2025)
- common : add partial regex support (#12808, merged May 14, 2025)
- editorconfig : fix trailing whitespace from #13542 (#13546, merged May 14, 2025)
- fix: crash when calling `llama_state_get_size` on a context without a KV cache (#13542, merged May 14, 2025)
- CUDA: fix crash on large batch size for quant. MoE (#13537, merged May 14, 2025)
- llama : fix quantize with dl backends (#13539, merged May 14, 2025)
- CUDA: faster Deepseek FA, add Turing support (#13435, merged May 14, 2025)
- Granite MoE NoPE fix (#13538, merged May 14, 2025)
- server : passthrough the /models endpoint during loading (#13535, merged May 14, 2025)
- server : fix cache_tokens bug with no cache_prompt (#13533, merged May 14, 2025)
- cmake: simplify vulkan shader test logic (#13263, merged May 14, 2025)
- vulkan: KHR_coopmat flash attention (#13506, merged May 14, 2025)
- webui : use fflate for more deterministic gzip compress (#13525, merged May 14, 2025)
- webui: Allow pasting file from clipboard (#13526, merged May 14, 2025)
- docs: Update link to ggml-org in multimodal.md (#13513, merged May 14, 2025)
- scripts : fix compare-llama-bench.py show parameter (#13514, merged May 14, 2025)
- vulkan: workaround FA compile failures on macos (#13517, merged May 14, 2025)
- quantize: improve pattern matching for allowed tensors (#13033, merged May 13, 2025)
- clip : clip.h becomes private API (⚠️ breaking change) (#13510, merged May 13, 2025)
- metal : use FA-vec kernel up to batch size 20 (#13496, merged May 13, 2025)
- metal : optimize multi-sequence FA vec kernel (#13493, merged May 13, 2025)
- ggml-cpu: Update KleidiAI to v1.6 and fix include directives (#13509, merged May 13, 2025)
- batched-bench : fix pp batch contents (#13492, merged May 13, 2025)
- mtmd : remove libllava, remove clip-quantize-cli (⚠️ breaking change) (#13460, merged May 13, 2025)
- scripts : support arbitrary input file formats in compare-llama-bench.py (#13455, merged May 13, 2025)
- Model: Granite MoE shared (#13269, merged May 13, 2025)
- sync : ggml (#13502, merged May 13, 2025)
- llama-bench : add defrag-thold, check for invalid ranges (#13487, merged May 12, 2025)
- opencl: remove unnecessary assert for `add` (#13257, merged May 12, 2025)
- clip : cap max image size 1024 for qwen vl model (#13478, merged May 12, 2025)
- llama/ggml: add LLM training support (#10544, merged May 12, 2025)
- context : fix state io for memory-less contexts (#13470, merged May 12, 2025)
- Allow content null for tool call (#13477, merged May 12, 2025)
- llama-bench : accept ranges for integer parameters (#13410, merged May 12, 2025)
- ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel (#13053, merged May 12, 2025)
- CUDA: fix misaligned synchronization in FA (#13469, merged May 12, 2025)
- ggml : add mrope kernel for metal (#13457, merged May 12, 2025)
- sycl: enable dpcpp nightly builds with oneMKL and oneDNN (#13406, merged May 12, 2025)
- mtmd : use RMS norm for InternVL 3 38B and 78B mmproj (#13459, merged May 11, 2025)
- tools : fix invalid free() (#13436, merged May 11, 2025)
- scripts : exit compare-llama-bench.py gracefully when there's nothing to compare (#13451, merged May 11, 2025)
- CUDA: fix crash with partial offloading of MoE (#13439, merged May 11, 2025)
- Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386, merged May 11, 2025)
- mtmd : support InternVL 3 38B and 78B mmproj (#13443, merged May 11, 2025)
- mtmd : move helpers to dedicated file (#13442, merged May 11, 2025)
- readme: Fix typo in InternVL model name (#13440, merged May 10, 2025)
- CUDA: fix race conditions in FlashAttention kernels (#13438, merged May 10, 2025)
- vocab : add ByteDance-Seed/Seed-Coder (#13423, merged May 10, 2025)
- mtmd : add hard limit on image resolution for qwen2vl / qwen2.5vl (#13434, merged May 10, 2025)
- server : update docs (#13432, merged May 10, 2025)
- llguidance : init tokenizer slices (#13424, merged May 10, 2025)
- ci: `free_disk_space` flag enabled for intel variant (#13426, merged May 10, 2025)
- mtmd : support InternVL 2.5 and 3 (#13422, merged May 10, 2025)
- CUDA: fix FlashAttention on Turing (#13415, merged May 10, 2025)
- arg : add env var to control mmproj (#13416, merged May 10, 2025)
- vulkan: scalar flash attention implementation (#13324, merged May 10, 2025)
- Use tagged version of llguidance that does not break the build (#13413, merged May 9, 2025)
- server : vision support via libmtmd (#12898, merged May 9, 2025)
- sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (#12858, merged May 9, 2025)
- metal : optimize MoE for large batches (#13388, merged May 9, 2025)
- CUDA: FA support for Deepseek (Ampere or newer) (#13306, merged May 9, 2025)
- llama : do not crash if there is no CPU backend (#13395, merged May 9, 2025)
- CUDA: fix crash on large batch size for MoE models (#13384, merged May 9, 2025)
- Add --parse-special for enabling parsing of special tokens in imatrix calculation (#13389, merged May 9, 2025)
- llama-run: add support for downloading models from ModelScope (#13370, merged May 9, 2025)
- mtmd : fix batch_view for m-rope (#13397, merged May 9, 2025)
- llama : one-off chat template fix for Mistral-Small-2503 (#13398, merged May 9, 2025)
- rpc : add rpc_msg_set_tensor_hash_req (#13353, merged May 9, 2025)
- vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326, merged May 9, 2025)
- server : (webui) rename has_multimodal --> modalities (#13393, merged May 9, 2025)
- ci : limit write permission to only the release step + fixes (#13392, merged May 8, 2025)
- mtmd: Expose helper_decode_image_chunk (#13366, merged May 8, 2025)
- server : (webui) fix a very small misalignment (#13387, merged May 8, 2025)
- server : (webui) revamp the input area, plus many small UI improvements (#13365, merged May 8, 2025)
- convert : support rope_scaling type and rope_type (#13349, merged May 8, 2025)
- mtmd: Fix the calculation of n_tokens for smolvlm (#13381, merged May 8, 2025)
- context : allow cache-less context for embeddings (#13108, merged May 8, 2025)
- context : remove logits_all flag (#13284, merged May 8, 2025)
- ci : move release workflow to a separate file (#13362, merged May 8, 2025)
- llama : print size and type of overridden tensors (#13364, merged May 8, 2025)
- sycl: addressing non-contiguous src1 mul_mats (nc and batched) (#13343, merged May 8, 2025)
25 Pull requests opened by 21 people
- gguf-py: Optimize `GGUFReader` read-only mode performance (#13378, opened May 8, 2025)
- musa: restore MUSA graph settings in CMakeLists.txt (#13382, opened May 8, 2025)
- sycl: simplify bin_bcast_kernel (#13383, opened May 8, 2025)
- arg : add model catalog (#13385, opened May 8, 2025)
- grammar: handle misplaced special regex chars [*+?] (#13391, opened May 8, 2025)
- server : PoC implementation of "interim" server (#13400, opened May 9, 2025)
- Update README.md for using llama.cpp in Microsoft Word locally (#13401, opened May 9, 2025)
- Break down main function in llama-server (#13425, opened May 10, 2025)
- Webui dynamic config (#13429, opened May 10, 2025)
- llama: Add configuration presets for chat and reranking servers (#13462, opened May 12, 2025)
- Support Seed-Coder chat template (#13472, opened May 12, 2025)
- docker : enable RPC for docker images (#13474, opened May 12, 2025)
- [SYCL] Overcoming workaround for mmap() allocation on Windows (#13482, opened May 12, 2025)
- feat(server): Add tool call support to WebUI (LLama Server) (#13501, opened May 13, 2025)
- convert: Swap GLM4 EOS / EOT token (#13505, opened May 13, 2025)
- webui: Add editing assistant messages (#11849) (#13522, opened May 14, 2025)
- cuda: set cuda compiler path (#13527) (#13528, opened May 14, 2025)
- MLA + FA now only uses K-cache - 47% saving on KV-cache size (only for use with #13435 for now) (#13529, opened May 14, 2025)
- ci : upgraded oneAPI version in SYCL workflows and dockerfile (#13532, opened May 14, 2025)
- sycl: disable reorder for sycl mulmat (#13536, opened May 14, 2025)
- fix: proper error handling for missing elements in messages array (OpenAI compatible backend) (#13540, opened May 14, 2025)
- Fix build on OpenBSD (#13541, opened May 14, 2025)
- sycl : reviewing the backend documentation (#13544, opened May 14, 2025)
- Granite Four (#13550, opened May 14, 2025)
- webui : improve accessibility for visually impaired people (#13551, opened May 14, 2025)
66 Issues closed by 20 people
- Eval bug: Jinja not replacing `date_string` (#12729, closed May 15, 2025)
- Misc. bug: Llama-Quantize.exe broken on win11 since b5298, but works on b5215 and earlier (#13518, closed May 14, 2025)
- Eval bug: Segmentation fault when using llama-quantize (#13380, closed May 14, 2025)
- server: Describing pictures with multi models seems to crash the model (#13480, closed May 14, 2025)
- Question regarding the quantization dimension of the weight such as Q4_K format (#13377, closed May 14, 2025)
- Eval bug: Qwen3 30B adds spaces to end of each line (#13508, closed May 14, 2025)
- Compile bug: compile cuda backend error (#13527, closed May 14, 2025)
- Compile bug: cuda backend compile error (#12893, closed May 14, 2025)
- Misc. bug: Compute pipeline creation failed when using Flash Attention on macOS/Vulkan (#13450, closed May 14, 2025)
- csm : implement Sesame-based conversation example (#12392, closed May 14, 2025)
- Eval bug: llama-qwen2vl-cli --log-disable disables the response rather than the log (#12407, closed May 14, 2025)
- Misc. bug: Gibberish output on AMD Ryzen 9 8945HS w/ Radeon 780M Graphics since commit 3d82dbcbce2c (#12657, closed May 14, 2025)
- Misc. bug: since b4800 llama-cli does not prompt and llama-bench shows no results (#13452, closed May 13, 2025)
- What is the partial sum in `block_q8_1_mmq`, is it for reducing the quantization error during MMA? (#13504, closed May 13, 2025)
- Misc. bug: can't convert finetuned gemma3 model (#13490, closed May 13, 2025)
- Eval bug: Phi-4 mini in iOS with xcframework (#12232, closed May 13, 2025)
- Feature Request: convert_hf_to_gguf.py to support model type Qwen2_5_VLForConditionalGeneration (#12642, closed May 13, 2025)
- GGML_ASSERT(cur_p->size > 0) failed, or gibberish on DeepSeek V3 0324 (Q2_K_XL), CUDA + CPU (#13461, closed May 12, 2025)
- Compile bug: nvcc fatal : Unsupported gpu architecture 'compute_120' (#13271, closed May 12, 2025)
- Eval bug: Qwen2.5-vl crashes during image recognition on an AMD GPU (resolution 1242*881) (#13445, closed May 12, 2025)
- Segfault when submitting image to ggml-org/Qwen2.5-VL-7B-Instruct-GGUF (#13467, closed May 12, 2025)
- Misc. bug: crashes when calling `llama_state_get_size` on a reranking model (#13463, closed May 12, 2025)
- Tool call errors with `Expected 'content' to be a string or an array` (#13471, closed May 12, 2025)
- Misc. bug: rpc-server crash without cache (#13185, closed May 12, 2025)
- Compile bug: SYCL backend build fail on debug config (#12602, closed May 12, 2025)
- Misc. bug: (#12623, closed May 12, 2025)
- Eval bug: mmvq.cu:519: GGML_ASSERT(!src0->view_src) failed (#13437, closed May 11, 2025)
- Feature Request: Allow disabling `offload_op` for backends by user (#13241, closed May 11, 2025)
- Compile bug: MinGW32_64 Vulkan Shader (#13419, closed May 11, 2025)
- Eval bug: run failed when running a LoRA adapter (not merged) on Android (#12592, closed May 11, 2025)
- [New Bitnet Model Support Request] Deepgrove model Bonsai 0.5B - Add Channel Scales (#12598, closed May 11, 2025)
- Misc. bug: Data check in examples/gguf (#12617, closed May 11, 2025)
- Eval bug: b5335 breaks flash attention on 4070 (#13430, closed May 10, 2025)
- ByteDance-Seed/Seed-Coder unsupported? (#13421, closed May 10, 2025)
- Eval bug: mtmd in server mode crashes on too big an image (#13414, closed May 10, 2025)
- Update server documentation with new mmproj configuration options (#13431, closed May 10, 2025)
- Misc. bug: Intel container images keep getting `No space left on device` during CI Build (#13052, closed May 10, 2025)
- Misc. bug: [SYCL] Unexpected "setvars.sh has already been run" warning (#13333, closed May 10, 2025)
- Eval bug: the swiftui example keeps saying the same thing (#12558, closed May 10, 2025)
- Misc. bug: performance drop with 2x SYCL GPUs (#12575, closed May 10, 2025)
- -ngl to load the last n layers to GPU (#12577, closed May 10, 2025)
- Compile bug: vulkan-shaders-gen hangs when built with address sanitizers (#12581, closed May 10, 2025)
- Qwen2.5-vl support and conversion? (#12584, closed May 10, 2025)
- Eval bug: allocating 114296.55 MiB on device 0: cudaMalloc failed: out of memory (#12586, closed May 10, 2025)
- server: Bring back multimodal support (#8010, closed May 9, 2025)
- server : add support for file upload to the Web UI (#11611, closed May 9, 2025)
- Compile bug: Build breaks with llguidance (#13412, closed May 9, 2025)
- `CUDA error: invalid configuration argument` for MoEs - `--ubatch-size 8192` exceeds `INT_MAX` (#13376, closed May 9, 2025)
- Eval bug: mtmd Qwen2.5VL 7B not seeing an image as expected (#13394, closed May 9, 2025)
- Feature Request: Prefix assistant answer (#11536, closed May 9, 2025)
- Misc. bug: auto scroll doesn't work in WebUI (#12362, closed May 9, 2025)
- Eval bug: inference of 32B eats too much memory on ROCm HIP (5x AMD Radeon Instinct Mi50 (gfx906)) (#12369, closed May 9, 2025)
- Feature Request: allow mmap to take advantage of the hugepage feature, which has a 10x speedup (#12444, closed May 9, 2025)
- Misc. bug: Flash attention on Vulkan (#12526, closed May 9, 2025)
- Eval bug: cannot convert Qwen2.5-VL-7B-Instruct (#12534, closed May 9, 2025)
- Eval bug: crash when pooling_type == LLAMA_POOLING_TYPE_MEAN (#12543, closed May 9, 2025)
- Misc. bug: vulkan: performance regression after fd123cfead49eb32e386e26b8ef7a6d41554dda5 (#12553, closed May 9, 2025)
- Eval bug: Using llama-llava-clip-quantize-cli under the CUDA backend crashes (#12564, closed May 9, 2025)
- Misc. bug: The following tests FAILED: 23 - test-arg-parser (Subprocess aborted) main (#13371, closed May 8, 2025)
38 Issues opened by 32 people
- Misc. bug: missing messages in JSON export via llama-server web UI (#13552, opened May 14, 2025)
- Misc. bug: Potential out of bounds in rerank (#13549, opened May 14, 2025)
- Misc. bug: -sm row results in gibberish output on HIP (ROCm 6.3.3) (#13545, opened May 14, 2025)
- Eval bug: nomic-embed-text-v2-moe GGML_ASSERT(pc_type == ...) failed (#13534, opened May 14, 2025)
- webui: Make the Web UI more accessible for blind users (#13531, opened May 14, 2025)
- tutorials : list for llama.cpp (#13523, opened May 14, 2025)
- Research: How to integrate VITA 1.5 for multi-modal GGUF deployment? (#13520, opened May 14, 2025)
- Eval bug: bizarre Jinja bug when trying to fix Qwen3 tool calling (#13516, opened May 13, 2025)
- Feature Request: Apple just released Fast-VLM, a very promising set of multimodal language models (#13512, opened May 13, 2025)
- Misc. bug: llama-cli stopped starting in release b4191 (c9b00a7) (#13498, opened May 13, 2025)
- kv-cache : improve defrag logic (#13497, opened May 13, 2025)
- Eval bug: BGE-M3 embedding model is not accessible (#13494, opened May 13, 2025)
- Misc. bug: On Windows, llama-bench does not recognize the -ot or --override-tensors parameter (#13491, opened May 13, 2025)
- Eval bug: gpt2 model finetuned with LoRA and saved to GGUF does not work properly (#13489, opened May 12, 2025)
- Partial offload support for training (#13486, opened May 12, 2025)
- LoRA training example (#13485, opened May 12, 2025)
- web UI either doesn't scroll or jumps to the wrong element (#13479, opened May 12, 2025)
- Eval bug: cannot run llama 405b on CPU (#13475, opened May 12, 2025)
- Why is mul_mat in ggml slower than in llama.cpp? (#13473, opened May 12, 2025)
- How to start a gemma3 multimodal model service using llama-server (#13465, opened May 12, 2025)
- Phi-4-mini reasoning crash (Vulkan) (#13464, opened May 12, 2025)
- Feature Request: add draft model in llama-bench and more (#13456, opened May 11, 2025)
- Eval bug: llama-mtmd-cli doesn't support system prompts (#13454, opened May 11, 2025)
- Misc. bug: Illegal CUDA memory access in ggml_backend_cuda_cpy_tensor_async (#13449, opened May 11, 2025)
- Drop support for sentencepiece (#13448, opened May 11, 2025)
- Compile bug: ld returned 1 exit status (file bigger than 2 GB) (#13446, opened May 11, 2025)
- Eval bug: llama-speculative core dump with Qwen3, GGML_ASSERT(batch.n_tokens > 0) failed (#13433, opened May 10, 2025)
- Misc. bug: The web UI of llama-server is not displaying correctly (#13428, opened May 10, 2025)
- Eval bug: Qwen3-30B-A3B-Q4_K_M slows down when using the \no_think mode (#13427, opened May 10, 2025)
- Differential mode for llama-bench + plotting code (#13408, opened May 9, 2025)
- Misc. bug: llama-sampling.cpp:204: GGML_ASSERT(cur_p->size > 0) failed (#13405, opened May 9, 2025)
- Eval bug: llama-cli, Qwen3 jinja template will break CLI multiturn conversation (#13404, opened May 9, 2025)
- Eval bug: llama-cli, spurious token added to assistant response (#13402, opened May 9, 2025)
- Misc. bug: Model not loaded on Android with NDK (#13399, opened May 9, 2025)
- Misc. bug: invalid regex grammar causes segmentation violation (#13390, opened May 8, 2025)
- Compile bug: ninja: build stopped: subcommand failed (#13375, opened May 8, 2025)
- CI: editorconfig-checker appears to have made a false positive judgment on "Trailing whitespace" (#13374, opened May 8, 2025)
- Token Generation Speed Decline with GGUF Models on M3 Ultra (#13373, opened May 8, 2025)
90 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- kv-cache : add SWA support (#13194, commented on May 14, 2025 • 13 new comments)
- sycl: use oneDNN for matrices multiplication (#12972, commented on May 14, 2025 • 10 new comments)
- sycl : Implemented reorder Q4_K mmvq (#13109, commented on May 15, 2025 • 6 new comments)
- cuda: refactored ssm_scan and use CUB (#13291, commented on May 11, 2025 • 5 new comments)
- [CANN] Support OP MUL_MAT_ID (#13042, commented on May 14, 2025 • 4 new comments)
- feat: First pass at llama_kv_cache_hybrid (#13276, commented on May 14, 2025 • 4 new comments)
- Introduce New Lookup-Table(LUT)-Based Matrix Multiplication Method (TMAC) (#13206, commented on May 14, 2025 • 3 new comments)
- [CANN] Simplify the environment variable setting for GGML_CANN_MEM_POOL and GGML_CANN_ASYNC_MODE (#13104, commented on May 14, 2025 • 3 new comments)
- Fix Vulkan glslc invocation command lines (#13289, commented on May 8, 2025 • 2 new comments)
- llama : try loading tensors with pre-computed hashes (#13106, commented on May 12, 2025 • 2 new comments)
- common: add default reranker presets (#13352, commented on May 9, 2025 • 1 new comment)
- Compile bug: ggml-cuda/opt-step-adamw.cu error: identifier "__Poly8x8_t" is undefined on Jetson Orin AGX (#12826, commented on May 15, 2025 • 0 new comments)
- Eval bug: got exception: {"code":500,"message":"Unsupported param: echo","type":"server_error"} (#12591, commented on May 15, 2025 • 0 new comments)
- Misc. bug: ALL gguf models fail to run (no log, docker exit code 139) (#12205, commented on May 15, 2025 • 0 new comments)
- Feature Request: resize an existing context (#11577, commented on May 15, 2025 • 0 new comments)
- llama : initial Mamba-2 support (#9126, commented on May 14, 2025 • 0 new comments)
- [Draft] Tensor Parallel support to llama.cpp (#9648, commented on May 14, 2025 • 0 new comments)
- Allow user to compile with any cuda version using github actions (#10928, commented on May 12, 2025 • 0 new comments)
- tool-call: add support for tool-calls using Model Context Protocol (#11556, commented on May 13, 2025 • 0 new comments)
- [WIP] backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs (#12063, commented on May 13, 2025 • 0 new comments)
- CUDA: implementation of mul_mat_id (#12859, commented on May 15, 2025 • 0 new comments)
- what *tool/framework* to use if testing performance of .gguf models (#12901, commented on May 15, 2025 • 0 new comments)
- Misc. bug: llama-bench --tensor-split handling is broken (#12917, commented on May 15, 2025 • 0 new comments)
- Compile bug: macro "DECL_FATTN_MMA_F16_CASE" requires 3 arguments, but only 2 given (#12921, commented on May 15, 2025 • 0 new comments)
- Misc. bug: llama-server "terminate called after throwing an instance of 'std::runtime_error'" (#12939, commented on May 15, 2025 • 0 new comments)
- Model conversion issue (#12941, commented on May 15, 2025 • 0 new comments)
- Feature Request: Granite 4 Support (#13275, commented on May 14, 2025 • 0 new comments)
- Eval bug: Qwen3, failed to parse chat template (jinja) (#13178, commented on May 14, 2025 • 0 new comments)
- CUDA: update build CTK version to 12.8 (#13360, commented on May 14, 2025 • 0 new comments)
- SYCL: Fix test-backend-ops crashes with SYCL-Graph (#13357, commented on May 12, 2025 • 0 new comments)
- [Perf] [CPU] eliminate redundant memory access in group query attention (#13319, commented on May 12, 2025 • 0 new comments)
- Added dynamic context size, useful for servers running llama models as a service (#13295, commented on May 11, 2025 • 0 new comments)
- Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196, commented on May 12, 2025 • 0 new comments)
- [CANN] Update CANN model support status (#13162, commented on May 14, 2025 • 0 new comments)
- musa: add support for muBLAS and MMA (#13149, commented on May 8, 2025 • 0 new comments)
- quantize: Handle user-defined pruning of whole layers (blocks) (#13037, commented on May 11, 2025 • 0 new comments)
- convert : write tensors in parallel (#12837, commented on May 8, 2025 • 0 new comments)
- opencl: fix a couple of crashes (#12795, commented on May 14, 2025 • 0 new comments)
- Update llama-quant.cpp llama_tensor_get_type with DeepSeek-friendly modifications (#12727, commented on May 8, 2025 • 0 new comments)
- imatrix: add option to display importance score statistics for a given imatrix file (#12718, commented on May 12, 2025 • 0 new comments)
- tts : implement sesame CSM + Mimi decoder (#12648, commented on May 12, 2025 • 0 new comments)
- opencl: Add support for multiple devices (#12622, commented on May 14, 2025 • 0 new comments)
- `server`: streaming of tool calls and thoughts when `--jinja` is on (#12379, commented on May 14, 2025 • 0 new comments)
- PR: Refine ggml-hexagon backend (Qualcomm Hexagon NPU backend) for latest ggml, whisper.cpp, llama.cpp (#12326, commented on May 11, 2025 • 0 new comments)
- vulkan: optimization proposals for coopmat1 mul_mm (#12260, commented on May 10, 2025 • 0 new comments)
- Misc. bug: llama-quantize clobbers input file + crashes when output file matches (#12753, commented on May 14, 2025 • 0 new comments)
- Compile bug: llama.cpp-master/ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp:80:54: error: '_mm256_set_m128i' was not declared in this scope (#11385, commented on May 11, 2025 • 0 new comments)
- Prompt eval is 5x slower than in Ollama and maxes out the CPU (#12237, commented on May 11, 2025 • 0 new comments)
- Feature Request: Slim Attention (lossless 2x reduction in KV cache size) (#12359, commented on May 11, 2025 • 0 new comments)
- Eval bug: Accuracy drops when converting Qwen2_VL_7B_Instruct to gguf (#12538, commented on May 11, 2025 • 0 new comments)
- Misc. bug: convert_hf_to_gguf.py fails to convert the model of architecture T5ForConditionalGeneration (#12862, commented on May 11, 2025 • 0 new comments)
- Eval bug: Assertion _LIBCPP_ASSERT_VALID_ELEMENT_ACCESS while using a particular model (#12877, commented on May 11, 2025 • 0 new comments)
- Eval bug: add support for https://huggingface.co/ (#12884, commented on May 11, 2025 • 0 new comments)
- Eval bug: moonshotai/Moonlight-16B-A3B-Instruct (#12880, commented on May 11, 2025 • 0 new comments)
- Misc. bug: llama-server webui overriding command line parameters (#13277, commented on May 10, 2025 • 0 new comments)
- Eval bug: Regex (#13347, commented on May 10, 2025 • 0 new comments)
- Compile bug: Build failure for Intel oneMKL on Windows (#12478, commented on May 10, 2025 • 0 new comments)
- Add support for gemma 3 in the server? (#12762, commented on May 10, 2025 • 0 new comments)
- CUDA performance bug when two cards are visible and only one is used (#12838, commented on May 10, 2025 • 0 new comments)
- Eval bug: llama-server can only load 27 layers into Vulkan, but llama-run can load 33 layers for no apparent reason (#12840, commented on May 10, 2025 • 0 new comments)
- Eval bug: llama_model_load: error loading model hyperparameters: key not found in model: llama.context_length (#12857, commented on May 10, 2025 • 0 new comments)
- Compile bug: compiling llama.cpp for HIP (elementaryOS 8/ubuntu 24.04, rocm 6.4.0, gfx1100) using the installation guide fails (#13340, commented on May 9, 2025 • 0 new comments)
- Feature Request: add jina embeddings model, available to convert to gguf (#12327, commented on May 9, 2025 • 0 new comments)
- OpenCL: Performance comparison depending on gpu_offloads (#12810, commented on May 9, 2025 • 0 new comments)
- Llama 4 convert_hf_to_gguf.py tokenizer error (#12819, commented on May 9, 2025 • 0 new comments)
- Misc. bug: Qwen 3.0 "enable_thinking" parameter not working (#13160, commented on May 8, 2025 • 0 new comments)
- Eval bug: Qwen3 Q4_0 not working with SYCL (#13163, commented on May 8, 2025 • 0 new comments)
- changelog : `libllama` API (#9289, commented on May 8, 2025 • 0 new comments)
- Misc. bug: Inconsistent Vulkan segfault (#10528, commented on May 14, 2025 • 0 new comments)
- (Discussion) Improve usability of llama-server (#13367, commented on May 14, 2025 • 0 new comments)
- Feature Request: Qwen2.5-Omni (#12673, commented on May 14, 2025 • 0 new comments)
- Eval bug: ggml_vulkan: Device memory allocation of size N failed with ub > 4096 and c > 4096 and b > 4096 (#12817, commented on May 14, 2025 • 0 new comments)
- Eval bug: ROCm error: CUBLAS_STATUS_INTERNAL_ERROR (#12878, commented on May 14, 2025 • 0 new comments)
- Misc. bug: gguf-my-repo doesn't work - [Errno 2] No such file or directory: './llama.cpp/llama-quantize' (#12925, commented on May 14, 2025 • 0 new comments)
- Misc. bug: llama-server does not read the "--keep" param that the user passes on the CLI (#12927, commented on May 14, 2025 • 0 new comments)
- Eval bug: Can't run Qwen3-32B Q4_K_XL (#13298, commented on May 13, 2025 • 0 new comments)
- Move gguf fuzzers to the llama.cpp repository (#11514, commented on May 13, 2025 • 0 new comments)
- Feature Request: moondream2 vlm support in mtmd (#13332, commented on May 13, 2025 • 0 new comments)
- Feature Request: Add support of convert.py for model Qwen2.5-Omni-7B (#12641, commented on May 13, 2025 • 0 new comments)
- Feature Request: XiaomiMiMo/MiMo-7B-RL (#13218, commented on May 13, 2025 • 0 new comments)
- Qwen3-8B and other models generate garbage output / repeat tokens (GGGGGG...) in llama.cpp via LM Studio Vulkan backend (#13310, commented on May 13, 2025 • 0 new comments)
- Feature Request: Free up VRAM when llama-server not in use (#11703, commented on May 13, 2025 • 0 new comments)
- Feature Request: Qwen 2.5 VL (#11483, commented on May 12, 2025 • 0 new comments)
- Feature Request: NUMA-aware MoE Expert Allocation for Improved Performance (#11333, commented on May 12, 2025 • 0 new comments)
- Eval bug: Crash in trim method (#12710, commented on May 12, 2025 • 0 new comments)
- multiple_choice_score : task 17 does not fit in the context window (#12905, commented on May 12, 2025 • 0 new comments)
- How to use *chat_template* with .gguf models? (tokenizer_name not implemented) (#12897, commented on May 12, 2025 • 0 new comments)
- Misc. bug: Completions hang after CUDA error, but health endpoint reports all OK (#13281, commented on May 11, 2025 • 0 new comments)
- changelog : `llama-server` REST API (#9291, commented on May 11, 2025 • 0 new comments)
- Feature Request: Support for Qwen2-VL (#9246, commented on May 11, 2025 • 0 new comments)