Releases: ggml-org/llama.cpp

b7017

11 Nov 01:06
7bef684

models : move build_inp_out_ids outside loop (#17151)

* move build_inp_out_ids outside loop

* realign
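
The change is the classic loop-hoisting pattern: build_inp_out_ids() does not depend on the layer index, so the tensor can be built once before the per-layer loop rather than inside it. A simplified fragment of the pattern (the surrounding graph-builder context is abbreviated, not the exact diff):

```cpp
// hoisted: built once instead of once per layer
ggml_tensor * inp_out_ids = build_inp_out_ids();

for (int il = 0; il < n_layer; ++il) {
    // ... attention / FFN graph for layer il ...
    if (il == n_layer - 1 && inp_out_ids) {
        // keep only the rows whose outputs are actually needed
        cur = ggml_get_rows(ctx0, cur, inp_out_ids);
    }
}
```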

b7016

11 Nov 00:05
395e286

cpu: skip NOPs to avoid barriers (#17133)

* cpu: skip NOPs to avoid barriers

* cpu: use ggml_op_is_empty
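
The CPU backend synchronizes its worker threads with a barrier between graph nodes; for no-op nodes (views, reshapes, permutes) that barrier is pure overhead. A rough sketch of the idea, assuming a simplified compute loop (ggml_op_is_empty is the helper named in the commit; the loop shape and the other internal calls are abbreviated):

```cpp
for (int i = 0; i < cgraph->n_nodes; ++i) {
    struct ggml_tensor * node = cgraph->nodes[i];

    if (ggml_op_is_empty(node->op)) {
        continue; // NOP (NONE/VIEW/RESHAPE/PERMUTE/TRANSPOSE): no work, no barrier
    }

    ggml_compute_forward(&params, node); // real work for this node
    ggml_barrier(tp);                    // threads sync only after real ops
}
```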

b7015

10 Nov 23:15
13730c1

metal : cap threadgroups size of set_rows (#17146)
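
The fix presumably clamps the set_rows threadgroup size to the pipeline's device limit; a minimal sketch of the clamping idea with placeholder names (Metal host code is Objective-C++, so this is illustrative only):

```cpp
// nrows and max_threads_per_threadgroup are placeholders for the kernel's
// row count and e.g. pipeline.maxTotalThreadsPerThreadgroup
const int nth = std::min(nrows, max_threads_per_threadgroup); // never exceed the cap
```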

b7014

10 Nov 22:52
967eb4b

ggml-cpu : inspect -march and -mcpu to find the CPU (#16333)

Signed-off-by: Adrien Gallouët <[email protected]>
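
The idea is to derive the target CPU from an explicit -march=/-mcpu= in the compiler flags instead of probing the build host. A self-contained sketch under that assumption (cpu_from_flags and its caller are hypothetical; the real change lives in the ggml-cpu build logic):

```cpp
#include <iostream>
#include <optional>
#include <sstream>
#include <string>

// return the CPU named by the last -march=/-mcpu= flag, if any
static std::optional<std::string> cpu_from_flags(const std::string & flags) {
    std::istringstream iss(flags);
    std::string tok;
    std::optional<std::string> cpu;
    while (iss >> tok) {
        for (const std::string prefix : {"-march=", "-mcpu="}) {
            if (tok.rfind(prefix, 0) == 0) {
                cpu = tok.substr(prefix.size()); // last matching flag wins, as with GCC/Clang
            }
        }
    }
    return cpu;
}

int main() {
    std::cout << cpu_from_flags("-O2 -mcpu=neoverse-n1").value_or("host") << "\n";
}
```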

b7013

10 Nov 20:21
f117be1

vulkan: check glslc executable string (#17144)

b7012

10 Nov 19:49
85234a4

vulkan: fix validation issue introduced by #16868 (#17145)

b7011

10 Nov 18:37
0c74f32

memory: Hybrid context shift (#17009)

* feat(memory): Only fail partial erasure of recurrent tail

The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success: the cache holds no per-token memory before the final token, so
there is nothing to remove.

There is one potential case this doesn't address: pruning the cache to
scrub sensitive data from the context. That would not work with partial
(mid-sequence) removal from the attention cache either, since the KV state
is linearly dependent and states at later sequence positions would still
derive from the sensitive data even once it is no longer cached. So the
case doesn't seem relevant, but it is worth noting that under this change
the semantics of a partial erasure in the middle of the cache are
essentially "my context is already compressed," not "all trace of the
removed tokens has been removed."

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>
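
In code terms, the rule described above might look like the following sketch (hypothetical and simplified; the real logic lives in llama.cpp's recurrent memory implementation, [p0, p1) is the range to erase, and pos_last is the position of the final stored token):

```cpp
#include <cstdint>

using llama_pos = int32_t; // as in llama.h

static bool recurrent_seq_rm(llama_pos p0, llama_pos p1, llama_pos pos_last) {
    if (p1 <= pos_last || p0 > pos_last) {
        return true;  // range misses the final token: nothing is stored there
    }
    if (p0 > 0) {
        return false; // tail-only erasure: the in-place state cannot be rolled
                      // back to position p0 - 1
    }
    return true;      // p0 == 0: the whole sequence goes, which is representable
}
```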

* fix(main): Check the output of seq_rm for prefix matching

This prefix matching explicitly attempts to remove the tokens at the end of
the sequence that don't match. That is exactly the operation a recurrent
cache cannot perform, because its state is updated in place, so if the
removal fails we need to clear the whole cache (see the sketch at the end
of this entry).

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(memory): Fix condition for partial erasure failure if p0 > pos

Signed-off-by: Gabe Goodhart <[email protected]>

Co-authored-by: compilade <[email protected]>

* style: Fix extra parens

Signed-off-by: Gabe Goodhart <[email protected]>

Co-authored-by: Georgi Gerganov <[email protected]>

* fix(main.cpp): Set n_matching_session_tokens to 0 on cache clear

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: compilade <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
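
Taken together, a sketch of the caller-side fallback (the llama_memory_* names follow llama.h; the surrounding session-restore variables are assumptions, not the exact diff):

```cpp
llama_memory_t mem = llama_get_memory(ctx);

// try to drop only the non-matching tail of the restored prompt
if (!llama_memory_seq_rm(mem, 0, n_matching_session_tokens, -1)) {
    // a recurrent/hybrid cache cannot roll back its tail in place, so fall
    // back to a full clear and reprocess the prompt from scratch
    llama_memory_clear(mem, true);
    n_matching_session_tokens = 0;
}
```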

b7010

10 Nov 17:28
c27efd2

metal : enable tensor API for A19 (#17087)

b7009

10 Nov 14:52
df70bed

arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_do…

b7008

10 Nov 11:44
f914544

batched-bench : add "separate text gen" mode (#17103)