Releases: ggml-org/llama.cpp

b7017

11 Nov 01:06
7bef684

models : move build_inp_out_ids outside loop (#17151)

* move build_inp_out_ids outside loop

* realign
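
The change is the classic loop-hoisting pattern: build_inp_out_ids() does not depend on the layer index, so the tensor can be built once before the per-layer loop rather than inside it. A simplified fragment of the pattern (the surrounding graph-builder context is abbreviated, not the exact diff):

```cpp
// hoisted: built once instead of once per layer
ggml_tensor * inp_out_ids = build_inp_out_ids();

for (int il = 0; il < n_layer; ++il) {
    // ... attention / FFN graph for layer il ...
    if (il == n_layer - 1 && inp_out_ids) {
        // keep only the rows whose outputs are actually needed
        cur = ggml_get_rows(ctx0, cur, inp_out_ids);
    }
}
```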

b7016

11 Nov 00:05
395e286

cpu: skip NOPs to avoid barriers (#17133)

* cpu: skip NOPs to avoid barriers

* cpu: use ggml_op_is_empty
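
The CPU backend synchronizes its worker threads with a barrier between graph nodes; for no-op nodes (views, reshapes, permutes) that barrier is pure overhead. A rough sketch of the idea, assuming a simplified compute loop (ggml_op_is_empty is the helper named in the commit; the loop shape and the other internal calls are abbreviated):

```cpp
for (int i = 0; i < cgraph->n_nodes; ++i) {
    struct ggml_tensor * node = cgraph->nodes[i];

    if (ggml_op_is_empty(node->op)) {
        continue; // NOP (NONE/VIEW/RESHAPE/PERMUTE/TRANSPOSE): no work, no barrier
    }

    ggml_compute_forward(&params, node); // real work for this node
    ggml_barrier(tp);                    // threads sync only after real ops
}
```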

b7015

10 Nov 23:15
13730c1

metal : cap threadgroups size of set_rows (#17146)
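
The fix presumably clamps the set_rows threadgroup size to the pipeline's device limit; a minimal sketch of the clamping idea with placeholder names (Metal host code is Objective-C++, so this is illustrative only):

```cpp
// nrows and max_threads_per_threadgroup are placeholders for the kernel's
// row count and e.g. pipeline.maxTotalThreadsPerThreadgroup
const int nth = std::min(nrows, max_threads_per_threadgroup); // never exceed the cap
```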

b7014

10 Nov 22:52
967eb4b

ggml-cpu : inspect -march and -mcpu to find the CPU (#16333)

Signed-off-by: Adrien Gallouët <[email protected]>
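
The idea is to derive the target CPU from an explicit -march=/-mcpu= in the compiler flags instead of probing the build host. A self-contained sketch under that assumption (cpu_from_flags and its caller are hypothetical; the real change lives in the ggml-cpu build logic):

```cpp
#include <iostream>
#include <optional>
#include <sstream>
#include <string>

// return the CPU named by the last -march=/-mcpu= flag, if any
static std::optional<std::string> cpu_from_flags(const std::string & flags) {
    std::istringstream iss(flags);
    std::string tok;
    std::optional<std::string> cpu;
    while (iss >> tok) {
        for (const std::string prefix : {"-march=", "-mcpu="}) {
            if (tok.rfind(prefix, 0) == 0) {
                cpu = tok.substr(prefix.size()); // last matching flag wins, as with GCC/Clang
            }
        }
    }
    return cpu;
}

int main() {
    std::cout << cpu_from_flags("-O2 -mcpu=neoverse-n1").value_or("host") << "\n";
}
```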

b7013

10 Nov 20:21
f117be1

vulkan: check glslc executable string (#17144)

b7012

10 Nov 19:49
85234a4

vulkan: fix validation issue introduced by #16868 (#17145)

b7011

10 Nov 18:37
0c74f32

memory: Hybrid context shift (#17009)

* feat(memory): Only fail partial erasure of recurrent tail

The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success: the cache holds no per-token memory before the final token, so
there is nothing to remove.

There is one potential case this doesn't address: pruning the cache to
scrub sensitive data from the context. That would not work with partial
(mid-sequence) removal from the attention cache either, since the KV state
is linearly dependent and states at later sequence positions would still
derive from the sensitive data even once it is no longer cached. So the
case doesn't seem relevant, but it is worth noting that under this change
the semantics of a partial erasure in the middle of the cache are
essentially "my context is already compressed," not "all trace of the
removed tokens has been removed."

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>
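
In code terms, the rule described above might look like the following sketch (hypothetical and simplified; the real logic lives in llama.cpp's recurrent memory implementation, [p0, p1) is the range to erase, and pos_last is the position of the final stored token):

```cpp
#include <cstdint>

using llama_pos = int32_t; // as in llama.h

static bool recurrent_seq_rm(llama_pos p0, llama_pos p1, llama_pos pos_last) {
    if (p1 <= pos_last || p0 > pos_last) {
        return true;  // range misses the final token: nothing is stored there
    }
    if (p0 > 0) {
        return false; // tail-only erasure: the in-place state cannot be rolled
                      // back to position p0 - 1
    }
    return true;      // p0 == 0: the whole sequence goes, which is representable
}
```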

* fix(main): Check the output of seq_rm for prefix matching

This prefix matching explicitly attempts to remove the tokens at the end of
the sequence that don't match. That is exactly the operation a recurrent
cache cannot perform, because its state is updated in place, so if the
removal fails we need to clear the whole cache (see the sketch at the end
of this entry).

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>

* fix(memory): Fix condition for partial erasure failure if p0 > pos

Signed-off-by: Gabe Goodhart <[email protected]>

Co-authored-by: compilade <[email protected]>

* style: Fix extra parens

Signed-off-by: Gabe Goodhart <[email protected]>

Co-authored-by: Georgi Gerganov <[email protected]>

* fix(main.cpp): Set n_matching_session_tokens to 0 on cache clear

https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: compilade <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
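
Taken together, a sketch of the caller-side fallback (the llama_memory_* names follow llama.h; the surrounding session-restore variables are assumptions, not the exact diff):

```cpp
llama_memory_t mem = llama_get_memory(ctx);

// try to drop only the non-matching tail of the restored prompt
if (!llama_memory_seq_rm(mem, 0, n_matching_session_tokens, -1)) {
    // a recurrent/hybrid cache cannot roll back its tail in place, so fall
    // back to a full clear and reprocess the prompt from scratch
    llama_memory_clear(mem, true);
    n_matching_session_tokens = 0;
}
```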

b7010

10 Nov 17:28
c27efd2

metal : enable tensor API for A19 (#17087)

b7009

10 Nov 14:52
df70bed

arm64: add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_do…

b7008

10 Nov 11:44
f914544

batched-bench : add "separate text gen" mode (#17103)