kv-cache : fix out-of-bounds view during reserve graph #13547
Conversation
A comment in the declaration of set_full explaining the purpose would also help.
src/llama-kv-cache.cpp (Outdated)
// when simulating a full KV cache, the specific value of the "head" pointer is not important because we are not
// going to write any data - we just want to measure the memory needed by the graph in such state.
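For context, a minimal sketch of what set_full could look like with such a comment, assuming the unified cache keeps head, size and n members (the member names and the surrounding class are assumptions based on this discussion, not a verbatim copy of the patch):

// sketch only - not the verbatim patch
void llama_kv_cache_unified::set_full() {
    n = size;

    // when simulating a full KV cache, the specific value of the "head" pointer is not important
    // because we are not going to write any data - we just want to measure the memory needed by
    // the graph in such a state
    head = 0;
}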
Maybe I am misunderstanding, but setting it to zero should be necessary to get the biggest KQ possible, right? We would want every token in the cache to be used in the attention to estimate the worst case.
The K*Q size is not determined by the head, but rather by the n (also referred to as n_kv in some places). The head is only used to offset the views into which we store the new K and V data from the current batch here:
Lines 1421 to 1450 in 9745d5f
// store to KV cache
{
    const auto kv_head = kv_self->head;

    GGML_ASSERT(kv_self->size == n_ctx);

    ggml_tensor * k_cache_view = ggml_view_1d(ctx0, kv_self->k_l[il], n_tokens*n_embd_k_gqa, ggml_row_size(kv_self->k_l[il]->type, n_embd_k_gqa)*kv_head);
    //cb(k_cache_view, "k_cache_view", il);

    // note: storing RoPE-ed version of K in the KV cache
    ggml_build_forward_expand(gf, ggml_cpy(ctx0, k_cur, k_cache_view));

    v_cur = ggml_reshape_2d(ctx0, v_cur, n_embd_v_gqa, n_tokens);

    ggml_tensor * v_cache_view = nullptr;

    if (!v_trans) {
        v_cache_view = ggml_view_1d(ctx0, kv_self->v_l[il], n_tokens*n_embd_v_gqa, ggml_row_size(kv_self->v_l[il]->type, n_embd_v_gqa)*kv_head);
    } else {
        // note: the V cache is transposed when not using flash attention
        v_cache_view = ggml_view_2d(ctx0, kv_self->v_l[il], n_tokens, n_embd_v_gqa,
                (n_ctx)*ggml_element_size(kv_self->v_l[il]),
                (kv_head)*ggml_element_size(kv_self->v_l[il]));

        v_cur = ggml_transpose(ctx0, v_cur);
    }
    //cb(v_cache_view, "v_cache_view", il);

    ggml_build_forward_expand(gf, ggml_cpy(ctx0, v_cur, v_cache_view));
}
The head should never affect the shapes of the tensors - just the offsets of the K/V views.
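To illustrate the contrast, here is a minimal sketch of how the K view used in the attention could be built (n_kv, n_embd_head_k and n_head_kv follow the usual llama.cpp conventions, but this is not a verbatim excerpt from this commit): the shape of K, and therefore of K*Q, depends only on n_kv, and head never appears.

// the K view spans the first n_kv cells of the cache, starting at offset 0
ggml_tensor * k = ggml_view_3d(ctx0, kv_self->k_l[il],
        n_embd_head_k, n_kv, n_head_kv,
        ggml_row_size(kv_self->k_l[il]->type, n_embd_k_gqa),
        ggml_row_size(kv_self->k_l[il]->type, n_embd_head_k),
        0); // offset 0 - head plays no role here

// K*Q has shape [n_kv, n_tokens, n_head], so its memory footprint scales with n_kv, not with head
ggml_tensor * kq = ggml_mul_mat(ctx0, k, q);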
There is a comment in the abstract interface: llama.cpp/src/llama-kv-cache.h Lines 35 to 38 in 9745d5f
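The referenced snippet is not reproduced above; presumably it declares set_full roughly along these lines (both the comment wording and the exact signature are assumptions):

// assumed shape of the declaration in src/llama-kv-cache.h - the actual wording may differ
// simulate a full cache, used when reserving worst-case compute buffers
virtual void set_full() = 0;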
55b5ae1 to 494757e
Fixes #13359. See the comments added in the code.