kv-cache : fix out-of-bounds view during reserve graph #13547

Merged: 4 commits into master from gg/kv-cache-fix-reserve on May 14, 2025

Conversation

ggerganov (Member)

fix #13359

See the comments added in the code.
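(Editorial note for context, with made-up numbers: suppose size = n_ctx = 4096 and a stale head = 4000 is left over from a previous state. When the reserve graph simulates a full cache for a worst-case batch of n_tokens = 512, the K/V store views quoted later in this thread start at row head and span n_tokens rows, i.e. rows 4000..4511, which runs past the end of the 4096-row buffer. Pinning head to 0 while simulating the full cache keeps those views at rows 0..511, inside the buffer, without changing any tensor shapes.)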

ggerganov requested a review from slaren on May 14, 2025 at 19:06

slaren (Member) left a comment

A comment in the declaration of set_full explaining the purpose would also help.

Comment on lines 445 to 446:

    // when simulating a full KV cache, the specific value of the "head" pointer is not important because we are not
    // going to write any data - we just want to measure the memory needed by the graph in such state.
slaren (Member)

Maybe I am misunderstanding, but setting it to zero should be necessary to get the biggest KQ possible, right? We would want every token in the cache to be used in the attention to estimate the worst case.

ggerganov (Member Author)

The K*Q size is not determined by the head, but by n (also referred to as n_kv in some places). The head is only used to offset the views into which we store the new K and V data from the current batch, here:

llama.cpp/src/llama-graph.cpp, lines 1421 to 1450 (at commit 9745d5f):

    // store to KV cache
    {
        const auto kv_head = kv_self->head;

        GGML_ASSERT(kv_self->size == n_ctx);

        ggml_tensor * k_cache_view = ggml_view_1d(ctx0, kv_self->k_l[il], n_tokens*n_embd_k_gqa, ggml_row_size(kv_self->k_l[il]->type, n_embd_k_gqa)*kv_head);
        //cb(k_cache_view, "k_cache_view", il);

        // note: storing RoPE-ed version of K in the KV cache
        ggml_build_forward_expand(gf, ggml_cpy(ctx0, k_cur, k_cache_view));

        v_cur = ggml_reshape_2d(ctx0, v_cur, n_embd_v_gqa, n_tokens);

        ggml_tensor * v_cache_view = nullptr;

        if (!v_trans) {
            v_cache_view = ggml_view_1d(ctx0, kv_self->v_l[il], n_tokens*n_embd_v_gqa, ggml_row_size(kv_self->v_l[il]->type, n_embd_v_gqa)*kv_head);
        } else {
            // note: the V cache is transposed when not using flash attention
            v_cache_view = ggml_view_2d(ctx0, kv_self->v_l[il], n_tokens, n_embd_v_gqa,
                    (  n_ctx)*ggml_element_size(kv_self->v_l[il]),
                    (kv_head)*ggml_element_size(kv_self->v_l[il]));

            v_cur = ggml_transpose(ctx0, v_cur);
        }
        //cb(v_cache_view, "v_cache_view", il);

        ggml_build_forward_expand(gf, ggml_cpy(ctx0, v_cur, v_cache_view));
    }

The head should never affect the shapes of the tensors, only the offsets of the K/V views.
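(To make the distinction concrete, an editorial sketch, not code from this PR: the view on the attention side that feeds the K*Q product is taken from offset 0 and spans n_kv cells, so its shape is controlled entirely by n_kv. The snippet assumes the same graph-build context as the code above, i.e. ctx0, il, kv_self and q, plus the usual per-head dimensions n_embd_head_k and n_head_kv.)

    // editorial sketch: the K view used for attention covers the first n_kv
    // cells of the cache, starting at offset 0; its shape depends on n_kv,
    // never on kv_self->head
    ggml_tensor * k =
        ggml_view_3d(ctx0, kv_self->k_l[il],
                n_embd_head_k, n_kv, n_head_kv,
                ggml_row_size(kv_self->k_l[il]->type, n_embd_k_gqa),
                ggml_row_size(kv_self->k_l[il]->type, n_embd_head_k),
                0);

    // the result has n_kv elements along its first dimension, so the "biggest
    // KQ" comes from making n_kv as large as possible (n_kv == size),
    // regardless of where head points
    ggml_tensor * kq = ggml_mul_mat(ctx0, k, q);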

ggerganov (Member Author)

> A comment in the declaration of set_full explaining the purpose would also help.

There is a comment in the abstract interface:

    // simulate full cache, used for allocating worst-case compute buffers
    virtual void set_full() = 0;
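(For reference, an editorial sketch of an implementation along these lines; the class name and the exact member names n, head and size are assumptions based on this discussion, not copied from the repository.)

    // editorial sketch: simulate a full cache so the reserve graph is sized
    // for the worst case; no data is actually written in this state
    void llama_kv_cache_unified::set_full() {
        n    = size; // attention sees all cells -> largest possible K*Q
        head = 0;    // keeps the per-batch K/V store views inside the buffer
    }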

ggerganov force-pushed the gg/kv-cache-fix-reserve branch from 55b5ae1 to 494757e on May 14, 2025 at 19:47
ggerganov merged commit e3a9421 into master on May 14, 2025 (1 check passed)
ggerganov deleted the gg/kv-cache-fix-reserve branch on May 14, 2025 at 20:15