kv-cache : fix out-of-bounds view during reserve graph #13547

Merged: 4 commits into master from gg/kv-cache-fix-reserve on May 14, 2025

Conversation

ggerganov (Member)

fix #13359

See the comments added in the code.
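(Editorial note for context, with made-up numbers: suppose size = n_ctx = 4096 and a stale head = 4000 is left over from a previous state. When the reserve graph simulates a full cache for a worst-case batch of n_tokens = 512, the K/V store views quoted later in this thread start at row head and span n_tokens rows, i.e. rows 4000..4511, which runs past the end of the 4096-row buffer. Pinning head to 0 while simulating the full cache keeps those views at rows 0..511, inside the buffer, without changing any tensor shapes.)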

ggerganov requested a review from slaren on May 14, 2025 at 19:06

slaren (Member) left a comment

A comment in the declaration of set_full explaining the purpose would also help.

Comment on lines 445 to 446:

    // when simulating a full KV cache, the specific value of the "head" pointer is not important because we are not
    // going to write any data - we just want to measure the memory needed by the graph in such state.
slaren (Member)

Maybe I am misunderstanding, but setting it to zero should be necessary to get the biggest KQ possible, right? We would want every token in the cache to be used in the attention to estimate the worst case.

ggerganov (Member Author)

The K*Q size is not determined by the head, but by n (also referred to as n_kv in some places). The head is only used to offset the views into which we store the new K and V data from the current batch, here:

llama.cpp/src/llama-graph.cpp, lines 1421 to 1450 (at commit 9745d5f):

    // store to KV cache
    {
        const auto kv_head = kv_self->head;

        GGML_ASSERT(kv_self->size == n_ctx);

        ggml_tensor * k_cache_view = ggml_view_1d(ctx0, kv_self->k_l[il], n_tokens*n_embd_k_gqa, ggml_row_size(kv_self->k_l[il]->type, n_embd_k_gqa)*kv_head);
        //cb(k_cache_view, "k_cache_view", il);

        // note: storing RoPE-ed version of K in the KV cache
        ggml_build_forward_expand(gf, ggml_cpy(ctx0, k_cur, k_cache_view));

        v_cur = ggml_reshape_2d(ctx0, v_cur, n_embd_v_gqa, n_tokens);

        ggml_tensor * v_cache_view = nullptr;

        if (!v_trans) {
            v_cache_view = ggml_view_1d(ctx0, kv_self->v_l[il], n_tokens*n_embd_v_gqa, ggml_row_size(kv_self->v_l[il]->type, n_embd_v_gqa)*kv_head);
        } else {
            // note: the V cache is transposed when not using flash attention
            v_cache_view = ggml_view_2d(ctx0, kv_self->v_l[il], n_tokens, n_embd_v_gqa,
                    (  n_ctx)*ggml_element_size(kv_self->v_l[il]),
                    (kv_head)*ggml_element_size(kv_self->v_l[il]));

            v_cur = ggml_transpose(ctx0, v_cur);
        }
        //cb(v_cache_view, "v_cache_view", il);

        ggml_build_forward_expand(gf, ggml_cpy(ctx0, v_cur, v_cache_view));
    }

The head should never affect the shapes of the tensors, only the offsets of the K/V views.
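(To make the distinction concrete, an editorial sketch, not code from this PR: the view on the attention side that feeds the K*Q product is taken from offset 0 and spans n_kv cells, so its shape is controlled entirely by n_kv. The snippet assumes the same graph-build context as the code above, i.e. ctx0, il, kv_self and q, plus the usual per-head dimensions n_embd_head_k and n_head_kv.)

    // editorial sketch: the K view used for attention covers the first n_kv
    // cells of the cache, starting at offset 0; its shape depends on n_kv,
    // never on kv_self->head
    ggml_tensor * k =
        ggml_view_3d(ctx0, kv_self->k_l[il],
                n_embd_head_k, n_kv, n_head_kv,
                ggml_row_size(kv_self->k_l[il]->type, n_embd_k_gqa),
                ggml_row_size(kv_self->k_l[il]->type, n_embd_head_k),
                0);

    // the result has n_kv elements along its first dimension, so the "biggest
    // KQ" comes from making n_kv as large as possible (n_kv == size),
    // regardless of where head points
    ggml_tensor * kq = ggml_mul_mat(ctx0, k, q);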

ggerganov (Member Author)

> A comment in the declaration of set_full explaining the purpose would also help.

There is a comment in the abstract interface:

    // simulate full cache, used for allocating worst-case compute buffers
    virtual void set_full() = 0;
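(For reference, an editorial sketch of an implementation along these lines; the class name and the exact member names n, head and size are assumptions based on this discussion, not copied from the repository.)

    // editorial sketch: simulate a full cache so the reserve graph is sized
    // for the worst case; no data is actually written in this state
    void llama_kv_cache_unified::set_full() {
        n    = size; // attention sees all cells -> largest possible K*Q
        head = 0;    // keeps the per-batch K/V store views inside the buffer
    }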

ggerganov force-pushed the gg/kv-cache-fix-reserve branch from 55b5ae1 to 494757e on May 14, 2025 at 19:47
ggerganov merged commit e3a9421 into master on May 14, 2025 (1 check passed)
ggerganov deleted the gg/kv-cache-fix-reserve branch on May 14, 2025 at 20:15