MLA + FA now only uses K-cache - 47% saving on KV-cache size (only for use with #13435 for now) #13529


Draft
wants to merge 11 commits into base: master

Conversation

@jukofyork (Collaborator) commented May 14, 2025

From #13435:

Going forward I would suggest rewriting the code in the other backends as well to use only the K tensor. The KV cache size could then be reduced by ~47% by simply not allocating and filling the V cache. At least as long as FlashAttention is used this should be relatively simple, so for a first version it would, I think, be fine to only deduplicate K and V when FA is used.

I've only tested this with #13435 for now, but it should still work with the other backends' flash attention implementations, as long as they don't assume that the V-cache they are passed is contiguous:

    } else {
        // note: MLA with flash attention now uses the last 512 elements of the K-cache in place of a V-cache
        v = ggml_view_3d(ctx0, kv_self->k_l[il],
                n_embd_head_v, n_kv, n_head_kv,
                ggml_row_size(kv_self->k_l[il]->type, n_embd_k_gqa),  // byte stride between tokens
                ggml_row_size(kv_self->k_l[il]->type, n_embd_head_k), // byte stride between heads
                ggml_row_size(kv_self->k_l[il]->type, n_embd_head_k - n_embd_head_v)); // byte offset of the last n_embd_head_v elements (skips the first n_rot elements)
    }
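
For concreteness, here's a tiny standalone sketch (my own numbers, not code from the PR) of what that view works out to, assuming an f16 K-cache and DeepSeek-style head sizes (n_embd_head_k = 576, n_embd_head_v = 512) with a single joint MLA KV head:

#include <cstdio>

int main() {
    const int n_embd_head_k = 576; // 64-element RoPE part + 512-element latent part per K row
    const int n_embd_head_v = 512; // the tail of each K row that stands in for V
    const int elt_size      = 2;   // bytes per f16 element

    const int n_rot    = n_embd_head_k - n_embd_head_v; // 64 elements skipped at the start of each row
    const int offset_b = n_rot * elt_size;               // byte offset passed to ggml_view_3d
    const int stride_b = n_embd_head_k * elt_size;       // byte stride between consecutive tokens' K rows (single KV head assumed)

    printf("V view: %d elements per row, byte offset %d, row stride %d bytes\n",
           n_embd_head_v, offset_b, stride_b);
    return 0;
}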

The full context of 160k tokens now takes up less than 11 GiB:

llama_kv_cache_unified: kv_size = 163840, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0, padding = 256
llama_kv_cache_unified:      CUDA0 KV buffer size = 10980.00 MiB
llama_kv_cache_unified: KV self size  = 10980.00 MiB, K (f16): 10980.00 MiB, V (f16):    0.00 MiB

! 😮
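
For what it's worth, those numbers check out with some back-of-the-envelope arithmetic (assuming DeepSeek-R1's 61 layers and a 576-element f16 MLA K row per token per layer, 512 elements of which would previously also have been duplicated into the V-cache):

#include <cstdio>

int main() {
    const long long n_ctx   = 163840; // kv_size from the log above
    const long long n_layer = 61;
    const long long k_row   = 576;    // K elements stored per token per layer
    const long long v_row   = 512;    // what a separate V-cache would have stored
    const long long f16     = 2;      // bytes per element

    const double MiB    = 1024.0 * 1024.0;
    const double k_mib  = n_ctx * n_layer * k_row * f16 / MiB;            // ~10980 MiB
    const double kv_mib = n_ctx * n_layer * (k_row + v_row) * f16 / MiB;  // ~20740 MiB

    printf("K-cache only: %.0f MiB\n", k_mib);
    printf("K + V cache : %.0f MiB\n", kv_mib);
    printf("saving      : %.1f%%\n", 100.0 * (1.0 - k_mib / kv_mib));     // ~47%
    return 0;
}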


I've just disabled context shifting for now because, like I said in the other post, I'm not at all confident I can cleanly change everything that's required to deal with the empty V-cache:

[screenshot omitted]


I will leave this as a draft for now, as the other backends' flash attention implementations need to either:

A. Be checked to confirm they work with the non-contiguous V-cache view passed to them, or
B. Copy @JohannesGaessler's strategy of using only the last 512 elements of the K-cache in place of the V-cache.

I think (B) is preferable, as there are likely to be some significant gains possible regarding CPU cache (re-)use, etc.
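
To illustrate why (B) should help: with the V values living in the tail of each K row, an FA kernel only ever has to stream K-cache rows through memory. Here's a hypothetical scalar sketch (plain C++, not any backend's actual flash attention code) of a single-query attention step over such a cache:

#include <algorithm>
#include <cmath>
#include <vector>

// q:      [n_embd_head_k] query for one head
// kcache: [n_kv][n_embd_head_k] rows; the last n_embd_head_v elements of each
//         row double as the "V" values, so no separate V-cache is ever read
// out:    [n_embd_head_v] attention output
static void attn_single_query(const float * q, const float * kcache,
                              int n_kv, int n_embd_head_k, int n_embd_head_v,
                              float * out) {
    const int   n_rot = n_embd_head_k - n_embd_head_v;
    const float scale = 1.0f / std::sqrt((float) n_embd_head_k);

    // attention scores against every cached K row
    std::vector<float> s(n_kv);
    float smax = -INFINITY;
    for (int i = 0; i < n_kv; ++i) {
        const float * k = kcache + (size_t) i * n_embd_head_k;
        float dot = 0.0f;
        for (int d = 0; d < n_embd_head_k; ++d) {
            dot += q[d] * k[d];
        }
        s[i] = dot * scale;
        smax = std::max(smax, s[i]);
    }

    // softmax normalization
    float sum = 0.0f;
    for (int i = 0; i < n_kv; ++i) {
        s[i] = std::exp(s[i] - smax);
        sum += s[i];
    }

    // weighted sum of the "V" values, read straight out of the tail of each K row
    std::fill(out, out + n_embd_head_v, 0.0f);
    for (int i = 0; i < n_kv; ++i) {
        const float * v = kcache + (size_t) i * n_embd_head_k + n_rot;
        const float   w = s[i] / sum;
        for (int d = 0; d < n_embd_head_v; ++d) {
            out[d] += w * v[d];
        }
    }
}

The K row that was just used for the Q·K dot product is the same memory the V accumulation then reads from, which is where the cache (re-)use gains would come from.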

@jukofyork changed the title from "MLA + FA now only uses K-cache - 47% saving on KV-cache szie (only for use with #13435 for now)" to "MLA + FA now only uses K-cache - 47% saving on KV-cache size (only for use with #13435 for now)" on May 14, 2025
@pwilkin (Contributor) commented May 14, 2025

Wow, this is pretty huge. Would go a long way towards supporting long contexts on potato devices.

@jukofyork (Collaborator, Author)

You'll need to do something like this to get the merge of the two PRs:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

git fetch origin pull/13435/head:pr-13435
git fetch origin pull/13529/head:pr-13529

git merge pr-13435 --no-edit
git merge pr-13529 --no-edit

(there may be a better way, but I'm pretty dumb when it comes to using git...)

@jukofyork (Collaborator, Author) commented May 14, 2025

Wow, this is pretty huge. Would go a long way towards supporting long contexts on potato devices.

Yeah, it's nuts: I can now get 160k context, with a ubatch of 4096 using a Q6_K model (with non-shared experts stored in RAM), all on a single 32GB RTX 5000 Ada card!

@Panchovix

#13435 got merged, so this could be converted back to a PR now, I think.

@jukofyork (Collaborator, Author)

#13435 got merged, so this could be converted back to a PR now, I think.

It really needs people to test whether the other back-ends can handle a non-contiguous V-cache view like this first, or preferably to do as @JohannesGaessler suggested and make the other back-ends' FA implementations use just the K-cache like he did.

@ggerganov (Member)

#13435 got merged, so this could be converted back to a PR now, I think.

It really needs people to test whether the other back-ends can handle a non-contiguous V-cache view like this first, or preferably to do as @JohannesGaessler suggested and make the other back-ends' FA implementations use just the K-cache like he did.

The CPU implementation does not seem to support it. This command generates junk for me:

make -j && ./bin/llama-cli -m ../models/deepseek-v2-lite-chat/ggml-model-q8_0-mla.gguf -no-cnv -p "I believe the meaning of life is" --top-k 1 -n 32 -fa -dev none

Maybe look into fixing it and adding test-backend-ops tests that verify this use case. It will make it easier to support it in the rest of the backends.

@bartowski1182 (Contributor)

Does this have any speed implications from not storing the values?

@jukofyork (Collaborator, Author)

#13435 got merged, so this could be converted back to a PR now, I think.

It really needs people to test whether the other back-ends can handle a non-contiguous V-cache view like this first, or preferably to do as @JohannesGaessler suggested and make the other back-ends' FA implementations use just the K-cache like he did.

The CPU implementation does not seem to support it. This command generates junk for me:

make -j && ./bin/llama-cli -m ../models/deepseek-v2-lite-chat/ggml-model-q8_0-mla.gguf -no-cnv -p "I believe the meaning of life is" --top-k 1 -n 32 -fa -dev none

Maybe look into fixing it and adding test-backend-ops tests that verify this use case. It will make it easier to support it in the rest of the backends.

Yeah, I think having a non-cont V-cache is likely to break most of the backends.

I'll try and have a look at fixing this and adding the test today or tomorrow.

@jukofyork (Collaborator, Author) commented May 15, 2025

Does this have any speed implications from not storing the values?

If the backends are rewritten to use only the K-cache, then there could be a big performance improvement for some of them, since they no longer have to touch a separate V-cache in memory (e.g. a CPU's L3 cache will only have to hold K-cache elements).

If the backends are instead just fixed to accept the non-contiguous view of the K-cache in place of the V-cache, though, there will likely be some degree of performance reduction.
