MLA + FA now only uses K-cache - 47% saving on KV-cache size (only for use with #13435 for now) #13529


Draft
wants to merge 11 commits into base: master

Conversation

@jukofyork (Collaborator) commented May 14, 2025

From #13435:

Going forward I would suggest rewriting the code in the other backends as well to use only the K tensor. The KV cache size could then be reduced by ~47% by simply not allocating and filling the V cache. At least as long as FlashAttention is used this should be relatively simple, so for a first version it would, I think, be fine to only deduplicate K and V when FA is used.

I've only tested this with #13435 for now, but it should still work with the other backends' flash attention implementations, as long as they don't assume that the V-cache they are passed is contiguous:

    } else {
        // note: MLA with flash attention now uses the last 512 elements of the K-cache in place of a V-cache
        v = ggml_view_3d(ctx0, kv_self->k_l[il],
                n_embd_head_v, n_kv, n_head_kv,
                ggml_row_size(kv_self->k_l[il]->type, n_embd_k_gqa),  // byte stride between tokens
                ggml_row_size(kv_self->k_l[il]->type, n_embd_head_k), // byte stride between heads
                ggml_row_size(kv_self->k_l[il]->type, n_embd_head_k - n_embd_head_v)); // byte offset of the last n_embd_head_v elements (skips the first n_rot elements)
    }
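
For concreteness, here's a tiny standalone sketch (my own numbers, not code from the PR) of what that view works out to, assuming an f16 K-cache and DeepSeek-style head sizes (n_embd_head_k = 576, n_embd_head_v = 512) with a single joint MLA KV head:

#include <cstdio>

int main() {
    const int n_embd_head_k = 576; // 64-element RoPE part + 512-element latent part per K row
    const int n_embd_head_v = 512; // the tail of each K row that stands in for V
    const int elt_size      = 2;   // bytes per f16 element

    const int n_rot    = n_embd_head_k - n_embd_head_v; // 64 elements skipped at the start of each row
    const int offset_b = n_rot * elt_size;               // byte offset passed to ggml_view_3d
    const int stride_b = n_embd_head_k * elt_size;       // byte stride between consecutive tokens' K rows (single KV head assumed)

    printf("V view: %d elements per row, byte offset %d, row stride %d bytes\n",
           n_embd_head_v, offset_b, stride_b);
    return 0;
}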

The full context of 160k tokens now takes up less than 11 GiB:

llama_kv_cache_unified: kv_size = 163840, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0, padding = 256
llama_kv_cache_unified:      CUDA0 KV buffer size = 10980.00 MiB
llama_kv_cache_unified: KV self size  = 10980.00 MiB, K (f16): 10980.00 MiB, V (f16):    0.00 MiB

! 😮
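
For what it's worth, those numbers check out with some back-of-the-envelope arithmetic (assuming DeepSeek-R1's 61 layers and a 576-element f16 MLA K row per token per layer, 512 elements of which would previously also have been duplicated into the V-cache):

#include <cstdio>

int main() {
    const long long n_ctx   = 163840; // kv_size from the log above
    const long long n_layer = 61;
    const long long k_row   = 576;    // K elements stored per token per layer
    const long long v_row   = 512;    // what a separate V-cache would have stored
    const long long f16     = 2;      // bytes per element

    const double MiB    = 1024.0 * 1024.0;
    const double k_mib  = n_ctx * n_layer * k_row * f16 / MiB;            // ~10980 MiB
    const double kv_mib = n_ctx * n_layer * (k_row + v_row) * f16 / MiB;  // ~20740 MiB

    printf("K-cache only: %.0f MiB\n", k_mib);
    printf("K + V cache : %.0f MiB\n", kv_mib);
    printf("saving      : %.1f%%\n", 100.0 * (1.0 - k_mib / kv_mib));     // ~47%
    return 0;
}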


I've just disabled context shifting for now because, like I said in the other post, I'm not at all confident I can cleanly change everything that's required to deal with the empty V-cache:

[screenshot omitted]


I will leave this as a draft for now, as the other backends' flash attention implementations need to either:

A. Be checked to confirm they work with the non-contiguous V-cache view passed to them, or
B. Copy @JohannesGaessler's strategy of using only the last 512 elements of the K-cache in place of the V-cache.

I think (B) is preferable, as there are likely to be some significant gains possible regarding CPU cache (re-)use, etc.
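
To illustrate why (B) should help: with the V values living in the tail of each K row, an FA kernel only ever has to stream K-cache rows through memory. Here's a hypothetical scalar sketch (plain C++, not any backend's actual flash attention code) of a single-query attention step over such a cache:

#include <algorithm>
#include <cmath>
#include <vector>

// q:      [n_embd_head_k] query for one head
// kcache: [n_kv][n_embd_head_k] rows; the last n_embd_head_v elements of each
//         row double as the "V" values, so no separate V-cache is ever read
// out:    [n_embd_head_v] attention output
static void attn_single_query(const float * q, const float * kcache,
                              int n_kv, int n_embd_head_k, int n_embd_head_v,
                              float * out) {
    const int   n_rot = n_embd_head_k - n_embd_head_v;
    const float scale = 1.0f / std::sqrt((float) n_embd_head_k);

    // attention scores against every cached K row
    std::vector<float> s(n_kv);
    float smax = -INFINITY;
    for (int i = 0; i < n_kv; ++i) {
        const float * k = kcache + (size_t) i * n_embd_head_k;
        float dot = 0.0f;
        for (int d = 0; d < n_embd_head_k; ++d) {
            dot += q[d] * k[d];
        }
        s[i] = dot * scale;
        smax = std::max(smax, s[i]);
    }

    // softmax normalization
    float sum = 0.0f;
    for (int i = 0; i < n_kv; ++i) {
        s[i] = std::exp(s[i] - smax);
        sum += s[i];
    }

    // weighted sum of the "V" values, read straight out of the tail of each K row
    std::fill(out, out + n_embd_head_v, 0.0f);
    for (int i = 0; i < n_kv; ++i) {
        const float * v = kcache + (size_t) i * n_embd_head_k + n_rot;
        const float   w = s[i] / sum;
        for (int d = 0; d < n_embd_head_v; ++d) {
            out[d] += w * v[d];
        }
    }
}

The K row that was just used for the Q·K dot product is the same memory the V accumulation then reads from, which is where the cache (re-)use gains would come from.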

@jukofyork changed the title from "MLA + FA now only uses K-cache - 47% saving on KV-cache szie (only for use with #13435 for now)" to "MLA + FA now only uses K-cache - 47% saving on KV-cache size (only for use with #13435 for now)" on May 14, 2025
@pwilkin (Contributor) commented May 14, 2025

Wow, this is pretty huge. Would go a long way towards supporting long contexts on potato devices.

@jukofyork (Collaborator, Author)

You'll need to do something like this to get the merge of the two PRs:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

git fetch origin pull/13435/head:pr-13435
git fetch origin pull/13529/head:pr-13529

git merge pr-13435 --no-edit
git merge pr-13529 --no-edit

(there may be a better way, but I'm pretty dumb when it comes to using git...)

@jukofyork (Collaborator, Author) commented May 14, 2025

Wow, this is pretty huge. Would go a long way towards supporting long contexts on potato devices.

Yeah, it's nuts: I can now get 160k context, with a ubatch of 4096 using a Q6_K model (with non-shared experts stored in RAM), all on a single 32GB RTX 5000 Ada card!

@Panchovix

#13435 got merged, so this could be converted back to a PR now, I think.

@jukofyork (Collaborator, Author)

#13435 got merged, so this could be converted back to a PR now, I think.

It really needs people to test whether the other back-ends can handle a non-contiguous V-cache view like this first, or preferably to do as @JohannesGaessler suggested and make the other back-ends' FA implementations use just the K-cache like he did.

@ggerganov (Member)

#13435 got merged, so this could be converted back to a PR now, I think.

It really needs people to test whether the other back-ends can handle a non-contiguous V-cache view like this first, or preferably to do as @JohannesGaessler suggested and make the other back-ends' FA implementations use just the K-cache like he did.

The CPU implementation does not seem to support it. This command generates junk for me:

make -j && ./bin/llama-cli -m ../models/deepseek-v2-lite-chat/ggml-model-q8_0-mla.gguf -no-cnv -p "I believe the meaning of life is" --top-k 1 -n 32 -fa -dev none

Maybe look into fixing it and adding test-backend-ops tests that verify this use case. It will make it easier to support it in the rest of the backends.

@bartowski1182 (Contributor)

Does this have any speed implications from not storing the values?

@jukofyork (Collaborator, Author)

#13435 got merged, so this could be converted back to a PR now, I think.

It really needs people to test whether the other back-ends can handle a non-contiguous V-cache view like this first, or preferably to do as @JohannesGaessler suggested and make the other back-ends' FA implementations use just the K-cache like he did.

The CPU implementation does not seem to support it. This command generates junk for me:

make -j && ./bin/llama-cli -m ../models/deepseek-v2-lite-chat/ggml-model-q8_0-mla.gguf -no-cnv -p "I believe the meaning of life is" --top-k 1 -n 32 -fa -dev none

Maybe look into fixing it and adding test-backend-ops tests that verify this use case. It will make it easier to support it in the rest of the backends.

Yeah, I think having a non-cont V-cache is likely to break most of the backends.

I'll try and have a look at fixing this and adding the test today or tomorrow.

@jukofyork (Collaborator, Author) commented May 15, 2025

Does this have any speed implications from not storing the values?

If the backends are rewritten to use only the K-cache, then there could be a big performance improvement for some of them, since they no longer have to touch a separate V-cache in memory (e.g. a CPU's L3 cache will only have to hold K-cache elements).

If the backends are instead just fixed to accept the non-contiguous view of the K-cache in place of the V-cache, though, there will likely be some degree of performance reduction.
