MLA + FA now only uses K-cache - 47% saving on KV-cache size (only for use with #13435 for now) #13529
Conversation
Wow, this is pretty huge. Would go a long way towards supporting long contexts on potato devices. |
You'll need to use something like this to get the merge of the two PRs:

```sh
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/13435/head:pr-13435
git fetch origin pull/13529/head:pr-13529
git merge pr-13435 --no-edit
git merge pr-13529 --no-edit
```

(there may be a better way, but I'm pretty dumb when it comes to using git) |
Yeah, it's nuts: I can now get 160k context, with a |
#13435 got merged, so this could be converted from a draft back to a regular PR now, I think. |
It really needs people to test whether the other backends can handle a non-contiguous V-cache view like this first, or preferably to do as @JohannesGaessler suggested and make the other backends' FA implementations use just the K-cache, like he did. |
The CPU implementation does not seem to support it. This command generates junk for me:
Maybe look into fixing it and adding a test. |
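For context, a minimal, purely illustrative sketch (not actual llama.cpp code; the function name is made up) of the kind of guard a backend's FA dispatch could use so that an unsupported strided V view is rejected or routed to a fallback instead of silently producing junk:

```c
#include "ggml.h"

// Illustrative only: a support check a backend's flash-attention dispatch
// could run before taking its fast path. With this PR, V is a strided view
// into the K-cache, so any kernel that assumes tightly packed V rows
// (row stride == ne0 * element size) will read the wrong data.
static bool fa_fast_path_supports(const struct ggml_tensor * v) {
    if (!ggml_is_contiguous(v)) {
        // fall back to a stride-aware path (or reject the op) here
        return false;
    }
    return true;
}
```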
Does this have any speed implications from not storing the values? |
Yeah, I think having a non-contiguous V-cache is likely to break most of the backends. I'll try to have a look at fixing this and adding the test today or tomorrow. |
If the backends are rewritten to use only the K-cache, it could be a big performance improvement for some of them, since the V-cache never has to be read from memory (e.g. a CPU's L3 cache only has to hold K-cache elements). If the backends are just fixed to accept the non-contiguous view of the K-cache in place of the V-cache, there will likely be some degree of performance reduction, though. |
From #13435:
I've only tested this to work with #13435 for now, but it should still work with the other backends' flash attention implementations so long as they don't assume that the V-cache they are passed is contiguous:
The full context of 160k tokens now takes up less than 11GB! 😮
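As a rough back-of-the-envelope for where the headline ~47% comes from (assuming a DeepSeek-style MLA layout in which each cached K row is 576 elements, i.e. 512 compressed-KV plus 64 RoPE elements, and the V row it replaces is the 512-element part; the 512 figure appears later in this thread, the rest is an assumption):

$$\frac{512}{576 + 512} = \frac{512}{1088} \approx 0.47$$

so dropping the separate V-cache removes roughly 47% of the per-token KV-cache footprint.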
I've just disabled context shifting for now; as I said in the other post, I'm not at all confident I can cleanly change everything required to deal with the empty V-cache:
I will leave it as a draft for now, as the other backends' flash attention implementations need to either:
A. Check they can work with the non-contiguous V-cache passed to them.
B. Copy @JohannesGaessler's strategy of only using the last 512 elements of the K-cache in place of the V-cache.
I think (B) is preferable, as there are likely to be significant gains possible regarding CPU cache (re-)use, etc.
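To make the (A)/(B) trade-off concrete, here is a hedged sketch (illustrative names and shapes, not the actual llama.cpp graph code) of how a V tensor can be exposed as a non-contiguous view into the K-cache with ggml; whether the reused 512 elements sit at the start or the end of each K row depends on the real cache layout:

```c
#include "ggml.h"

// Sketch only: expose 512 of the 576 elements of each cached K row as "V",
// without storing a separate V-cache. The resulting view is non-contiguous
// (its row stride still spans the full 576-element K row), which is exactly
// what an FA kernel has to cope with under option (A); option (B) avoids the
// view entirely by reading those elements straight out of the K-cache.
static struct ggml_tensor * v_view_from_k_cache(
        struct ggml_context * ctx,
        struct ggml_tensor  * k_cache,       // assumed shape: [576, n_kv, n_head_kv]
        int64_t               n_kv,
        int64_t               n_head_kv,
        int64_t               v_off_elems) { // e.g. 0 or 64, depending on row layout
    const int64_t n_embd_v = 512;

    return ggml_view_3d(ctx, k_cache,
            n_embd_v, n_kv, n_head_kv,
            k_cache->nb[1],   // keep the K-cache row stride -> non-contiguous view
            k_cache->nb[2],
            v_off_elems*ggml_element_size(k_cache));
}
```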