kv-cache : improve defrag logic #13497
Labels: enhancement, performance, roadmap
Following the optimization in #13493, I realized that the defragmentation can be made much better, which would in turn further improve the Flash Attention masking.
Currently we defrag the following cache like this:
I.e. we only "fill" the holes, but the sequences remain scattered. We can do better like this:
By doing so, the FA-vec masking logic will remain effective even after many generations.
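
To make the difference concrete, here is a toy sketch of the two strategies. It is not the llama.cpp implementation. It assumes a heavily simplified cache in which each cell holds either one sequence id or `None` (empty), and the names `defrag_fill_holes`, `defrag_group_sequences` and `count_runs` are hypothetical helpers introduced only for illustration. The real KV cache also tracks positions and lets a cell belong to several sequences, which this ignores.

```python
from typing import Dict, List, Optional

# Toy model (assumption): a cell is the id of the sequence that owns it,
# or None when the cell is empty.
Cell = Optional[int]


def defrag_fill_holes(cells: List[Cell]) -> List[Cell]:
    """Move used cells from the tail into the holes at the front.

    The cache becomes dense, but sequences stay interleaved.
    """
    out = list(cells)
    lo, hi = 0, len(out) - 1
    while True:
        while lo < hi and out[lo] is not None:  # find the next hole
            lo += 1
        while hi > lo and out[hi] is None:      # find the last used cell
            hi -= 1
        if lo >= hi:
            break
        out[lo], out[hi] = out[hi], None        # move tail cell into the hole
    return out


def defrag_group_sequences(cells: List[Cell]) -> List[Cell]:
    """Pack used cells so that each sequence occupies one contiguous range."""
    used = [c for c in cells if c is not None]
    used.sort()  # stable sort: order within a sequence is preserved
    return used + [None] * (len(cells) - len(used))


def count_runs(cells: List[Cell]) -> Dict[int, int]:
    """Count the contiguous runs each sequence is split into."""
    runs: Dict[int, int] = {}
    prev: Cell = None
    for c in cells:
        if c is not None and c != prev:
            runs[c] = runs.get(c, 0) + 1
        prev = c
    return runs


cache = [0, None, 1, 0, None, 1, 0, 1]
print(defrag_fill_holes(cache), count_runs(defrag_fill_holes(cache)))
print(defrag_group_sequences(cache), count_runs(defrag_group_sequences(cache)))
```

For this input, hole filling yields `[0, 1, 1, 0, 0, 1, None, None]`, where each sequence is still split into two runs, while grouping yields `[0, 0, 0, 1, 1, 1, None, None]` with one run per sequence. The presumed benefit for the FA-vec masking is that a sequence confined to one contiguous range leaves larger fully masked regions of the cache, though how much this helps in practice depends on the kernel.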