CUDA: FA support for Deepseek (Ampere or newer) #13306
Conversation
Just tested on DeepSeek V3 0324 (Q2_K_XL) and it works fine, so you can use MLA + FA. I'm offloading ~110GB of the 255GB model to CPU RAM and the rest to GPU, and this saves a lot of GPU memory. I get a bit less PP, I guess because the CPU is slower with FA, but faster generation speed. Loading on a PC with a Ryzen 7 7800X3D, 192GB RAM at 6000 MHz, Fedora 42.
When not using -fa
When using -fa
But if we move the regex a bit to use more tensors (as now we can use more because the buffers weigh a lot less),
I get
Which I still think has room for improvement, as some GPUs have >4GB left, but it works as a quick test. Also, I think it is using a slower GPU for PP (saturated at PCI-E 4.0 X8) instead of my faster GPU (at PCI-E 5.0 X8). I will check if I can change the GPU that does PP.
I see lower performance with FA enabled at low contexts, but it improves with larger contexts.
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
build: d19838e (5276)
Changed the device that processes PP and speeds are pretty good. This is with DeepSeek V3 0324 UD_Q2_K_XL.
It seems to saturate X8 5.0 (26-27 GiB/s) but not X16 5.0 (tops out at about 28-29 GiB/s), so I guess there is a limitation somewhere. Hope this can be merged, as some of the latest updates are pretty good as well.
Force-pushed from 64fcbe0 to f21270f
Force-pushed from f21270f to 187054a
I was testing the new config options for Gemma and noticed that for non-Deepseek models the kernel in this PR was 5-10% slower than the one on master (meaning the runtime of the kernel itself, not end-to-end performance). I dug around a bit and, as it turns out, the CUDA compiler is unable to actually unroll the loops for loading KV data. I unrolled the loops manually, which ends up being ugly, but I don't know how else to do it. There are similar loops for loading Q and storing VKQ that could in principle be given the same treatment, but there is no measurable difference to the kernel runtime: those loops are executed once each, not once per 32/64 tokens.
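To illustrate the pattern (a simplified sketch, not the actual code in fattn-mma-f16.cuh; the names and tile sizes are made up):

```cuda
// Simplified sketch of manual unrolling: the loop body is moved into a helper
// templated on the iteration index, and the "loop" is written out as one call
// per iteration so that every index is a compile-time constant.
#include <cuda_fp16.h>

template <int k0>
__device__ __forceinline__ void load_kv_chunk(const half2 * __restrict__ KV, half2 * __restrict__ tile) {
    // With k0 known at compile time, the addressing below is fully resolved by the compiler.
    tile[k0*blockDim.x + threadIdx.x] = KV[k0*blockDim.x + threadIdx.x];
}

template <int n_chunks>
__device__ __forceinline__ void load_kv(const half2 * __restrict__ KV, half2 * __restrict__ tile) {
    // Manual unrolling: one explicit call per chunk instead of a runtime loop.
    static_assert(n_chunks >= 1 && n_chunks <= 4, "extend the calls below for larger tiles");
    if constexpr (n_chunks > 0) load_kv_chunk<0>(KV, tile);
    if constexpr (n_chunks > 1) load_kv_chunk<1>(KV, tile);
    if constexpr (n_chunks > 2) load_kv_chunk<2>(KV, tile);
    if constexpr (n_chunks > 3) load_kv_chunk<3>(KV, tile);
}
```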
Just an extra comment (sorry for so many!), but this PR is huge for PP performance if you increase ubatch, thanks to the VRAM saved by the smaller buffers. As I posted above, with the default ubatch of 512, PP is 66 t/s. With ubatch 1024
With ubatch 1536
This is a 25.7% increase over -ub 1024, a 92.4% increase over -ub 512, and a 225% increase over -ub 512 on PCI-E 4.0 X8. This wouldn't be possible without this commit, so really, really thanks! EDIT: Improved it a little more now.
This is by using 2 fewer layers on GPU but increasing -ub to 2048 and -b to 2560. Just impressive.
This is a very useful PR for my setup where I have all but the non-shared experts in VRAM:
It would be very "MLA-specific", but it's worth noting that the upper 512 elements of each K are the same as V (the first 64 elements hold the RoPE-ed values, and it must be the first 64 to work with the existing context shifting code):
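For illustration, a minimal sketch of how V could be taken as a view into the K cache under that layout (tensor names, head sizes, and the cache layout here are assumptions based on the description above, not code from this PR):

```cpp
#include "ggml.h"

// Assumed layout per K row: [64 RoPE-ed values | 512 latent values], where the
// 512 latent values are identical to V. Under that assumption, V could be a
// strided view into the K cache instead of a separately stored tensor.
static struct ggml_tensor * v_as_view_of_k(
        struct ggml_context * ctx, struct ggml_tensor * k_cache, int64_t n_kv, int64_t n_head_kv) {
    const size_t offset = 64 * ggml_element_size(k_cache); // skip the RoPE-ed part of each row

    return ggml_view_3d(ctx, k_cache,
            512, n_kv, n_head_kv, // 512 shared elements per row
            k_cache->nb[1],       // row stride stays that of the 576-element K rows
            k_cache->nb[2],
            offset);
}
```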
@Panchovix Thanks for your post on Reddit! It was only after reading that that I also tried increasing the ubatch size.
Thank you, I misremembered how the
@JohannesGaessler Is the drop in performance with Deepseek Lite at low contexts expected?
It's expected in the sense that I had to make a lot of tradeoffs to get the kernel to work at all. Compared to e.g. LLaMA with a head size of 128, the FA kernel for Deepseek just isn't as performant, so it only becomes faster once the non-FA implementation becomes slow enough. There's probably still optimization headroom, but I don't know how much until I try.
Actually, there is a reason why the FA kernel performs poorly for Deepseek in particular at short contexts: I forgot that with the MLA implementation Deepseek effectively has only a single KV head. The granularity with which KV slices are assigned to an SM is 256 tokens, so there just aren't enough tasks to get good GPU utilization on e.g. an RTX 3090 Ti with 84 SMs.
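A back-of-the-envelope illustration of that utilization argument (the context sizes are just examples, not measurements):

```cpp
// Rough illustration: with one KV head and KV slices of 256 tokens assigned
// per SM, short contexts simply do not produce enough independent work units
// to keep the 84 SMs of an RTX 3090 Ti busy.
#include <cstdio>

int main() {
    const int n_kv_heads = 1;   // Deepseek with MLA: effectively a single KV head
    const int slice_size = 256; // granularity of KV-slice assignment to an SM
    const int n_sm       = 84;  // RTX 3090 Ti

    for (int n_ctx : {1024, 4096, 16384, 65536}) {
        const int work_units = n_kv_heads * ((n_ctx + slice_size - 1) / slice_size);
        const double busy = work_units < n_sm ? 100.0 * work_units / n_sm : 100.0;
        printf("n_ctx = %6d -> %4d work units for %d SMs (at most %.0f%% of SMs busy)\n",
               n_ctx, work_units, n_sm, busy);
    }
    return 0;
}
```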
// Converts a generic pointer to a shared-memory address for use with cp.async;
// falls back to a stub that triggers NO_DEVICE_CODE when cp.async is unavailable.
static __device__ __forceinline__ unsigned int ggml_cuda_cvta_generic_to_shared(void * generic_ptr) {
#ifdef CP_ASYNC_AVAILABLE
    return __cvta_generic_to_shared(generic_ptr);
#else
    GGML_UNUSED(generic_ptr);
    NO_DEVICE_CODE;
    return 0;
#endif // CP_ASYNC_AVAILABLE
}
Since there is no fallback, why not avoid compiling the kernels that need this intrinsic in the first place?
In terms of development it's more convenient for me if potential breakage is encapsulated in an API such as this. That way, if I need to do a git bisect of my WIP commits later on there is less risk of having to deal with code that doesn't compile on specific hardware.
ggml/src/ggml-cuda/fattn-mma-f16.cuh
// The compiler is unable to unroll loops with the k0_start == k0_stop condition.
// Therefore, write functions for the loop iterations and unroll the loops manually.
You could avoid some code duplication using the `Unroll` template from the AMX implementation at ggml/src/ggml-cpu/amx/mmq.cpp, line 39 in f05a6d7 (`// Forced unrolling`).
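For reference, such a forced-unroll helper looks roughly like this (a sketch modeled on the AMX `Unroll` template; the exact code in mmq.cpp may differ slightly):

```cpp
#include <type_traits>

// Sketch of a forced-unroll helper in the spirit of the AMX Unroll template:
// calls f(std::integral_constant<int, 0>{}, args...) up to f(<n-1>, args...),
// so the loop index is a compile-time constant inside f.
template <int n>
struct Unroll {
    template <typename Func, typename... Args>
    inline void operator()(const Func & f, Args... args) const {
        Unroll<n - 1>{}(f, args...);
        f(std::integral_constant<int, n - 1>{}, args...);
    }
};

template <>
struct Unroll<1> {
    template <typename Func, typename... Args>
    inline void operator()(const Func & f, Args... args) const {
        f(std::integral_constant<int, 0>{}, args...);
    }
};

// Usage: the `auto` parameter receives the integral_constant, so the index can
// be used where a constant expression is required:
//   Unroll<4>{}([&](auto i) {
//       constexpr int k = decltype(i)::value;
//       load_kv_chunk<k>(KV, tile);
//   });
```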
Thank you, this is a good solution. For this to work in device code, CUDA needs to be compiled with the flag `-extended-lambda`. The flag was added with CUDA 8 and should be unproblematic; HIP and MUSA seem to work without modification.
The `integral_constant` used to pass the loop index and the `auto` parameter of the lambda are important to ensure that the argument is `constexpr`; otherwise you are still relying on the compiler to remove the parameter.
In this particular case it seems like the compiler can still do the correct optimizations - I'll include a fix the next time I make a CUDA PR.
Again, I want to mention how awesome this PR is. It let me load DeepSeek V3 0324 Q3_K_XL (3.53 BPW) with 64K context on a consumer PC with 192GB RAM + 128GB VRAM. The first generation is slow, but from then on, subsequent messages work fine.
If tests are needed for a merge, I can test!
I'm merging the PR as-is; any potential performance optimizations I'll do in a follow-up PR.
* origin/master: (39 commits)
  server : vision support via libmtmd (ggml-org#12898)
  sycl : implementation of reordered Q4_0 MMVQ for Intel GPUs (ggml-org#12858)
  metal : optimize MoE for large batches (ggml-org#13388)
  CUDA: FA support for Deepseek (Ampere or newer) (ggml-org#13306)
  llama : do not crash if there is no CPU backend (ggml-org#13395)
  CUDA: fix crash on large batch size for MoE models (ggml-org#13384)
  imatrix : Add --parse-special for enabling parsing of special tokens in imatrix calculation (ggml-org#13389)
  llama-run: add support for downloading models from ModelScope (ggml-org#13370)
  mtmd : fix batch_view for m-rope (ggml-org#13397)
  llama : one-off chat template fix for Mistral-Small-2503 (ggml-org#13398)
  rpc : add rpc_msg_set_tensor_hash_req (ggml-org#13353)
  vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
  server : (webui) rename has_multimodal --> modalities (ggml-org#13393)
  ci : limit write permission to only the release step + fixes (ggml-org#13392)
  mtmd : Expose helper_decode_image_chunk (ggml-org#13366)
  server : (webui) fix a very small misalignment (ggml-org#13387)
  server : (webui) revamp the input area, plus many small UI improvements (ggml-org#13365)
  convert : support rope_scaling type and rope_type (ggml-org#13349)
  mtmd : fix the calculation of n_tokens for smolvlm (ggml-org#13381)
  context : allow cache-less context for embeddings (ggml-org#13108)
  ...
Sadly, with this PR it seems Flash Attention has been broken on Turing: the model outputs gibberish (tested with Qwen 3 MoE) with partial offloading, and probably with full offloading as well (I can't test that with this model).
Output of b5299 with ./llama-cli -m "Qwen3-30B-A3B-UD-Q4_K_XL.gguf" -fa -ngl 10 -p "Hello /no_think"
Output of b5331 with -fa and -ngl 10
Output of b5331 without -fa
Should be fixed by #13415.
This PR adds FlashAttention CUDA support for Deepseek models. The tensor permutation in `llama-graph.cpp` is adjusted to trigger a more efficient CUDA code path. In principle, something like this could also be done as an automatic optimization in either the CUDA backend or the compute graph. It may also make sense to add some function like `ggml_permute_data_layout` that directly permutes the `ne` and `nb` values of a tensor. As long as all backends support non-contiguous outputs, that would save you from having to add `ggml_cont`.
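For context, a hypothetical sketch of what such a function might look like (`ggml_permute_data_layout` does not exist in ggml; the axis convention is assumed to follow `ggml_permute`):

```cpp
#include "ggml.h"

// Hypothetical helper as suggested above: instead of creating a permuted view
// (ggml_permute) and then materializing it with ggml_cont, this would rewrite
// the ne/nb arrays of the tensor in place so that it directly reports the
// permuted data layout.
static void ggml_permute_data_layout_sketch(struct ggml_tensor * t, int axis0, int axis1, int axis2, int axis3) {
    const int axes[GGML_MAX_DIMS] = { axis0, axis1, axis2, axis3 };

    int64_t ne[GGML_MAX_DIMS];
    size_t  nb[GGML_MAX_DIMS];
    for (int i = 0; i < GGML_MAX_DIMS; ++i) {
        ne[axes[i]] = t->ne[i]; // dimension i of the input ends up at position axes[i]
        nb[axes[i]] = t->nb[i];
    }
    for (int i = 0; i < GGML_MAX_DIMS; ++i) {
        t->ne[i] = ne[i];
        t->nb[i] = nb[i];
    }
}
```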
Performance changes