vulkan: optimizations for deepseek prompt processing #14555

jeffbolznv · 2025-07-06T22:36:19Z

Some optimizations for mul_mat_id, and flash attention with large head size. See commit messages for more detail.

before:

coopmat2:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -fa 1 -n 128 -p 512 -d 512,8192 --prio 1 -r 5
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d512 |      4061.98 ± 17.67 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d512 |        210.15 ± 0.88 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      2392.49 ± 16.93 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        116.90 ± 0.26 |

coopmat1:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -fa 1 -n 128 -p 512 -d 512,8192 --prio 1 -r 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d512 |       1013.83 ± 0.00 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d512 |        203.44 ± 0.00 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |        159.70 ± 0.00 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        151.71 ± 0.00 |

scalar:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -fa 1 -n 128 -p 512 -d 512,8192 --prio 1 -r 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d512 |        868.16 ± 0.00 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d512 |        202.75 ± 0.00 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |        161.14 ± 0.00 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        151.54 ± 0.00 |

after:

coopmat2:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -fa 1 -n 128 -p 512 -d 512,8192 --prio 1 -r 5
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d512 |      6419.30 ± 17.34 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d512 |        209.36 ± 0.86 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3606.39 ± 10.05 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        116.51 ± 0.28 |

coopmat1:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -fa 1 -n 128 -p 512 -d 512,8192 --prio 1 -r 5
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d512 |      5412.19 ± 58.07 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d512 |        202.60 ± 0.77 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       1630.19 ± 3.38 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        151.86 ± 0.46 |

scalar:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf -fa 1 -n 128 -p 512 -d 512,8192 --prio 1 -r 5
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d512 |       1833.69 ± 7.71 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d512 |        204.14 ± 0.45 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       1034.34 ± 1.45 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        151.88 ± 0.27 |
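
For a quick read of the tables, the pp512 speedups can be computed directly from the numbers above. This is only a sketch: the names `PP_RESULTS`, `speedup`, and `SPEEDUPS` are illustrative and not part of llama-bench. Note that the "before" coopmat1 and scalar runs used `-r 1`, so those ratios rest on a single repetition. The tg128 rows are left out because they differ by only about 1% between before and after.

```python
# Prompt-processing (pp512) throughput in t/s, copied from the tables above.
# Keys are (path, test); values are (before, after).
PP_RESULTS = {
    ("coopmat2", "pp512 @ d512"): (4061.98, 6419.30),
    ("coopmat2", "pp512 @ d8192"): (2392.49, 3606.39),
    ("coopmat1", "pp512 @ d512"): (1013.83, 5412.19),
    ("coopmat1", "pp512 @ d8192"): (159.70, 1630.19),
    ("scalar", "pp512 @ d512"): (868.16, 1833.69),
    ("scalar", "pp512 @ d8192"): (161.14, 1034.34),
}


def speedup(before: float, after: float) -> float:
    """Return the throughput ratio after/before (values above 1 mean faster)."""
    return after / before


# Speedup per (path, test) pair, using only the mean t/s values.
SPEEDUPS = {key: speedup(*pair) for key, pair in PP_RESULTS.items()}

if __name__ == "__main__":
    for (path, test), ratio in SPEEDUPS.items():
        print(f"{path:9s} {test:14s} {ratio:5.2f}x")
```

The ratios ignore the reported ± spread, so they are best treated as rough indicators rather than precise measurements.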


jeffbolznv added 4 commits July 6, 2025 16:57

vulkan: allow unclamped loads in coopmat2 mul_mat_id shader

129a0f1

vulkan: increase coopmat2 mul_mat_id tile size

b54ddba

vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path

2b54086

vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)

bd8e0bf

jeffbolznv requested a review from 0cc4m July 6, 2025 22:36

github-actions bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Jul 6, 2025
