
vulkan: optimize flash attention split_k_reduce #14554


Open: wants to merge 2 commits into master

Conversation

jeffbolznv (Collaborator)

Allow FA split_k with smaller KV values (remove the KV >= 512 check).

Optimize the split_k_reduce shader: use more threads to help with large head sizes, and spread the reductions across the workgroup.

These changes mainly help token generation performance, but I've also included pp results to show that prompt processing is unaffected.
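For context on what the reduce step computes (my notation, not taken from the PR's code): each of the `k_num` splits produces a partial output $O_k$ along with running softmax statistics $(M_k, L_k)$, and the reduce step recombines them as

$$
m = \max_k M_k, \qquad L = \sum_k L_k \, e^{M_k - m}, \qquad O = \frac{1}{L} \sum_k e^{M_k - m} \, O_k .
$$

The max/sum over $k$ is shared by all elements of a row, while the weighted sum for $O$ is independent per output element, so both parts can be spread across the workgroup.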

Before (coopmat2):

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -fa 1 -n 128 -p 512 -d 128,8192 -r 10 --prio 1 -m C:\models\bartowski\DeepSeek-Coder-V2-Lite-Instruct-GGUF\DeepSeek-Coder-V2-Lite-Instruct-Q2_K.gguf -m C:\models\bartowski\gemma-2-9b-it-GGUF\gemma-2-9b-it-Q8_0.gguf -m C:\models\lmstudio-community\Llama-3.2-3B-Instruct-GGUF\Llama-3.2-3B-Instruct-Q8_0.gguf -m C:\models\second-state\StarCoder2-15B-GGUF\starcoder2-15b-Q4_0.gguf -m C:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    pp512 @ d128 |    10910.02 ± 454.90 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        177.95 ± 0.67 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      7809.46 ± 90.46 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        152.88 ± 0.38 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    pp512 @ d128 |    15002.51 ± 131.86 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        282.11 ± 2.19 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9753.64 ± 74.98 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        168.05 ± 0.72 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      4639.97 ± 34.81 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        230.99 ± 0.83 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3910.90 ± 18.78 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        164.97 ± 0.50 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      8634.79 ± 67.62 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        101.50 ± 0.11 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      5087.65 ± 18.53 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         84.43 ± 0.21 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    pp512 @ d128 |    22924.61 ± 713.70 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        254.24 ± 1.13 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |    12703.90 ± 256.80 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        204.58 ± 0.56 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      5061.62 ± 27.02 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        103.61 ± 0.28 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3733.55 ± 17.76 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         95.53 ± 0.32 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      5418.40 ± 14.15 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    tg128 @ d128 |         98.53 ± 1.40 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      4113.54 ± 25.42 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         87.67 ± 0.31 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      4115.48 ± 84.06 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        195.17 ± 0.60 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      2406.70 ± 12.31 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        116.14 ± 0.19 |

After (coopmat2):

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -fa 1 -n 128 -p 512 -d 128,8192 -r 10 --prio 1 -m C:\models\bartowski\DeepSeek-Coder-V2-Lite-Instruct-GGUF\DeepSeek-Coder-V2-Lite-Instruct-Q2_K.gguf -m C:\models\bartowski\gemma-2-9b-it-GGUF\gemma-2-9b-it-Q8_0.gguf -m C:\models\lmstudio-community\Llama-3.2-3B-Instruct-GGUF\Llama-3.2-3B-Instruct-Q8_0.gguf -m C:\models\second-state\StarCoder2-15B-GGUF\starcoder2-15b-Q4_0.gguf -m C:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    pp512 @ d128 |     11257.70 ± 64.28 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        190.51 ± 0.59 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      7815.87 ± 69.35 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        159.06 ± 0.35 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    pp512 @ d128 |    14925.39 ± 205.87 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        283.82 ± 1.64 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9680.15 ± 88.27 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        169.90 ± 0.32 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      4651.26 ± 22.18 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        243.24 ± 1.10 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3911.83 ± 16.89 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        169.34 ± 0.51 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    pp512 @ d128 |     8457.32 ± 298.49 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        108.28 ± 0.37 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      5081.15 ± 16.58 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         88.62 ± 0.24 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    pp512 @ d128 |   22647.92 ± 1156.88 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        275.64 ± 2.49 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     12790.08 ± 97.73 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        215.95 ± 0.33 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    pp512 @ d128 |     4999.23 ± 135.73 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        108.72 ± 0.39 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3730.34 ± 15.81 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         98.81 ± 0.31 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      5405.83 ± 26.43 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        104.84 ± 0.29 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      4138.30 ± 18.21 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         90.98 ± 0.54 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      4043.98 ± 49.87 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        224.35 ± 2.05 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       2418.33 ± 9.52 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        183.50 ± 1.05 |

Before (coopmat1):

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -fa 1 -n 128 -p 512 -d 128,8192 -r 10 --prio 1 -m C:\models\bartowski\DeepSeek-Coder-V2-Lite-Instruct-GGUF\DeepSeek-Coder-V2-Lite-Instruct-Q2_K.gguf -m C:\models\bartowski\gemma-2-9b-it-GGUF\gemma-2-9b-it-Q8_0.gguf -m C:\models\lmstudio-community\Llama-3.2-3B-Instruct-GGUF\Llama-3.2-3B-Instruct-Q8_0.gguf -m C:\models\second-state\StarCoder2-15B-GGUF\starcoder2-15b-Q4_0.gguf -m C:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    pp512 @ d128 |     7286.77 ± 217.91 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        173.56 ± 0.78 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3683.48 ± 18.63 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        151.37 ± 0.40 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    pp512 @ d128 |     10053.62 ± 83.22 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        282.49 ± 1.37 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      4492.75 ± 35.10 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        166.22 ± 0.63 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      3011.95 ± 16.18 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        231.99 ± 2.06 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      2378.80 ± 13.63 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        163.12 ± 0.35 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      4951.66 ± 68.56 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    tg128 @ d128 |         92.80 ± 0.25 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       2576.57 ± 7.00 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         81.80 ± 0.35 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    pp512 @ d128 |    11701.85 ± 519.83 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        242.36 ± 0.43 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      5333.92 ± 73.87 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        203.83 ± 0.61 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      3550.81 ± 14.13 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        106.95 ± 0.21 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       1992.32 ± 4.66 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         87.65 ± 0.23 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    pp512 @ d128 |       3257.96 ± 7.12 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    tg128 @ d128 |         96.37 ± 0.40 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       1908.77 ± 4.32 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         87.01 ± 0.12 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |       1173.07 ± 3.04 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        205.06 ± 0.69 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |        154.61 ± 1.19 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        152.32 ± 0.33 |

After (coopmat1):

ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      7314.85 ± 99.18 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        184.84 ± 0.80 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3662.85 ± 27.41 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        154.68 ± 0.69 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    pp512 @ d128 |     10007.94 ± 73.64 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        281.38 ± 1.27 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      4448.88 ± 24.67 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        166.53 ± 0.48 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      2940.64 ± 31.89 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        240.99 ± 0.86 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       2370.65 ± 3.10 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        167.17 ± 0.69 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      4954.26 ± 80.62 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        102.08 ± 0.55 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       2606.14 ± 7.77 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         85.99 ± 0.21 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    pp512 @ d128 |    11824.86 ± 526.25 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        263.39 ± 1.54 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      5407.10 ± 30.49 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        207.88 ± 2.37 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      3543.18 ± 11.70 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        107.20 ± 0.65 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      1972.23 ± 11.04 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         88.51 ± 0.22 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    pp512 @ d128 |       3220.99 ± 9.92 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        100.13 ± 0.35 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       1885.02 ± 3.40 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         87.99 ± 0.30 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      1136.61 ± 12.99 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        221.33 ± 1.35 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |        148.80 ± 4.48 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        163.04 ± 0.59 |

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).
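
The two commit messages above describe the mechanism. Below is a minimal sketch of such a reduction shader, only to illustrate the structure: it is not the shader from this PR, and the buffer layout, names, and the `BLOCK_SIZE` default are assumptions.

```glsl
#version 450

// Workgroup size as a specialization constant (assumed power of two).
layout(constant_id = 0) const uint BLOCK_SIZE = 128;
layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;

layout(push_constant) uniform Params {
    uint HSV;    // output row length (head size of V) -- assumption
    uint k_num;  // number of split_k partitions
} p;

// Assumed layout: per output row, k_num*HSV partial accumulators followed by
// k_num (L, M) pairs.
layout(std430, binding = 0) readonly  buffer SrcBuf { float data_a[]; };
layout(std430, binding = 1) writeonly buffer DstBuf { float data_d[]; };

shared float tmpsh[BLOCK_SIZE];

void main() {
    const uint row = gl_WorkGroupID.x;          // one output row per workgroup
    const uint tid = gl_LocalInvocationID.x;

    const uint o_base  = row * (p.k_num * p.HSV + 2u * p.k_num);
    const uint lm_base = o_base + p.k_num * p.HSV;

    // 1) Whole-workgroup max over the per-split maxima M_k.
    float m = -1.0e30;
    for (uint k = tid; k < p.k_num; k += BLOCK_SIZE) {
        m = max(m, data_a[lm_base + 2u * k + 1u]);
    }
    tmpsh[tid] = m;
    barrier();
    for (uint s = BLOCK_SIZE / 2u; s > 0u; s >>= 1u) {
        if (tid < s) { tmpsh[tid] = max(tmpsh[tid], tmpsh[tid + s]); }
        barrier();
    }
    const float m_max = tmpsh[0];
    barrier();

    // 2) Whole-workgroup sum L = sum_k L_k * exp(M_k - m_max), reusing tmpsh.
    float l = 0.0;
    for (uint k = tid; k < p.k_num; k += BLOCK_SIZE) {
        l += data_a[lm_base + 2u * k] * exp(data_a[lm_base + 2u * k + 1u] - m_max);
    }
    tmpsh[tid] = l;
    barrier();
    for (uint s = BLOCK_SIZE / 2u; s > 0u; s >>= 1u) {
        if (tid < s) { tmpsh[tid] += tmpsh[tid + s]; }
        barrier();
    }
    const float l_total = tmpsh[0];

    // 3) One (strided) thread per element of the HSV dimension: rescale and
    //    sum the partial outputs, then normalize.
    for (uint i = tid; i < p.HSV; i += BLOCK_SIZE) {
        float acc = 0.0;
        for (uint k = 0u; k < p.k_num; ++k) {
            acc += data_a[o_base + k * p.HSV + i] *
                   exp(data_a[lm_base + 2u * k + 1u] - m_max);
        }
        data_d[row * p.HSV + i] = acc / l_total;
    }
}
```

The point of the structure is that the whole workgroup shares the work twice: once over `k_num` for the M/L reduction, and once over the HSV dimension of the output, with one thread per output element.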
jeffbolznv requested a review from 0cc4m on July 6, 2025 at 20:21
github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jul 6, 2025