
vulkan: optimize flash attention split_k_reduce #14554


Open: wants to merge 2 commits into master

Conversation

jeffbolznv (Collaborator)

Allow FA split_k with smaller KV values (remove the KV >= 512 check).

Optimize the split_k_reduce shader: use more threads to help with large head sizes, and spread the reductions across the workgroup.

These changes mainly help token generation performance, but I've also included pp results to show that prompt processing is unaffected.
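For context on what the reduce step computes (my notation, not taken from the PR's code): each of the `k_num` splits produces a partial output $O_k$ along with running softmax statistics $(M_k, L_k)$, and the reduce step recombines them as

$$
m = \max_k M_k, \qquad L = \sum_k L_k \, e^{M_k - m}, \qquad O = \frac{1}{L} \sum_k e^{M_k - m} \, O_k .
$$

The max/sum over $k$ is shared by all elements of a row, while the weighted sum for $O$ is independent per output element, so both parts can be spread across the workgroup.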

Before (coopmat2):

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -fa 1 -n 128 -p 512 -d 128,8192 -r 10 --prio 1 -m C:\models\bartowski\DeepSeek-Coder-V2-Lite-Instruct-GGUF\DeepSeek-Coder-V2-Lite-Instruct-Q2_K.gguf -m C:\models\bartowski\gemma-2-9b-it-GGUF\gemma-2-9b-it-Q8_0.gguf -m C:\models\lmstudio-community\Llama-3.2-3B-Instruct-GGUF\Llama-3.2-3B-Instruct-Q8_0.gguf -m C:\models\second-state\StarCoder2-15B-GGUF\starcoder2-15b-Q4_0.gguf -m C:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    pp512 @ d128 |    10910.02 ± 454.90 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        177.95 ± 0.67 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      7809.46 ± 90.46 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        152.88 ± 0.38 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    pp512 @ d128 |    15002.51 ± 131.86 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        282.11 ± 2.19 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9753.64 ± 74.98 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        168.05 ± 0.72 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      4639.97 ± 34.81 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        230.99 ± 0.83 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3910.90 ± 18.78 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        164.97 ± 0.50 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      8634.79 ± 67.62 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        101.50 ± 0.11 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      5087.65 ± 18.53 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         84.43 ± 0.21 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    pp512 @ d128 |    22924.61 ± 713.70 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        254.24 ± 1.13 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |    12703.90 ± 256.80 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        204.58 ± 0.56 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      5061.62 ± 27.02 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        103.61 ± 0.28 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3733.55 ± 17.76 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         95.53 ± 0.32 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      5418.40 ± 14.15 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    tg128 @ d128 |         98.53 ± 1.40 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      4113.54 ± 25.42 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         87.67 ± 0.31 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      4115.48 ± 84.06 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        195.17 ± 0.60 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      2406.70 ± 12.31 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        116.14 ± 0.19 |

After (coopmat2):

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -fa 1 -n 128 -p 512 -d 128,8192 -r 10 --prio 1 -m C:\models\bartowski\DeepSeek-Coder-V2-Lite-Instruct-GGUF\DeepSeek-Coder-V2-Lite-Instruct-Q2_K.gguf -m C:\models\bartowski\gemma-2-9b-it-GGUF\gemma-2-9b-it-Q8_0.gguf -m C:\models\lmstudio-community\Llama-3.2-3B-Instruct-GGUF\Llama-3.2-3B-Instruct-Q8_0.gguf -m C:\models\second-state\StarCoder2-15B-GGUF\starcoder2-15b-Q4_0.gguf -m C:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    pp512 @ d128 |     11257.70 ± 64.28 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        190.51 ± 0.59 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      7815.87 ± 69.35 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        159.06 ± 0.35 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    pp512 @ d128 |    14925.39 ± 205.87 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        283.82 ± 1.64 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      9680.15 ± 88.27 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        169.90 ± 0.32 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      4651.26 ± 22.18 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        243.24 ± 1.10 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3911.83 ± 16.89 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        169.34 ± 0.51 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    pp512 @ d128 |     8457.32 ± 298.49 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        108.28 ± 0.37 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      5081.15 ± 16.58 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         88.62 ± 0.24 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    pp512 @ d128 |   22647.92 ± 1156.88 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        275.64 ± 2.49 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |     12790.08 ± 97.73 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        215.95 ± 0.33 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    pp512 @ d128 |     4999.23 ± 135.73 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        108.72 ± 0.39 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3730.34 ± 15.81 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         98.81 ± 0.31 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      5405.83 ± 26.43 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        104.84 ± 0.29 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      4138.30 ± 18.21 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         90.98 ± 0.54 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      4043.98 ± 49.87 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        224.35 ± 2.05 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       2418.33 ± 9.52 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        183.50 ± 1.05 |

Before (coopmat1):

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -m C:\models\Phi-3-mini-4k-instruct-q4.gguf -fa 1 -n 128 -p 512 -d 128,8192 -r 10 --prio 1 -m C:\models\bartowski\DeepSeek-Coder-V2-Lite-Instruct-GGUF\DeepSeek-Coder-V2-Lite-Instruct-Q2_K.gguf -m C:\models\bartowski\gemma-2-9b-it-GGUF\gemma-2-9b-it-Q8_0.gguf -m C:\models\lmstudio-community\Llama-3.2-3B-Instruct-GGUF\Llama-3.2-3B-Instruct-Q8_0.gguf -m C:\models\second-state\StarCoder2-15B-GGUF\starcoder2-15b-Q4_0.gguf -m C:\models\DeepSeek-R1-Distill-Qwen-14B-Q4_K_M.gguf -m C:\models\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    pp512 @ d128 |     7286.77 ± 217.91 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        173.56 ± 0.78 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3683.48 ± 18.63 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        151.37 ± 0.40 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    pp512 @ d128 |     10053.62 ± 83.22 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        282.49 ± 1.37 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      4492.75 ± 35.10 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        166.22 ± 0.63 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      3011.95 ± 16.18 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        231.99 ± 2.06 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      2378.80 ± 13.63 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        163.12 ± 0.35 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      4951.66 ± 68.56 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    tg128 @ d128 |         92.80 ± 0.25 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       2576.57 ± 7.00 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         81.80 ± 0.35 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    pp512 @ d128 |    11701.85 ± 519.83 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        242.36 ± 0.43 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      5333.92 ± 73.87 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        203.83 ± 0.61 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      3550.81 ± 14.13 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        106.95 ± 0.21 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       1992.32 ± 4.66 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         87.65 ± 0.23 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    pp512 @ d128 |       3257.96 ± 7.12 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    tg128 @ d128 |         96.37 ± 0.40 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       1908.77 ± 4.32 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         87.01 ± 0.12 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |       1173.07 ± 3.04 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        205.06 ± 0.69 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |        154.61 ± 1.19 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        152.32 ± 0.33 |

After (coopmat1):

ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      7314.85 ± 99.18 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        184.84 ± 0.80 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      3662.85 ± 27.41 |
| llama 8B Q4_0                  |   5.61 GiB |     8.03 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        154.68 ± 0.69 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    pp512 @ d128 |     10007.94 ± 73.64 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        281.38 ± 1.27 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      4448.88 ± 24.67 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        166.53 ± 0.48 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      2940.64 ± 31.89 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        240.99 ± 0.86 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       2370.65 ± 3.10 |
| deepseek2 16B Q2_K - Medium    |   5.99 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        167.17 ± 0.69 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      4954.26 ± 80.62 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        102.08 ± 0.55 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       2606.14 ± 7.77 |
| gemma2 9B Q8_0                 |   9.15 GiB |     9.24 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         85.99 ± 0.21 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    pp512 @ d128 |    11824.86 ± 526.25 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        263.39 ± 1.54 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      5407.10 ± 30.49 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        207.88 ± 2.37 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      3543.18 ± 11.70 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        107.20 ± 0.65 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |      1972.23 ± 11.04 |
| starcoder2 15B Q4_0            |   8.44 GiB |    15.96 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         88.51 ± 0.22 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    pp512 @ d128 |       3220.99 ± 9.92 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        100.13 ± 0.35 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |       1885.02 ± 3.40 |
| qwen2 14B Q4_K - Medium        |   8.37 GiB |    14.77 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |         87.99 ± 0.30 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    pp512 @ d128 |      1136.61 ± 12.99 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |    tg128 @ d128 |        221.33 ± 1.35 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   pp512 @ d8192 |        148.80 ± 4.48 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |   tg128 @ d8192 |        163.04 ± 0.59 |

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).
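
The two commit messages above describe the mechanism. Below is a minimal sketch of such a reduction shader, only to illustrate the structure: it is not the shader from this PR, and the buffer layout, names, and the `BLOCK_SIZE` default are assumptions.

```glsl
#version 450

// Workgroup size as a specialization constant (assumed power of two).
layout(constant_id = 0) const uint BLOCK_SIZE = 128;
layout(local_size_x_id = 0, local_size_y = 1, local_size_z = 1) in;

layout(push_constant) uniform Params {
    uint HSV;    // output row length (head size of V) -- assumption
    uint k_num;  // number of split_k partitions
} p;

// Assumed layout: per output row, k_num*HSV partial accumulators followed by
// k_num (L, M) pairs.
layout(std430, binding = 0) readonly  buffer SrcBuf { float data_a[]; };
layout(std430, binding = 1) writeonly buffer DstBuf { float data_d[]; };

shared float tmpsh[BLOCK_SIZE];

void main() {
    const uint row = gl_WorkGroupID.x;          // one output row per workgroup
    const uint tid = gl_LocalInvocationID.x;

    const uint o_base  = row * (p.k_num * p.HSV + 2u * p.k_num);
    const uint lm_base = o_base + p.k_num * p.HSV;

    // 1) Whole-workgroup max over the per-split maxima M_k.
    float m = -1.0e30;
    for (uint k = tid; k < p.k_num; k += BLOCK_SIZE) {
        m = max(m, data_a[lm_base + 2u * k + 1u]);
    }
    tmpsh[tid] = m;
    barrier();
    for (uint s = BLOCK_SIZE / 2u; s > 0u; s >>= 1u) {
        if (tid < s) { tmpsh[tid] = max(tmpsh[tid], tmpsh[tid + s]); }
        barrier();
    }
    const float m_max = tmpsh[0];
    barrier();

    // 2) Whole-workgroup sum L = sum_k L_k * exp(M_k - m_max), reusing tmpsh.
    float l = 0.0;
    for (uint k = tid; k < p.k_num; k += BLOCK_SIZE) {
        l += data_a[lm_base + 2u * k] * exp(data_a[lm_base + 2u * k + 1u] - m_max);
    }
    tmpsh[tid] = l;
    barrier();
    for (uint s = BLOCK_SIZE / 2u; s > 0u; s >>= 1u) {
        if (tid < s) { tmpsh[tid] += tmpsh[tid + s]; }
        barrier();
    }
    const float l_total = tmpsh[0];

    // 3) One (strided) thread per element of the HSV dimension: rescale and
    //    sum the partial outputs, then normalize.
    for (uint i = tid; i < p.HSV; i += BLOCK_SIZE) {
        float acc = 0.0;
        for (uint k = 0u; k < p.k_num; ++k) {
            acc += data_a[o_base + k * p.HSV + i] *
                   exp(data_a[lm_base + 2u * k + 1u] - m_max);
        }
        data_d[row * p.HSV + i] = acc / l_total;
    }
}
```

The point of the structure is that the whole workgroup shares the work twice: once over `k_num` for the M/L reduction, and once over the HSV dimension of the output, with one thread per output element.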
jeffbolznv requested a review from 0cc4m on July 6, 2025 at 20:21
github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jul 6, 2025