
metal : use FA-vec kernel up to batch size 20 #13496


Merged — 3 commits merged into master on May 13, 2025

Conversation

ggerganov
Member

With the optimization in #13493, we now gain a significant improvement in parallel generation, typical of multi-user scenarios, when FA is enabled. The reason is that during text generation, each query in a batch with multiple sequences now mostly attends only to its own tokens, thanks to the improved masking logic.

For example, here is a simulation of 1 to 8 parallel requests, each with a different prompt of 8192 tokens. The TG speed is now much better for multiple requests.

```shell
make -j && ./bin/llama-batched-bench -m ../models/qwen2.5-7b-coder-instruct/ggml-model-q4_k.gguf -c 65560 -b 2048 -ub 512 -npp 8192 -ntg 32 -npl 1,2,3,4,5,6,7,8 -fa
```
  • master

main: n_kv_max = 65792, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 8192 | 32 | 1 | 8224 | 8.241 | 994.03 | 0.498 | 64.31 | 8.739 | 941.09 |
| 8192 | 32 | 2 | 16448 | 16.167 | 1013.42 | 0.766 | 83.51 | 16.933 | 971.34 |
| 8192 | 32 | 3 | 24672 | 24.568 | 1000.34 | 1.402 | 68.45 | 25.970 | 950.02 |
| 8192 | 32 | 4 | 32896 | 33.256 | 985.33 | 2.478 | 51.65 | 35.734 | 920.57 |
| 8192 | 32 | 5 | 41120 | 42.168 | 971.34 | 2.980 | 53.70 | 45.148 | 910.78 |
| 8192 | 32 | 6 | 49344 | 51.395 | 956.36 | 3.592 | 53.46 | 54.987 | 897.38 |
| 8192 | 32 | 7 | 57568 | 60.852 | 942.35 | 4.226 | 53.00 | 65.078 | 884.60 |
| 8192 | 32 | 8 | 65792 | 70.557 | 928.84 | 4.649 | 55.07 | 75.205 | 874.83 |
  • PR

main: n_kv_max = 65792, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 8192 | 32 | 1 | 8224 | 8.230 | 995.34 | 0.495 | 64.68 | 8.725 | 942.57 |
| 8192 | 32 | 2 | 16448 | 16.159 | 1013.92 | 0.674 | 95.00 | 16.833 | 977.14 |
| 8192 | 32 | 3 | 24672 | 24.580 | 999.84 | 1.007 | 95.38 | 25.586 | 964.26 |
| 8192 | 32 | 4 | 32896 | 33.267 | 985.01 | 1.095 | 116.87 | 34.362 | 957.33 |
| 8192 | 32 | 5 | 41120 | 42.233 | 969.85 | 1.297 | 123.37 | 43.530 | 944.63 |
| 8192 | 32 | 6 | 49344 | 51.464 | 955.08 | 1.540 | 124.70 | 53.003 | 930.96 |
| 8192 | 32 | 7 | 57568 | 60.922 | 941.27 | 1.774 | 126.29 | 62.696 | 918.21 |
| 8192 | 32 | 8 | 65792 | 74.733 | 876.94 | 1.971 | 129.89 | 76.704 | 857.74 |

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal labels on May 13, 2025
Base automatically changed from gg/metal-fa-vec-mask-opt to master May 13, 2025 15:04
@ggerganov ggerganov merged commit f0995d2 into master May 13, 2025
50 checks passed