metal : use FA-vec kernel up to batch size 20 #13496

ggerganov · 2025-05-13T07:59:13Z

With the optimization in #13493, parallel generation, which is typical for multi-user scenarios, is now significantly faster when flash attention (FA) is enabled. The reason is that during text generation, each query in a batch with multiple sequences now mostly attends only to its own tokens, thanks to the improved masking logic.
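To make the masking argument concrete, here is a minimal sketch in Python. It assumes that the saving comes from skipping blocks of the KV cache that are fully masked for a given query. That is an assumption for illustration, not code taken from the Metal kernel. The cache layout, block size, and the helper `blocks_to_visit` are all hypothetical.

```python
# Illustrative only: counts how many KV blocks a decoding query has to visit
# if blocks that are fully masked for that query are skipped.
# This is NOT the Metal kernel; names and layout are hypothetical.

def blocks_to_visit(kv_seq_ids, query_seq_id, block_size):
    """Number of KV blocks holding at least one cell of the query's sequence."""
    visited = 0
    for start in range(0, len(kv_seq_ids), block_size):
        block = kv_seq_ids[start:start + block_size]
        if any(s == query_seq_id for s in block):
            visited += 1
    return visited

# Hypothetical cache: 4 sequences of 8 cells each, stored contiguously.
n_seq, cells_per_seq, block_size = 4, 8, 4
kv_seq_ids = [s for s in range(n_seq) for _ in range(cells_per_seq)]

total_blocks = len(kv_seq_ids) // block_size
per_query = [blocks_to_visit(kv_seq_ids, q, block_size) for q in range(n_seq)]

print("blocks in cache:", total_blocks)
print("blocks visited per query:", per_query)
```

In this toy layout each query touches only the blocks of its own sequence, so the per-query work stays roughly constant as more sequences are added, instead of growing with the total cache size.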

For example, here is a simulation of 1 to 8 parallel requests, each with a different prompt of 8192 tokens. With this PR, the TG speed at 8 parallel requests rises from 55.07 t/s to 129.89 t/s.

make -j && ./bin/llama-batched-bench -m ../models/qwen2.5-7b-coder-instruct/ggml-model-q4_k.gguf -c 65560 -b 2048 -ub 512 -npp 8192 -ntg 32 -npl 1,2,3,4,5,6,7,8 -fa

master

main: n_kv_max = 65792, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
8192	32	1	8224	8.241	994.03	0.498	64.31	8.739	941.09
8192	32	2	16448	16.167	1013.42	0.766	83.51	16.933	971.34
8192	32	3	24672	24.568	1000.34	1.402	68.45	25.970	950.02
8192	32	4	32896	33.256	985.33	2.478	51.65	35.734	920.57
8192	32	5	41120	42.168	971.34	2.980	53.70	45.148	910.78
8192	32	6	49344	51.395	956.36	3.592	53.46	54.987	897.38
8192	32	7	57568	60.852	942.35	4.226	53.00	65.078	884.60
8192	32	8	65792	70.557	928.84	4.649	55.07	75.205	874.83

PR

main: n_kv_max = 65792, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

PP	TG	B	N_KV	T_PP s	S_PP t/s	T_TG s	S_TG t/s	T s	S t/s
8192	32	1	8224	8.230	995.34	0.495	64.68	8.725	942.57
8192	32	2	16448	16.159	1013.92	0.674	95.00	16.833	977.14
8192	32	3	24672	24.580	999.84	1.007	95.38	25.586	964.26
8192	32	4	32896	33.267	985.01	1.095	116.87	34.362	957.33
8192	32	5	41120	42.233	969.85	1.297	123.37	43.530	944.63
8192	32	6	49344	51.464	955.08	1.540	124.70	53.003	930.96
8192	32	7	57568	60.922	941.27	1.774	126.29	62.696	918.21
8192	32	8	65792	74.733	876.94	1.971	129.89	76.704	857.74
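For a quick side-by-side view, the short Python snippet below computes the TG speedup per batch size. The `S_TG t/s` values are copied from the two tables above. The variable names are only for illustration.

```python
# S_TG (t/s) per number of parallel requests B, copied from the tables above.
s_tg_master = {1: 64.31, 2: 83.51, 3: 68.45, 4: 51.65,
               5: 53.70, 6: 53.46, 7: 53.00, 8: 55.07}
s_tg_pr = {1: 64.68, 2: 95.00, 3: 95.38, 4: 116.87,
           5: 123.37, 6: 124.70, 7: 126.29, 8: 129.89}

# Ratio of PR throughput to master throughput for each batch size.
speedup = {b: s_tg_pr[b] / s_tg_master[b] for b in sorted(s_tg_master)}

for b, ratio in speedup.items():
    print(f"B={b}: {s_tg_master[b]:7.2f} -> {s_tg_pr[b]:7.2f} t/s  (x{ratio:.2f})")
```

On these numbers the single-request case is essentially unchanged, while for 4 or more parallel requests the TG speed is more than 2x that of master.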

ggml-ci

ggerganov added 3 commits May 13, 2025 07:55

batched-bench : fix pp batch contents

f078c79

metal : optimize multi-sequence FA vec kernel

fdfc7de

ggml-ci

metal : use FA-vec kernel up to batch size 20

78d7022

ggml-ci

github-actions bot added the ggml and Apple Metal labels May 13, 2025

Base automatically changed from gg/metal-fa-vec-mask-opt to master May 13, 2025 15:04

ggerganov merged commit f0995d2 into master May 13, 2025
50 checks passed
