
Commit f86643c

add fa3_mtp branch
1 parent 0e8b7bb commit f86643c

1 file changed: +1 −1 lines changed

_posts/2025-09-01-mtp.md renamed to _posts/2025-09-04-mtp.md

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ Due to our unique inference pattern, during the decode phase, adjacent sequences
 <p style="font-family: sans-serif; font-size: 0.9em; color: #555;">During the decode phase, both sequences utilize KV cache for tokens t1, t2, t3, t4. The first sequence uses the first three caches, while the second sequence uses all four caches.</p>
 </div>
 
-When using standard attention operators, each sequence is computed independently, causing the same KV cache to be loaded repeatedly, resulting in significant waste. To eliminate this inefficiency and fully leverage the performance advantages of our MTP approach, we developed a custom MTP operator based on Flash Attention v3: [fa3_mtp](https://github.com/ModelTC/LightKernel/tree/main/flash-attention/hopper). This operator combines the queries (Q) of a group of sequences into a unified computation. During the $QK^T$ computation (where $Q$ is the query matrix and $K^T$ is the transpose of the key matrix), it dynamically sets the mask for the $Score$ matrix by calculating the seq_len corresponding to each q row.
+When using standard attention operators, each sequence is computed independently, causing the same KV cache to be loaded repeatedly, resulting in significant waste. To eliminate this inefficiency and fully leverage the performance advantages of our MTP approach, we developed a custom MTP operator based on Flash Attention v3: [fa3_mtp](https://github.com/ModelTC/LightKernel/tree/main/flash-attention/hopper), and you can use it in lightllm's [fa3_mtp branch](https://github.com/ModelTC/LightLLM/blob/fa3_mtp/lightllm/models/deepseek2/layer_infer/transformer_layer_infer.py#L564). This operator combines the queries (Q) of a group of sequences into a unified computation. During the $QK^T$ computation (where $Q$ is the query matrix and $K^T$ is the transpose of the key matrix), it dynamically sets the mask for the $Score$ matrix by calculating the seq_len corresponding to each q row.
 
 <div style="text-align: center;">
 <img src="{{ site.baseurl }}/assets/images/blogs/05-mtp/fa3_mtp.png" style="zoom: 60%;" />
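
For context on the changed paragraph, below is a minimal, unfused reference sketch of the per-row masking it describes: a group of draft queries shares one KV cache, and each query row is masked past its own seq_len, so K and V are loaded once for the whole group. The function name, tensor shapes, and the per-row length convention (row i sees the first T − (G − 1 − i) tokens, matching the t1–t4 figure) are illustrative assumptions, not the actual fa3_mtp kernel or its API.

```python
import torch

def mtp_group_attention_ref(q, k, v):
    """Naive reference for the per-row masked group attention described
    above (a sketch; the real fa3_mtp kernel fuses this into FA3 tiling).

    q: (G, H, D)   -- G grouped draft queries sharing one KV cache
    k, v: (T, H, D) -- shared KV cache; assumed convention: row i of q
    may attend to the first T - (G - 1 - i) cached tokens, so later
    draft rows see strictly more of the cache.
    """
    G, H, D = q.shape
    T = k.shape[0]
    scale = D ** -0.5

    # Score = QK^T for the whole group in one computation: (H, G, T)
    scores = torch.einsum("ghd,thd->hgt", q, k) * scale

    # The "seq_len corresponding to each q row": row i sees T - (G-1-i) tokens.
    seq_lens = torch.arange(T - G + 1, T + 1)     # (G,), e.g. [3, 4]
    cols = torch.arange(T)                        # (T,)
    hidden = cols[None, :] >= seq_lens[:, None]   # (G, T), True = masked out

    scores = scores.masked_fill(hidden[None, :, :], float("-inf"))
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hgt,thd->ghd", probs, v)  # (G, H, D)

# Two draft queries (G=2) over a 4-token cache, as in the figure:
# row 0 attends to t1..t3, row 1 attends to t1..t4.
q = torch.randn(2, 8, 64)
k = torch.randn(4, 8, 64)
v = torch.randn(4, 8, 64)
out = mtp_group_attention_ref(q, k, v)  # (2, 8, 64)
```

The fused operator applies this same mask inside the Flash Attention v3 inner loop, which is where the savings come from: K/V tiles are read from memory once per group rather than once per sequence.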
