
Commit f86643c

add fa3_mtp branch
1 parent 0e8b7bb commit f86643c

1 file changed: +1 −1 lines changed

_posts/2025-09-01-mtp.md renamed to _posts/2025-09-04-mtp.md

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ Due to our unique inference pattern, during the decode phase, adjacent sequences
 <p style="font-family: sans-serif; font-size: 0.9em; color: #555;">During the decode phase, both sequences utilize KV cache for tokens t1, t2, t3, t4. The first sequence uses the first three caches, while the second sequence uses all four caches.</p>
 </div>
 
-When using standard attention operators, each sequence is computed independently, causing the same KV cache to be loaded repeatedly, resulting in significant waste. To eliminate this inefficiency and fully leverage the performance advantages of our MTP approach, we developed a custom MTP operator based on Flash Attention v3: [fa3_mtp](https://github.com/ModelTC/LightKernel/tree/main/flash-attention/hopper). This operator combines the queries (Q) of a group of sequences into a unified computation. During the $QK^T$ computation (where $Q$ is the query matrix and $K^T$ is the transpose of the key matrix), it dynamically sets the mask for the $Score$ matrix by calculating the seq_len corresponding to each q row.
+When using standard attention operators, each sequence is computed independently, causing the same KV cache to be loaded repeatedly, resulting in significant waste. To eliminate this inefficiency and fully leverage the performance advantages of our MTP approach, we developed a custom MTP operator based on Flash Attention v3: [fa3_mtp](https://github.com/ModelTC/LightKernel/tree/main/flash-attention/hopper), and you can use it in lightllm's [fa3_mtp branch](https://github.com/ModelTC/LightLLM/blob/fa3_mtp/lightllm/models/deepseek2/layer_infer/transformer_layer_infer.py#L564). This operator combines the queries (Q) of a group of sequences into a unified computation. During the $QK^T$ computation (where $Q$ is the query matrix and $K^T$ is the transpose of the key matrix), it dynamically sets the mask for the $Score$ matrix by calculating the seq_len corresponding to each q row.
 
 <div style="text-align: center;">
 <img src="{{ site.baseurl }}/assets/images/blogs/05-mtp/fa3_mtp.png" style="zoom: 60%;" />
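
For context on the changed paragraph, below is a minimal, unfused reference sketch of the per-row masking it describes: a group of draft queries shares one KV cache, and each query row is masked past its own seq_len, so K and V are loaded once for the whole group. The function name, tensor shapes, and the per-row length convention (row i sees the first T − (G − 1 − i) tokens, matching the t1–t4 figure) are illustrative assumptions, not the actual fa3_mtp kernel or its API.

```python
import torch

def mtp_group_attention_ref(q, k, v):
    """Naive reference for the per-row masked group attention described
    above (a sketch; the real fa3_mtp kernel fuses this into FA3 tiling).

    q: (G, H, D)   -- G grouped draft queries sharing one KV cache
    k, v: (T, H, D) -- shared KV cache; assumed convention: row i of q
    may attend to the first T - (G - 1 - i) cached tokens, so later
    draft rows see strictly more of the cache.
    """
    G, H, D = q.shape
    T = k.shape[0]
    scale = D ** -0.5

    # Score = QK^T for the whole group in one computation: (H, G, T)
    scores = torch.einsum("ghd,thd->hgt", q, k) * scale

    # The "seq_len corresponding to each q row": row i sees T - (G-1-i) tokens.
    seq_lens = torch.arange(T - G + 1, T + 1)     # (G,), e.g. [3, 4]
    cols = torch.arange(T)                        # (T,)
    hidden = cols[None, :] >= seq_lens[:, None]   # (G, T), True = masked out

    scores = scores.masked_fill(hidden[None, :, :], float("-inf"))
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hgt,thd->ghd", probs, v)  # (G, H, D)

# Two draft queries (G=2) over a 4-token cache, as in the figure:
# row 0 attends to t1..t3, row 1 attends to t1..t4.
q = torch.randn(2, 8, 64)
k = torch.randn(4, 8, 64)
v = torch.randn(4, 8, 64)
out = mtp_group_attention_ref(q, k, v)  # (2, 8, 64)
```

The fused operator applies this same mask inside the Flash Attention v3 inner loop, which is where the savings come from: K/V tiles are read from memory once per group rather than once per sequence.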
