When using standard attention operators, each sequence is computed independently, so the same KV cache is loaded repeatedly, wasting memory bandwidth. To eliminate this inefficiency and fully exploit the performance advantages of our MTP approach, we developed a custom MTP operator on top of Flash Attention v3: [fa3_mtp](https://github.com/ModelTC/LightKernel/tree/main/flash-attention/hopper), which you can use via LightLLM's [fa3_mtp branch](https://github.com/ModelTC/LightLLM/blob/fa3_mtp/lightllm/models/deepseek2/layer_infer/transformer_layer_infer.py#L564). The operator merges the queries (Q) of a group of sequences into a single computation. During the $QK^T$ computation (where $Q$ is the query matrix and $K^T$ is the transpose of the key matrix), it dynamically sets the mask on the score matrix by computing the effective seq_len of each query row.
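To make the masking concrete, here is a minimal, non-fused PyTorch sketch of the per-row causal mask idea for one request's packed MTP queries. The function name and parameters (`mtp_group_attention_reference`, `kv_len`, `draft_len`) are illustrative assumptions, not the actual fa3_mtp kernel API; the real operator fuses this logic into the Flash Attention v3 kernel so the KV cache is read only once for the whole group.

```python
import torch


def mtp_group_attention_reference(q, k, v, kv_len, draft_len):
    """Reference (non-fused) sketch of grouped MTP attention masking.

    q:         [draft_len, num_heads, head_dim]  -- the packed draft (MTP)
               query tokens of one request, computed together so the KV
               cache is loaded once instead of once per token.
    k, v:      [kv_len + draft_len, num_heads, head_dim]  -- cached keys/values
               followed by the keys/values of the draft tokens themselves.
    kv_len:    number of tokens already verified in the KV cache.
    draft_len: number of speculative tokens appended by MTP.
    """
    head_dim = q.shape[-1]
    scale = head_dim ** -0.5

    # scores: [num_heads, draft_len, kv_len + draft_len]
    scores = torch.einsum("qhd,khd->hqk", q, k) * scale

    # Query row i is the (kv_len + i)-th token of the sequence, so its
    # effective seq_len is kv_len + i + 1: it may only attend to keys
    # 0 .. kv_len + i.  This per-row seq_len is what the fused kernel
    # derives on the fly while computing Q K^T.
    total_len = k.shape[0]
    key_pos = torch.arange(total_len, device=q.device)                       # [total_len]
    row_seq_len = kv_len + torch.arange(1, draft_len + 1, device=q.device)   # [draft_len]
    mask = key_pos[None, :] >= row_seq_len[:, None]                          # True -> masked out
    scores = scores.masked_fill(mask[None, :, :], float("-inf"))

    probs = scores.softmax(dim=-1)
    return torch.einsum("hqk,khd->qhd", probs, v)                            # [draft_len, num_heads, head_dim]
```

In the fused operator the same per-row seq_len check is applied tile by tile inside the $QK^T$ loop, so the mask never needs to be materialized and the group's KV blocks are streamed through shared memory a single time.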