[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel #22703
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This pull request introduces support for the Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel, which is a significant enhancement for NVIDIA Blackwell GPUs. The changes are extensive, touching benchmarks, tests, and core attention and compilation logic. The implementation of the new QuantAttentionQuantPattern fusion pass is particularly noteworthy, enabling more advanced quantization-aware optimizations. The code appears well-structured, and the necessary updates to tests and benchmarks have been included to validate the new functionality. The approach to handling framework limitations, such as using a cache for fusion status and offline scale extraction, is pragmatic. Overall, this is a high-quality contribution that should bring performance improvements for supported hardware.
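For readers unfamiliar with the pass, the sketch below illustrates the graph shape such a fusion rewrites: attention followed by output quantization is replaced with a single call that emits the quantized output directly. All names here (`attention_stub`, `quant_stub`) are illustrative placeholders, not the PR's actual ops.

```python
# Illustrative only: stand-ins for the real FP8-q/kv attention kernel and the
# NVFP4 output-quant op. The fusion pass matches the "unfused" composition and
# swaps in a single fused kernel call that must preserve the same numerics.
import torch
import torch.nn.functional as F

def attention_stub(q, k, v):
    # placeholder for the FP8-q/kv attention kernel
    return F.scaled_dot_product_attention(q, k, v)

def quant_stub(x, scale):
    # placeholder for output quantization (the real op emits NVFP4 + block scales)
    return (x / scale).clamp(-6.0, 6.0)

def unfused(q, k, v, out_scale):
    attn_out = attention_stub(q, k, v)       # high-precision intermediate
    return quant_stub(attn_out, out_scale)   # separate quant kernel launch

def fused(q, k, v, out_scale):
    # what the rewritten graph computes: one kernel, no intermediate tensor
    # (modelled here by the same composition, since only the schedule changes)
    return quant_stub(attention_stub(q, k, v), out_scale)

q = k = v = torch.randn(1, 8, 128, 64)
assert torch.equal(unfused(q, k, v, 0.5), fused(q, k, v, 0.5))
```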
Force-pushed from 8c485e3 to c6d8400
This will be blocked by FlashInfer fix: flashinfer-ai/flashinfer#1460
This pull request has merge conflicts that must be resolved before it can be merged.
Co-authored-by: Luka Govedič <[email protected]> Signed-off-by: elvischenv <[email protected]>
Signed-off-by: elvischenv <[email protected]>
Signed-off-by: elvischenv <[email protected]>
Force-pushed from 45c58fb to 6cb8a78
Model access should be resolved now, thanks!
…Attention Kernel (vllm-project#22703) Signed-off-by: elvischenv <[email protected]> Co-authored-by: Luka Govedič <[email protected]>
…Attention Kernel (vllm-project#22703) Signed-off-by: elvischenv <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Signed-off-by: Xiao Yu <[email protected]>
…Attention Kernel (vllm-project#22703) Signed-off-by: elvischenv <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Signed-off-by: Ekagra Ranjan <[email protected]>
Essential Elements of an Effective PR Description Checklist
- (Optional) Documentation update, e.g. supported_models.md and examples for a new model.

Purpose
This PR builds on the previous attn + FP8-quant fusion (#21716), adding an attn + NVFP4-quant fusion to support the TRTLLM-gen attention kernel.
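Since "NVFP4-out" means the attention output is written directly in NVFP4 format, here is a minimal, self-contained sketch of what NVFP4-style quantization looks like numerically (4-bit E2M1 values with per-16-element block scales). It is a simplification for illustration only: the real format stores FP8 block scales plus a global scale, and this PR performs the step inside the attention kernel rather than in Python.

```python
import torch

# magnitudes representable by E2M1 (with a separate sign bit)
E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_nvfp4_quant(x: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    """Quantize-dequantize x with per-block scales, NVFP4-style (simplified)."""
    orig_shape = x.shape
    x = x.float().reshape(-1, block_size)
    # choose a scale per block so values fit E2M1's [-6, 6] range
    scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 6.0
    scaled = x / scales
    # snap each magnitude to the nearest representable E2M1 value
    idx = (scaled.abs().unsqueeze(-1) - E2M1_VALUES).abs().argmin(dim=-1)
    deq = E2M1_VALUES[idx] * torch.sign(scaled) * scales
    return deq.reshape(orig_shape)

x = torch.randn(4, 128)
err = (x - fake_nvfp4_quant(x)).abs().max()
print(f"max abs quantization error: {err:.4f}")
```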
Test Plan & Test Result
Functional:
tests/compile/test_fusion_attn.py::test_attention_quant_pattern
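As a usage note, the functional test above can presumably be run directly with a standard pytest invocation, e.g. `pytest tests/compile/test_fusion_attn.py -k test_attention_quant_pattern`; the exact CI wiring is not shown here.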
E2E Performance: nvidia/Llama-4-Scout-17B-16E-Instruct-FP4, main vs. this PR.
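For context, below is a hedged sketch of how the end-to-end setup above might be exercised through vLLM's public Python API, assuming a Blackwell GPU with FlashInfer installed. Only the model name comes from the PR description; the FP8 KV-cache option is an assumption about the intended configuration, not taken from this PR.

```python
# A minimal sketch, not the PR's benchmark script: load the FP4 checkpoint
# with an FP8 KV cache, the configuration the fused kernel targets.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-4-Scout-17B-16E-Instruct-FP4",
    kv_cache_dtype="fp8",  # assumption: FP8 KV cache to pair with the FP8-q/kv kernel
)
outputs = llm.generate(
    ["The quick brown fox"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```

Enabling the attention+quant fusion pass itself may additionally require the relevant option in vLLM's compilation/pass configuration, which this sketch omits.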
(Optional) Documentation Update