[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel #22703

elvischenv · 2025-08-12T04:18:00Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

This PR builds on the previous attn + FP8-quant fusion (#21716) and adds an attn + NVFP4-quant fusion to support the TRTLLM-gen attention kernel.
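For readers unfamiliar with the output format, here is a minimal reference sketch of block-wise FP4 "fake quantization". It assumes the commonly described NVFP4 layout: E2M1 values and one scale per 16-element block. It is not the kernel in this PR. The helper names are hypothetical, and the block scale is kept in full precision for simplicity instead of being stored as FP8 with an additional global scale.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # ASSUMPTION: one scale per 16 consecutive elements


def quantize_block(block):
    """Quantize one block to E2M1 and return (scale, dequantized values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest E2M1 value.
    scale = amax / E2M1_VALUES[-1]
    dequantized = []
    for x in block:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        dequantized.append(math.copysign(mag, x) * scale)
    return scale, dequantized


def fake_quant_nvfp4(values):
    """Apply quantize_block to each BLOCK_SIZE chunk of a flat list."""
    scales, out = [], []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, deq = quantize_block(values[start:start + BLOCK_SIZE])
        scales.append(scale)
        out.extend(deq)
    return scales, out
```

In the fused path, the attention kernel itself emits the 4-bit values and block scales, so no separate quantization kernel has to read and rewrite the attention output.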

Test Plan && Test Result

Functional:

NVFP4 TRTLLM Prefill/Decode kernel unit test: tests/kernels/attention/test_flashinfer_trtllm_attention.py

======= 44 passed, 4 skipped, 4 warnings in 64.13s (0:01:04) ======

AttentionStaticQuantPattern fusion_attn pass: tests/compile/test_fusion_attn.py::test_attention_quant_pattern

======== 12 passed, 4 warnings in 68.53s (0:01:08) ========

E2E Performance: nvidia/Llama-4-Scout-17B-16E-Instruct-FP4

main

--kv-cache-dtype = auto                                                      --kv-cache-dtype = fp8
============ Serving Benchmark Result ============                           ============ Serving Benchmark Result ============
Successful requests:                     640                                 Successful requests:                     640
Maximum request concurrency:             128                                 Maximum request concurrency:             128
Benchmark duration (s):                  143.36                              Benchmark duration (s):                  164.40
Total input tokens:                      653975                              Total input tokens:                      653975
Total generated tokens:                  655360                              Total generated tokens:                  655360
Request throughput (req/s):              4.46                                Request throughput (req/s):              3.89
Output token throughput (tok/s):         4571.39                             Output token throughput (tok/s):         3986.50
Total Token throughput (tok/s):          9133.12                             Total Token throughput (tok/s):          7964.57
---------------Time to First Token----------------                           ---------------Time to First Token----------------
Mean TTFT (ms):                          677.91                              Mean TTFT (ms):                          751.94
Median TTFT (ms):                        475.19                              Median TTFT (ms):                        521.30
P99 TTFT (ms):                           2509.39                             P99 TTFT (ms):                           2859.41
-----Time per Output Token (excl. 1st token)------                           -----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          27.32                               Mean TPOT (ms):                          31.36
Median TPOT (ms):                        27.32                               Median TPOT (ms):                        31.40
P99 TPOT (ms):                           28.20                               P99 TPOT (ms):                           32.38
---------------Inter-token Latency----------------                           ---------------Inter-token Latency----------------
Mean ITL (ms):                           27.32                               Mean ITL (ms):                           31.36
Median ITL (ms):                         25.66                               Median ITL (ms):                         29.73
P99 ITL (ms):                            146.45                              P99 ITL (ms):                            168.08
==================================================                           ==================================================
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|    |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|    |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9037|±  |0.0081|    |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8969|±  |0.0084|
|     |       |strict-match    |     5|exact_match|↑  |0.8908|±  |0.0086|    |     |       |strict-match    |     5|exact_match|↑  |0.8817|±  |0.0089|

PR

--kv-cache-dtype = auto                                                      --kv-cache-dtype = fp8
============ Serving Benchmark Result ============                           ============ Serving Benchmark Result ============
Successful requests:                     640                                 Successful requests:                     640
Maximum request concurrency:             128                                 Maximum request concurrency:             128
Benchmark duration (s):                  143.82                              Benchmark duration (s):                  164.89
Total input tokens:                      653975                              Total input tokens:                      653975
Total generated tokens:                  655360                              Total generated tokens:                  655360
Request throughput (req/s):              4.45                                Request throughput (req/s):              3.88
Output token throughput (tok/s):         4556.72                             Output token throughput (tok/s):         3974.57
Total Token throughput (tok/s):          9103.80                             Total Token throughput (tok/s):          7940.75
---------------Time to First Token----------------                           ---------------Time to First Token----------------
Mean TTFT (ms):                          694.81                              Mean TTFT (ms):                          750.19
Median TTFT (ms):                        476.66                              Median TTFT (ms):                        522.09
P99 TTFT (ms):                           2595.24                             P99 TTFT (ms):                           2847.96
-----Time per Output Token (excl. 1st token)------                           -----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          27.40                               Mean TPOT (ms):                          31.46
Median TPOT (ms):                        27.38                               Median TPOT (ms):                        31.51
P99 TPOT (ms):                           28.52                               P99 TPOT (ms):                           32.46
---------------Inter-token Latency----------------                           ---------------Inter-token Latency----------------
Mean ITL (ms):                           27.40                               Mean ITL (ms):                           31.46
Median ITL (ms):                         25.68                               Median ITL (ms):                         29.84
P99 ITL (ms):                            153.03                              P99 ITL (ms):                            171.23
==================================================                           ==================================================
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|    |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|    |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9075|±  |0.0080|    |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8984|±  |0.0083|
|     |       |strict-match    |     5|exact_match|↑  |0.8954|±  |0.0084|    |     |       |strict-match    |     5|exact_match|↑  |0.8832|±  |0.0088|
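The PR does not show the launch command for the `enable_attn_fusion` run reported next. The following is only a sketch of how such a command might be assembled, assuming vLLM's `--kv-cache-dtype` and `--compilation-config` options and a `pass_config.enable_attn_fusion` field; these names are assumptions and may differ across vLLM versions. Any other options the author used are not shown in the PR and are not guessed here.

```python
import json
import shlex

MODEL = "nvidia/Llama-4-Scout-17B-16E-Instruct-FP4"  # model named in the PR


def build_serve_command(model, kv_cache_dtype="fp8", attn_fusion=True):
    """Return an argv list for a hypothetical fused-attention serving run."""
    # ASSUMPTION: the fusion pass is toggled through a JSON compilation config.
    compilation_config = {"pass_config": {"enable_attn_fusion": attn_fusion}}
    return [
        "vllm", "serve", model,
        "--kv-cache-dtype", kv_cache_dtype,
        "--compilation-config", json.dumps(compilation_config),
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection.
    print(shlex.join(build_serve_command(MODEL)))
```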

--kv-cache-dtype = fp8, enable_attn_fusion
============ Serving Benchmark Result ============
Successful requests:                     640
Maximum request concurrency:             128
Benchmark duration (s):                  130.05
Total input tokens:                      653975
Total generated tokens:                  655360
Request throughput (req/s):              4.92
Output token throughput (tok/s):         5039.38
Total Token throughput (tok/s):          10068.10
---------------Time to First Token----------------
Mean TTFT (ms):                          658.54
Median TTFT (ms):                        456.26
P99 TTFT (ms):                           2400.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.74
Median TPOT (ms):                        24.74
P99 TPOT (ms):                           25.72
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.74
Median ITL (ms):                         23.13
P99 ITL (ms):                            140.43
==================================================
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9014|±  |0.0082|
|     |       |strict-match    |     5|exact_match|↑  |0.8863|±  |0.0087|
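To make the comparison explicit, the short script below computes the relative change of the fused run against the two unfused PR runs. The figures are copied from the "PR" tables above; the dictionary keys and function names are made up for this illustration.

```python
# Output token throughput (tok/s) and mean TPOT (ms) from the "PR" tables.
RUNS = {
    "auto": {"out_tok_s": 4556.72, "mean_tpot_ms": 27.40},
    "fp8": {"out_tok_s": 3974.57, "mean_tpot_ms": 31.46},
    "fp8+fusion": {"out_tok_s": 5039.38, "mean_tpot_ms": 24.74},
}


def pct_change(new, old):
    """Relative change of new versus old, in percent."""
    return (new - old) / old * 100.0


def summarize(runs, target="fp8+fusion"):
    """Compare the target run against every other run, metric by metric."""
    summary = {}
    for name, base in runs.items():
        if name == target:
            continue
        summary[name] = {
            metric: round(pct_change(runs[target][metric], base[metric]), 1)
            for metric in base
        }
    return summary


if __name__ == "__main__":
    for baseline, deltas in summarize(RUNS).items():
        print(f"fp8+fusion vs {baseline}: {deltas}")
```

With these numbers, the fused run delivers about 26.8% more output throughput and about 21.4% lower mean TPOT than the unfused fp8 KV-cache run, and about 10.6% more throughput than the auto KV-cache run. The gsm8k flexible-extract scores (0.9014 fused, 0.8984 unfused fp8) differ by less than one reported standard error.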

(Optional) Documentation Update

github-actions · 2025-08-12T04:18:08Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This pull request introduces support for the Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel, which is a significant enhancement for NVIDIA Blackwell GPUs. The changes are extensive, touching benchmarks, tests, and core attention and compilation logic. The implementation of the new QuantAttentionQuantPattern fusion pass is particularly noteworthy, enabling more advanced quantization-aware optimizations. The code appears well-structured, and the necessary updates to tests and benchmarks have been included to validate the new functionality. The approach to handle framework limitations, such as using a cache for fusion status and offline scale extraction, is pragmatic. Overall, this is a high-quality contribution that should bring performance improvements for supported hardware.

benchmarks/kernels/benchmark_trtllm_decode_attention.py

benchmarks/kernels/benchmark_trtllm_prefill_attention.py

tests/compile/test_fusion_attn.py

vllm/v1/attention/backends/flashinfer.py

nvpohanh · 2025-08-12T13:02:47Z

This will be blocked by FlashInfer fix: flashinfer-ai/flashinfer#1460

mergify · 2025-08-13T13:19:34Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @elvischenv.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mgoin

Model access should be resolved now, thanks!

…Attention Kernel (vllm-project#22703) Signed-off-by: elvischenv <[email protected]> Co-authored-by: Luka Govedič <[email protected]>

…Attention Kernel (vllm-project#22703) Signed-off-by: elvischenv <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Signed-off-by: Xiao Yu <[email protected]>

…Attention Kernel (vllm-project#22703) Signed-off-by: elvischenv <[email protected]> Co-authored-by: Luka Govedič <[email protected]> Signed-off-by: Ekagra Ranjan <[email protected]>

elvischenv requested review from ProExpertProg, WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat, sighingnow, tlrmchlsmth, youkaichao, ywang96, zhuohan123 and zou3519 as code owners August 12, 2025 04:18

mergify bot added the ci/build, llama, performance, rocm, v1, and tpu labels Aug 12, 2025

gemini-code-assist bot reviewed Aug 12, 2025

weireweire reviewed Aug 12, 2025

benchmarks/kernels/benchmark_trtllm_decode_attention.py (outdated, resolved)

weireweire reviewed Aug 12, 2025

benchmarks/kernels/benchmark_trtllm_decode_attention.py (outdated, resolved)

weireweire reviewed Aug 12, 2025

benchmarks/kernels/benchmark_trtllm_prefill_attention.py (outdated, resolved)

weireweire reviewed Aug 12, 2025

tests/compile/test_fusion_attn.py (outdated, resolved)

vllm/v1/attention/backends/flashinfer.py (outdated, resolved)

vllm/v1/attention/backends/flashinfer.py (outdated, resolved)

elvischenv force-pushed the elvischenv/fp4-trtllm-attn branch 3 times, most recently from 8c485e3 to c6d8400 Compare August 12, 2025 09:08

mergify bot added the needs-rebase label Aug 13, 2025

elvischenv and others added 3 commits August 22, 2025 00:39

Update quant_utils.py

7be0404

Co-authored-by: Luka Govedič <[email protected]> Signed-off-by: elvischenv <[email protected]>

fix CI

ceda61e

Signed-off-by: elvischenv <[email protected]>

fix pattern match for torch 2.8.0

6cb8a78

Signed-off-by: elvischenv <[email protected]>

elvischenv force-pushed the elvischenv/fp4-trtllm-attn branch from 45c58fb to 6cb8a78 Compare August 22, 2025 07:44

elvischenv mentioned this pull request Aug 22, 2025

Update PyTorch to 2.8.0 #20358

Merged

10 tasks

ProExpertProg approved these changes Aug 22, 2025

ProExpertProg enabled auto-merge (squash) August 22, 2025 13:59

Merge branch 'main' into elvischenv/fp4-trtllm-attn

813eb13

mgoin approved these changes Aug 22, 2025

ProExpertProg merged commit 24d0c9e into vllm-project:main Aug 22, 2025
48 checks passed

elvischenv deleted the elvischenv/fp4-trtllm-attn branch August 25, 2025 03:20

5 participants
