fix cuda graph #22721
Conversation
Signed-off-by: fsx950223 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review

This pull request refactors the CUDA graph support flag for the AiterFlashAttention backend. While the change to use the AttentionCGSupport enum is a good step towards standardization, I've identified a critical issue in the underlying implementation. The backend is marked as always supporting CUDA graphs, but the forward pass for prefill scenarios appears to be buggy, calling two attention kernels sequentially and overwriting the output. This makes the prefill path incorrect. I've recommended adjusting the support level to PURE_DECODE_ONLY to accurately reflect the backend's current state and prevent capturing faulty logic in CUDA graphs until the implementation is fixed.
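For context, AttentionCGSupport is a per-backend flag describing which batch shapes are safe to capture in a CUDA graph. Below is a minimal sketch of such an enum; only ALWAYS and PURE_DECODE_ONLY are named in this review, and any other member shown here is an assumption, not vLLM's actual definition.

```python
import enum


# Hedged sketch of a cudagraph-support flag as discussed in this review.
# The real AttentionCGSupport in vLLM's v1 attention utilities may define
# additional or differently named members.
class AttentionCGSupport(enum.Enum):
    NEVER = 0             # assumption: backend cannot be captured at all
    PURE_DECODE_ONLY = 1  # only pure-decode batches are safe to capture
    ALWAYS = 2            # every batch shape is safe to capture
```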
attn_cudagraph_support: ClassVar[
    AttentionCGSupport] = AttentionCGSupport.ALWAYS
Setting attn_cudagraph_support to ALWAYS seems incorrect given the current implementation of AiterFlashAttentionImpl.forward.

In the forward method, for prefill scenarios (where max_seqlen_q > 1), both torch.ops.vllm.flash_attn_varlen_func and torch.ops.aiter.paged_attention_v1 are executed sequentially. The second call will overwrite the results of the first, which appears to be a bug that makes the prefill path produce incorrect results.

Given that the decode path (max_seqlen_q <= 1) only calls paged_attention_v1 and might be correct, attn_cudagraph_support should be set to AttentionCGSupport.PURE_DECODE_ONLY. This would accurately reflect that only decode operations can be safely captured in a CUDA graph. Setting it to ALWAYS is misleading and could lead to capturing incorrect logic in CUDA graphs for prefill/mixed batches.

I suggest changing this to PURE_DECODE_ONLY until the prefill logic is fixed.
Suggested change:
-attn_cudagraph_support: ClassVar[
-    AttentionCGSupport] = AttentionCGSupport.ALWAYS
+attn_cudagraph_support: ClassVar[
+    AttentionCGSupport] = AttentionCGSupport.PURE_DECODE_ONLY
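To make the overwrite concrete, here is a self-contained sketch of the control flow the comment above describes. The kernel calls are replaced by stub functions that simply fill the output buffer; this is not the actual AiterFlashAttentionImpl.forward, only an illustration of why the prefill branch would clobber its own result.

```python
import torch


def _flash_attn_varlen_stub(out: torch.Tensor) -> None:
    # Stand-in for torch.ops.vllm.flash_attn_varlen_func writing its result.
    out.fill_(1.0)


def _paged_attention_v1_stub(out: torch.Tensor) -> None:
    # Stand-in for torch.ops.aiter.paged_attention_v1 writing its result.
    out.fill_(2.0)


def forward_sketch(output: torch.Tensor, max_seqlen_q: int) -> torch.Tensor:
    if max_seqlen_q > 1:
        # Prefill / mixed batch: the first kernel writes into `output`...
        _flash_attn_varlen_stub(output)
        # ...and the second kernel then overwrites the same buffer, which is
        # the bug the review points out.
        _paged_attention_v1_stub(output)
    else:
        # Pure decode: only one kernel runs, so the result is well defined.
        _paged_attention_v1_stub(output)
    return output


out = forward_sketch(torch.empty(4), max_seqlen_q=8)
assert torch.all(out == 2.0)  # the prefill result was clobbered
```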
@fsx950223 This attn_cudagraph_support has been renamed to cudagraph_support in PR #20059 (merged 4 days ago into main), and the full_cuda_graph flag is deprecated. CUDA graph behavior is now controlled through --compilation-config '{"cudagraph_mode": "FULL"}'. There are multiple modes introduced in PR #20059:
vllm/vllm/config/compilation.py
Lines 226 to 261 in f856c33
cudagraph_mode: Optional[CUDAGraphMode] = None
"""
The mode of the cudagraph.
- NONE, no cudagraph capture.
- PIECEWISE. (v1 default)
- FULL.
- FULL_DECODE_ONLY.
- FULL_AND_PIECEWISE.
PIECEWISE mode builds piecewise cudagraphs only, keeping the
cudagraph-incompatible ops (i.e. some attention ops) outside the cudagraph
for general flexibility.
This is the default mode.
FULL mode: Capture full cudagraph for all batches. Can be good for small
models or workloads with small prompts; not supported by many backends.
Generally for performance FULL_AND_PIECEWISE is better.
FULL_DECODE_ONLY mode: Capture full cudagraph for decode batches only.
Mixed prefill-decode batches are run without cudagraphs. Can be good for
decode instances in a P/D setup where prefill is not as important so we
can save some memory.
FULL_AND_PIECEWISE mode: Capture full cudagraph for decode batches and
piecewise cudagraph for prefill and mixed prefill-decode batches.
This is likely the most performant mode for most models.
Currently, the cudagraph mode is only used for the v1 engine.
Note that the cudagraph logic is generally orthogonal to the
compilation logic. While piecewise cudagraphs require piecewise
compilation (level=PIECEWISE and non-empty splitting_ops), full
cudagraphs are supported with and without compilation.
Warning: This flag is new and subject to change; in addition,
more modes may be added.
"""
Signed-off-by: fsx950223 <[email protected]>
Additional Tests:

Server command:

#!/bin/bash
rm -rf /root/.cache/vllm

MODEL=Qwen/Qwen3-8B
VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 \
vllm serve $MODEL \
--tensor-parallel-size 8 \
--max-num-seqs 1024 \
--kv-cache-dtype fp8 \
--max-num-batched-tokens 32768 \
--disable-log-requests \
--compilation-config '{"cudagraph_mode": "FULL"}' \
--trust-remote-code

server log:
Client command:

#!/bin/bash
lm_eval \
--model local-completions \
--tasks gsm8k \
--model_args model=Qwen/Qwen3-8B,base_url=http://127.0.0.1:8000/v1/completions \
--batch_size 100 \
> lmeval_server-Qwen_Qwen3-8B-aiter-v1-mha-fullgraphmode_FULL_PR.log 2>&1

lm eval score of
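As a quick sanity check that the server started above is serving the OpenAI-compatible completions endpoint the lm_eval client targets (URL and model name taken from the commands above), a minimal request can be sent like this:

```python
import requests

# Minimal completion request against the endpoint used by the lm_eval client.
resp = requests.post(
    "http://127.0.0.1:8000/v1/completions",
    json={"model": "Qwen/Qwen3-8B", "prompt": "2 + 2 =", "max_tokens": 8},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```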
Signed-off-by: fsx950223 <[email protected]>
Signed-off-by: fsx950223 <[email protected]>
Signed-off-by: fsx950223 <[email protected]> Signed-off-by: Duncan Moss <[email protected]>
Signed-off-by: fsx950223 <[email protected]>
Signed-off-by: fsx950223 <[email protected]> Signed-off-by: Xiao Yu <[email protected]>
Signed-off-by: fsx950223 <[email protected]>
Signed-off-by: fsx950223 <[email protected]>
Signed-off-by: fsx950223 <[email protected]>
Signed-off-by: fsx950223 <[email protected]>
Essential Elements of an Effective PR Description Checklist
- (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Test Plan

Test Result

(Optional) Documentation Update