
Conversation

zou3519
Collaborator

@zou3519 zou3519 commented Apr 29, 2025

Should fix #15896, unless users actually want to turn off TorchDynamo too (keep reading for context)

This PR adds the ability to specify `use_inductor=False` in a CompilationConfig. The main use case is using CUDAGraphs without actually using Inductor. However, we still need torch.compile's graph capture (i.e., TorchDynamo) to use CUDAGraphs, because we need to capture a graph in order to split it for piecewise CUDAGraphs.
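
For intuition, here is a minimal standalone sketch in plain PyTorch (not vLLM's actual compilation pipeline) of what "graph capture without a backend compiler" means: a pass-through torch.compile backend lets TorchDynamo capture the FX graph but skips Inductor lowering entirely, which is conceptually what `use_inductor=False` does inside vLLM's pipeline.

```py
import torch

def passthrough_backend(gm: torch.fx.GraphModule, example_inputs):
    # TorchDynamo has already captured the graph into `gm`; returning its
    # forward unchanged means no Inductor (or any other compiler) runs.
    return gm.forward

@torch.compile(backend=passthrough_backend)
def f(x):
    return torch.relu(x) + 1

print(f(torch.randn(4)))
```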

We recommend combining `use_inductor=False` with `custom_ops=['all']`. By default, all of the torch.compile configs do not use custom operators (with the exception of attention ops), since decomposing them lets Inductor generate better kernels. With `use_inductor=False` there is no backend compiler, so we recommend enabling all custom operators.

Here is how to use this in serving and offline inference:

Serving

vllm serve meta-llama/Llama-3.2-1B --compilation-config="{'use_inductor': False, 'custom_ops': ['all']}"

Offline

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
model = "meta-llama/Llama-3.2-1B"  # or args.model when parsing CLI arguments

# Keep TorchDynamo graph capture and CUDAGraphs, but skip Inductor;
# enable all custom ops since no backend compiler will generate fused kernels.
compilation_config = {
    'use_inductor': False,
    'custom_ops': ["all"],
}
llm = LLM(model=model, compilation_config=compilation_config)
outputs = llm.generate(prompts, sampling_params)

I also removed some old comments; Woosuk mentioned that they may be outdated.

Test Plan:

  • ran the above commands
  • pytest tests/compile/piecewise/test_simple.py && pytest tests/compile/piecewise/test_toy_llama.py

Benchmarks:

I used benchmark_latency.py to benchmark some models under three settings (example invocations are sketched after the list):

  1. the default settings
  2. "eager+CUDAGraphs" --compilation-config="{'use_inductor': False, 'custom_ops': ['all']}"
  3. "eager" --enforce-eager
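
Concretely, the three runs look roughly like this (a sketch: it assumes the script lives at benchmarks/benchmark_latency.py and uses Meta-Llama-3.1-8B as the example model; only the flags differ between settings):

```
# 1. default settings (torch.compile + CUDAGraphs)
python benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3.1-8B

# 2. eager + CUDAGraphs
python benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3.1-8B \
    --compilation-config="{'use_inductor': False, 'custom_ops': ['all']}"

# 3. eager
python benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3.1-8B \
    --enforce-eager
```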

Performance numbers below:

| Model | eager | eager with CUDAGraphs | torch.compile and CUDAGraphs |
| --- | --- | --- | --- |
| meta-llama/Meta-Llama-3.1-8B on 1xH100 | 1.28s | 1.23s | 1.16s |
| meta-llama/Meta-Llama-3.1-70B on 8xH100 | 3.32s | 1.96s | 1.87s |
| google/gemma-3-4b-it on 1xH100 | 2.97s | 1.37s | 0.78s |
| Qwen/Qwen3-8B on 1xH100 | 1.79s | 1.41s | 1.23s |


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@zou3519 zou3519 marked this pull request as ready for review April 29, 2025 11:58
Collaborator

nit: test_simple_piecewise_compile_with_inductor

Collaborator Author

Updated

@houseroad
Collaborator

Can we do some simple perf tests? I'd like to understand the impact on perf.

@houseroad houseroad left a comment

Overall, looks good to me.

@zou3519 zou3519 force-pushed the allow_eager_backend branch from 7e3907b to 2204838 Compare May 5, 2025 14:34
@zou3519
Collaborator Author

zou3519 commented May 5, 2025

@houseroad I updated the PR body with latency benchmarks. Overall, in terms of latency: compile+CUDAGraphs < eager+CUDAGraphs < eager.

@zou3519 zou3519 requested a review from houseroad May 5, 2025 14:42

mergify bot commented May 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zou3519.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 9, 2025
@houseroad
Collaborator

Can we rebase this PR?

@zou3519
Collaborator Author

zou3519 commented May 28, 2025

Yup will do

@zou3519 zou3519 force-pushed the allow_eager_backend branch from 2204838 to b7b835f Compare May 28, 2025 17:07
@mergify mergify bot removed the needs-rebase label May 28, 2025
@zou3519
Collaborator Author

zou3519 commented May 28, 2025

@houseroad rebased

@tlrmchlsmth tlrmchlsmth left a comment

This seems handy

@zou3519
Collaborator Author

zou3519 commented May 28, 2025

Test failures look unrelated (I see "IndexError: list index out of range" on main too).

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label May 28, 2025
@houseroad houseroad left a comment

Looks good.

Thanks for enabling this.

@DarkLight1337 DarkLight1337 merged commit 26b4fa4 into vllm-project:main May 29, 2025
71 checks passed
amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025