
Conversation

zou3519
Collaborator

@zou3519 zou3519 commented Apr 29, 2025

Should fix #15896, unless users actually want to turn off TorchDynamo too (keep reading for context)

This PR adds the ability to specify `use_inductor=False` in a CompilationConfig. The main use case is using CUDAGraphs without actually using Inductor. However, we still need torch.compile's graph capture (i.e., TorchDynamo) to use CUDAGraphs, because we need to capture a graph in order to split it for piecewise CUDAGraphs.
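
For intuition, here is a minimal standalone sketch in plain PyTorch (not vLLM's actual compilation pipeline) of what "graph capture without a backend compiler" means: a pass-through torch.compile backend lets TorchDynamo capture the FX graph but skips Inductor lowering entirely, which is conceptually what `use_inductor=False` does inside vLLM's pipeline.

```py
import torch

def passthrough_backend(gm: torch.fx.GraphModule, example_inputs):
    # TorchDynamo has already captured the graph into `gm`; returning its
    # forward unchanged means no Inductor (or any other compiler) runs.
    return gm.forward

@torch.compile(backend=passthrough_backend)
def f(x):
    return torch.relu(x) + 1

print(f(torch.randn(4)))
```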

We recommend combining `use_inductor=False` with `custom_ops=['all']`. By default, all of the torch.compile configs do not use custom operators (with the exception of attention ops), since decomposing them lets Inductor generate better kernels. With `use_inductor=False` there is no backend compiler, so we recommend enabling all custom operators.

Here is how to use this in serving and offline inference:

Serving

vllm serve meta-llama/Llama-3.2-1B --compilation-config="{'use_inductor': False, 'custom_ops': ['all']}"

Offline

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
model = "meta-llama/Llama-3.2-1B"  # or args.model when parsing CLI arguments

# Keep TorchDynamo graph capture and CUDAGraphs, but skip Inductor;
# enable all custom ops since no backend compiler will generate fused kernels.
compilation_config = {
    'use_inductor': False,
    'custom_ops': ["all"],
}
llm = LLM(model=model, compilation_config=compilation_config)
outputs = llm.generate(prompts, sampling_params)

I also removed some old comments; Woosuk mentioned that they may be outdated.

Test Plan:

  • ran the above commands
  • pytest tests/compile/piecewise/test_simple.py && pytest tests/compile/piecewise/test_toy_llama.py

Benchmarks:

I used benchmark_latency.py to benchmark some models under three settings (example invocations are sketched after the list):

  1. the default settings
  2. "eager+CUDAGraphs" --compilation-config="{'use_inductor': False, 'custom_ops': ['all']}"
  3. "eager" --enforce-eager
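
Concretely, the three runs look roughly like this (a sketch: it assumes the script lives at benchmarks/benchmark_latency.py and uses Meta-Llama-3.1-8B as the example model; only the flags differ between settings):

```
# 1. default settings (torch.compile + CUDAGraphs)
python benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3.1-8B

# 2. eager + CUDAGraphs
python benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3.1-8B \
    --compilation-config="{'use_inductor': False, 'custom_ops': ['all']}"

# 3. eager
python benchmarks/benchmark_latency.py --model meta-llama/Meta-Llama-3.1-8B \
    --enforce-eager
```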

Performance numbers below:

| Model | eager | eager with CUDAGraphs | torch.compile and CUDAGraphs |
| --- | --- | --- | --- |
| meta-llama/Meta-Llama-3.1-8B on 1xH100 | 1.28s | 1.23s | 1.16s |
| meta-llama/Meta-Llama-3.1-70B on 8xH100 | 3.32s | 1.96s | 1.87s |
| google/gemma-3-4b-it on 1xH100 | 2.97s | 1.37s | 0.78s |
| Qwen/Qwen3-8B on 1xH100 | 1.79s | 1.41s | 1.23s |


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@zou3519 zou3519 marked this pull request as ready for review April 29, 2025 11:58
Collaborator

nit: test_simple_piecewise_compile_with_inductor

Collaborator Author

Updated

@houseroad
Collaborator

Can we do some simple perf tests? I'd like to understand the impact on perf.

@houseroad houseroad left a comment

Overall, looks good to me.

@zou3519 zou3519 force-pushed the allow_eager_backend branch from 7e3907b to 2204838 Compare May 5, 2025 14:34
@zou3519
Collaborator Author

zou3519 commented May 5, 2025

@houseroad I updated the PR body with latency benchmarks. Overall, in terms of latency: compile+CUDAGraphs < eager+CUDAGraphs < eager.

@zou3519 zou3519 requested a review from houseroad May 5, 2025 14:42

mergify bot commented May 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zou3519.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 9, 2025
@houseroad
Collaborator

Can we rebase this PR?

@zou3519
Collaborator Author

zou3519 commented May 28, 2025

Yup will do

@zou3519 zou3519 force-pushed the allow_eager_backend branch from 2204838 to b7b835f Compare May 28, 2025 17:07
@mergify mergify bot removed the needs-rebase label May 28, 2025
@zou3519
Collaborator Author

zou3519 commented May 28, 2025

@houseroad rebased

@tlrmchlsmth tlrmchlsmth left a comment

This seems handy

@zou3519
Collaborator Author

zou3519 commented May 28, 2025

Test failures look unrelated (I see "IndexError: list index out of range" on main too).

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label May 28, 2025
@houseroad houseroad left a comment

Looks good.

Thanks for enabling this.

@DarkLight1337 DarkLight1337 merged commit 26b4fa4 into vllm-project:main May 29, 2025
71 checks passed
amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025