Add ability to use CUDAGraphs with use_inductor=False #17345
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
nit: test_simple_piecewise_compile_with_inductor
Updated
Can we do some simple perf tests? I would like to understand the impact on performance.
Overall, looks good to me.
Force-pushed from 7e3907b to 2204838
@houseroad I updated the PR body with latency benchmarks. Overall, compile+cudagraphs < eager+cudagraphs < eager.
This pull request has merge conflicts that must be resolved before it can be merged.
Can we rebase this PR?
Yup, will do.
Should fix vllm-project#15896, unless users actually want to turn off TorchDynamo too (keep reading for context).

This PR adds the ability to specify `use_inductor=False` in a CompilationConfig. The main use case for this is the ability to use CUDAGraphs without actually using Inductor. However, we still need torch.compile's graph capture (e.g. TorchDynamo) to use CUDAGraphs, because we need to capture a graph in order to split it for piecewise CUDAGraphs.

`use_inductor=False` can be specified in one of the following two ways:

Serving

```
vllm serve meta-llama/Llama-3.2-1B --compilation_config "{'use_inductor': False}"
```

Offline

```py
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
model = args.model  # model name taken from the script's own CLI arguments
compilation_config = {
    'use_inductor': False,
}
llm = LLM(model=model, compilation_config=compilation_config)
outputs = llm.generate(prompts, sampling_params)
```

I also removed some old comments; Woosuk mentioned that they may be outdated.

Test Plan:
- ran the above commands
- `pytest tests/compile/piecewise/test_simple.py && pytest tests/compile/piecewise/test_toy_llama.py`

Signed-off-by: rzou <[email protected]>
Force-pushed from 2204838 to b7b835f
@houseroad rebased
This seems handy
Test failures look unrelated (I see "IndexError: list index out of range" on main too).
Looks good.
Thanks for enabling this.
Add ability to use CUDAGraphs with use_inductor=False (#17345)
Signed-off-by: rzou <[email protected]>
Signed-off-by: amit <[email protected]>
Should fix #15896, unless users actually want to turn off TorchDynamo too (keep reading for context).
This PR adds the ability to specify `use_inductor=False` in a CompilationConfig. The main use case for this is the ability to use CUDAGraphs without actually using Inductor. However, we still need torch.compile's graph capture (e.g. TorchDynamo) to use CUDAGraphs, because we need to capture a graph in order to split it for piecewise CUDAGraphs.

We recommend combining `use_inductor=False` with `custom_ops=['all']`. By default, all of the torch.compile configs do not use custom operators (with the exception of attention ops); this allows Inductor to generate better kernels. With `use_inductor=False`, there is no backend compiler, so we recommend allowing all custom operators.

Here is how to use this in serving and offline inference:
Serving
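With the server CLI, pass the compilation config as JSON. This is the same command as in the commit message earlier in the thread; per the recommendation above, `'custom_ops': ['all']` can be added to the same JSON:

```
vllm serve meta-llama/Llama-3.2-1B --compilation_config "{'use_inductor': False}"
```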
Offline
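For offline inference, here is a minimal sketch based on the snippet in the commit message above. The imports and the hard-coded model name are filled in here (the original used `args.model` from its own argument parser), and `'custom_ops': ['all']` is added per the recommendation above:

```py
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Skip Inductor but keep Dynamo's graph capture so piecewise CUDAGraphs still work;
# with no backend compiler, allow all custom ops.
compilation_config = {
    'use_inductor': False,
    'custom_ops': ['all'],
}

llm = LLM(model="meta-llama/Llama-3.2-1B", compilation_config=compilation_config)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```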
I also removed some old comments; Woosuk mentioned that they may be outdated.
Test Plan:
- ran the above commands
- `pytest tests/compile/piecewise/test_simple.py && pytest tests/compile/piecewise/test_toy_llama.py`
Benchmarks:
I used benchmark_latency.py to benchmark some models with the following flags (a sketch of the full invocation follows the list):
- `--compilation-config="{'use_inductor': False, 'custom_ops': ['all']}"` (this PR's configuration)
- `--enforce-eager`
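A sketch of what one such run might look like, assuming the `benchmarks/benchmark_latency.py` script in the vLLM repo and the Llama-3.2-1B model used elsewhere in this PR (the exact set of benchmarked models isn't listed here):

```
# CUDAGraphs without Inductor (this PR's configuration)
python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.2-1B \
    --compilation-config="{'use_inductor': False, 'custom_ops': ['all']}"

# Eager baseline: no compilation, no CUDAGraphs
python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.2-1B \
    --enforce-eager
```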
Performance numbers below; overall (latency), compile+cudagraphs < eager+cudagraphs < eager.