[V1] Revert the default `max_num_seqs` to V0 values for most hardware #16158
Conversation
Signed-off-by: DarkLight1337 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: DarkLight1337 <[email protected]>
This is good for now. I would prefer if we could build heuristics on actual hardware specs, like the amount of memory available on each GPU, the number of SMs, or even the ratio of model memory to KV cache memory - we can follow up on this.
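A rough sketch of what such a heuristic might look like, purely illustrative and not vLLM's actual logic; the `pick_max_num_seqs` helper and the thresholds are made up for this example, and only `torch.cuda.get_device_properties` is a real API:

```python
# Hypothetical heuristic for a hardware-aware default; NOT vLLM's implementation.
import torch

def pick_max_num_seqs(device: int = 0) -> int:
    """Pick a default max_num_seqs from coarse GPU specs (illustrative only)."""
    if not torch.cuda.is_available():
        return 256  # conservative fallback for non-CUDA backends
    props = torch.cuda.get_device_properties(device)
    total_gib = props.total_memory / 1024**3
    # Large datacenter GPUs (~80 GiB, many SMs) can afford more concurrent
    # sequences; everything else keeps the V0-style default of 256.
    if total_gib >= 70 and props.multi_processor_count >= 100:
        return 1024
    return 256
```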
@DarkLight1337 @mgoin Did we compare the performance before and after this PR? I think we should be more careful when changing performance-related configs like this.
No, but many people reported OOM issues because of the new defaults. We should make V1 compatible with V0 with minimal changes required, as we promised originally.
I also see this problem even on H100 and have to use a lower `gpu_memory_utilization` to make 0.8.5 V1 workable. Do you have any ideas?
The CUDA graph in V1 takes up more memory compared to V0, as its capture is more comprehensive.
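For anyone hitting this, a common workaround is to set these values explicitly instead of relying on the defaults; the model name and numbers below are placeholders to adapt to your setup:

```python
from vllm import LLM

# Cap concurrency and GPU memory use explicitly to avoid OOM on V1.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    max_num_seqs=256,             # V0-style default instead of 1024
    gpu_memory_utilization=0.85,  # leave headroom for CUDA graph capture
)
```

The equivalent flags for the server are `--max-num-seqs` and `--gpu-memory-utilization` on `vllm serve`.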
This PR reverts the default `max_num_seqs` from 1024 to 256 for hardware other than H100/H200, since many people have run into OOM when using V1. Note: this does not solve #15664, which is about OOM even when using the same `max_num_seqs` as V0.
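A minimal sketch of the kind of change described above, assuming a hypothetical `default_max_num_seqs` helper rather than the exact diff in this PR:

```python
# Illustrative only: keep 1024 for H100/H200-class GPUs, revert to 256 elsewhere.
import torch

def default_max_num_seqs(device: int = 0) -> int:
    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(device)
        if "H100" in name or "H200" in name:
            return 1024
    return 256
```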