-
-
Notifications
You must be signed in to change notification settings - Fork 10.6k
[distributed][rl] remove nccl cumem env var override #24141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request correctly removes a workaround for an old NCCL bug by deleting the code that overrides the NCCL_CUMEM_ENABLE
environment variable. This is a positive change as it allows vLLM to benefit from newer NCCL optimizations. The core logic change in vllm/env_override.py
is sound. I have one suggestion for the documentation update in docs/usage/troubleshooting.md
to improve clarity and prevent potential user confusion, which could lead to system hangs.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for cleaning it up!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
) Signed-off-by: youkaichao <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
) Signed-off-by: youkaichao <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Purpose
The env var override was initially designed to mitigate NCCL's memory overhead occuring in NCCL 2.19. Reports from NCCL team show that this env var needs to be enabled by default to enable many optimizations. And I have confirmed in NVIDIA/nccl#1234 that NCCL 2.22.3 resolved the memory overhead issue.
Given the above fact, we should remove the env var override to unleash NCCL's full potential.
One caveat is for RL frameworks, they need to be consistent with vllm w.r.t. the env var value. This can be done by checking the vllm's version.
cc @hijkzzz @vermouth1992
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md
andexamples
for a new model.