Skip to content

Conversation

youkaichao
Copy link
Member

@youkaichao youkaichao commented Sep 3, 2025

Purpose

The env var override was initially designed to mitigate NCCL's memory overhead occuring in NCCL 2.19. Reports from NCCL team show that this env var needs to be enabled by default to enable many optimizations. And I have confirmed in NVIDIA/nccl#1234 that NCCL 2.22.3 resolved the memory overhead issue.

Given the above fact, we should remove the env var override to unleash NCCL's full potential.

One caveat is for RL frameworks, they need to be consistent with vllm w.r.t. the env var value. This can be done by checking the vllm's version.

cc @hijkzzz @vermouth1992

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
@youkaichao youkaichao requested a review from hmellor as a code owner September 3, 2025 03:11
@mergify mergify bot added the documentation Improvements or additions to documentation label Sep 3, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly removes a workaround for an old NCCL bug by deleting the code that overrides the NCCL_CUMEM_ENABLE environment variable. This is a positive change as it allows vLLM to benefit from newer NCCL optimizations. The core logic change in vllm/env_override.py is sound. I have one suggestion for the documentation update in docs/usage/troubleshooting.md to improve clarity and prevent potential user confusion, which could lead to system hangs.

youkaichao and others added 2 commits September 3, 2025 11:18
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Copy link
Collaborator

@22quinn 22quinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for cleaning it up!

Copy link
Collaborator

@houseroad houseroad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@houseroad houseroad added performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed labels Sep 3, 2025
@houseroad houseroad enabled auto-merge (squash) September 3, 2025 04:32
@22quinn 22quinn added the rl Related to RL workflows label Sep 3, 2025
@houseroad houseroad merged commit f38035c into vllm-project:main Sep 3, 2025
46 checks passed
@youkaichao youkaichao deleted the remove_cumem branch September 3, 2025 06:49
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
)

Signed-off-by: youkaichao <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
)

Signed-off-by: youkaichao <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation performance Performance-related issues ready ONLY add when PR is ready to merge/full CI is needed rl Related to RL workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants