Skip to content

Conversation

SageMoore
Copy link
Contributor

@SageMoore SageMoore commented Sep 2, 2025

Purpose

This PR changes the All Reduce backend that we use to synchronize DP padding between ranks from gloo to nccl. This AR starts to be a significant source of overhead when running with multi-node setups. Anecdotally, we've seen the Gloo backed all reduce take ~10ms in a two node cluster setup. Simply swapping over to NCCL brings the over head down to 100s of microseconds.

Test Plan

lm_eval with the following serving command seems sufficient: vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --data-parallel-size 2 --enable-expert-parallel

I'm happy to run additional models/setups if we think that's necessary.

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3700|±  |0.0279|
|     |       |strict-match    |     5|exact_match|↑  |0.3633|±  |0.0278|

Signed-off-by: Sage Moore <[email protected]>
Signed-off-by: Sage Moore <[email protected]>
Signed-off-by: Sage Moore <[email protected]>
@SageMoore
Copy link
Contributor Author

Here are some profiles for a DP=4 single node run.

With Gloo
Screenshot 2025-08-29 at 4 03 21 PM

With NCCL
Screenshot 2025-08-29 at 4 04 13 PM

@SageMoore SageMoore marked this pull request as ready for review September 11, 2025 14:48
Signed-off-by: Sage Moore <[email protected]>
vllm/envs.py Outdated
Comment on lines 743 to 745
"VLLM_DISABLE_NCCL_DP_PADDING":
lambda: (os.getenv("VLLM_DISABLE_NCCL_DP_PADDING", "False").lower() in
("true", "1")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I find this variable name a bit confusing - the way I read it, it sounds like there is some sort of padding that we're disabling. Maybe rename it to VLLM_USE_NCCL_FOR_DP_SYNCHRONIZATION?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment as well? (otherwise PR looks good)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about VLLM_DISABLE_NCCL_FOR_DP_SYNCHRONIZATION?

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 15, 2025
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) September 15, 2025 16:08
@tlrmchlsmth tlrmchlsmth merged commit 49bfc53 into vllm-project:main Sep 15, 2025
53 checks passed
@SageMoore SageMoore deleted the sage/gloo-nccl branch September 15, 2025 19:42
tlrmchlsmth pushed a commit to tlrmchlsmth/vllm that referenced this pull request Sep 15, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants