Update num_tokens_across_dp to use nccl instead of gloo #24105
Conversation
Signed-off-by: Sage Moore <[email protected]>
vllm/envs.py (outdated)
```python
"VLLM_DISABLE_NCCL_DP_PADDING":
lambda: (os.getenv("VLLM_DISABLE_NCCL_DP_PADDING", "False").lower() in
         ("true", "1")),
```
nit: I find this variable name a bit confusing - the way I read it, it sounds like there is some sort of padding that we're disabling. Maybe rename it to VLLM_USE_NCCL_FOR_DP_SYNCHRONIZATION?
Could you add a comment as well? (otherwise PR looks good)
How about VLLM_DISABLE_NCCL_FOR_DP_SYNCHRONIZATION?
Purpose
This PR changes the all-reduce backend used to synchronize DP padding between ranks from Gloo to NCCL. This all-reduce becomes a significant source of overhead in multi-node setups: anecdotally, we've seen the Gloo-backed all-reduce take ~10 ms on a two-node cluster. Simply switching to NCCL brings the overhead down to hundreds of microseconds.
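As a rough sketch of the difference (not the PR's actual code; the function name and MAX-based exchange are illustrative assumptions), the same synchronization can be expressed with torch.distributed against either backend:

```python
# Hedged sketch, not vLLM's implementation: contrasts a Gloo (CPU/TCP)
# all-reduce with an NCCL (GPU) all-reduce for exchanging per-rank token
# counts. Assumes torch.distributed is already initialized and that
# dp_group_nccl / dp_group_gloo are pre-built process groups.
import torch
import torch.distributed as dist

def num_tokens_across_dp(num_tokens: int, dp_size: int, dp_rank: int,
                         use_nccl: bool, dp_group_nccl, dp_group_gloo):
    if use_nccl:
        # NCCL path: the tensor lives on the GPU, so the collective runs
        # over NVLink/InfiniBand rather than host TCP.
        buf = torch.zeros(dp_size, dtype=torch.int32, device="cuda")
        buf[dp_rank] = num_tokens
        dist.all_reduce(buf, op=dist.ReduceOp.MAX, group=dp_group_nccl)
    else:
        # Gloo path: CPU tensor; this is the ~10 ms multi-node hop the
        # PR description measures.
        buf = torch.zeros(dp_size, dtype=torch.int32, device="cpu")
        buf[dp_rank] = num_tokens
        dist.all_reduce(buf, op=dist.ReduceOp.MAX, group=dp_group_gloo)
    # Every other slot starts at zero, so MAX recovers each rank's count;
    # callers can then pad their batch to buf.max().
    return buf
```

The collective itself is unchanged; only where the tensor lives (and therefore which transport carries it) differs between the two paths.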
Test Plan
lm_eval with the following serving command seems sufficient:
vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --data-parallel-size 2 --enable-expert-parallel
I'm happy to run additional models/setups if we think that's necessary.
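For reference, a plausible lm_eval invocation against the served endpoint; the task choice and endpoint URL below are assumptions, not specified in the PR:

```bash
lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-V2-Lite,base_url=http://localhost:8000/v1/completions \
  --tasks gsm8k
```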
Test Result