Update num_tokens_across_dp to use nccl instead of gloo #24105
Conversation
Signed-off-by: Sage Moore <[email protected]>
vllm/envs.py (outdated)
```python
"VLLM_DISABLE_NCCL_DP_PADDING":
lambda: (os.getenv("VLLM_DISABLE_NCCL_DP_PADDING", "False").lower() in
         ("true", "1")),
```
nit: I find this variable name a bit confusing - the way I read it, it sounds like there is some sort of padding that we're disabling. Maybe rename it to VLLM_USE_NCCL_FOR_DP_SYNCHRONIZATION?
Could you add a comment as well? (otherwise PR looks good)
How about VLLM_DISABLE_NCCL_FOR_DP_SYNCHRONIZATION?
Purpose
This PR changes the all-reduce backend used to synchronize DP padding between ranks from Gloo to NCCL. This all-reduce becomes a significant source of overhead in multi-node setups: anecdotally, we've seen the Gloo-backed all-reduce take ~10 ms on a two-node cluster. Simply switching to NCCL brings the overhead down to hundreds of microseconds.
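As a rough sketch of the difference (not the PR's actual code; the function name and MAX-based exchange are illustrative assumptions), the same synchronization can be expressed with torch.distributed against either backend:

```python
# Hedged sketch, not vLLM's implementation: contrasts a Gloo (CPU/TCP)
# all-reduce with an NCCL (GPU) all-reduce for exchanging per-rank token
# counts. Assumes torch.distributed is already initialized and that
# dp_group_nccl / dp_group_gloo are pre-built process groups.
import torch
import torch.distributed as dist

def num_tokens_across_dp(num_tokens: int, dp_size: int, dp_rank: int,
                         use_nccl: bool, dp_group_nccl, dp_group_gloo):
    if use_nccl:
        # NCCL path: the tensor lives on the GPU, so the collective runs
        # over NVLink/InfiniBand rather than host TCP.
        buf = torch.zeros(dp_size, dtype=torch.int32, device="cuda")
        buf[dp_rank] = num_tokens
        dist.all_reduce(buf, op=dist.ReduceOp.MAX, group=dp_group_nccl)
    else:
        # Gloo path: CPU tensor; this is the ~10 ms multi-node hop the
        # PR description measures.
        buf = torch.zeros(dp_size, dtype=torch.int32, device="cpu")
        buf[dp_rank] = num_tokens
        dist.all_reduce(buf, op=dist.ReduceOp.MAX, group=dp_group_gloo)
    # Every other slot starts at zero, so MAX recovers each rank's count;
    # callers can then pad their batch to buf.max().
    return buf
```

The collective itself is unchanged; only where the tensor lives (and therefore which transport carries it) differs between the two paths.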
Test Plan
lm_eval with the following serving command seems sufficient:
vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --data-parallel-size 2 --enable-expert-parallel
I'm happy to run additional models/setups if we think that's necessary.
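For reference, a plausible lm_eval invocation against the served endpoint; the task choice and endpoint URL below are assumptions, not specified in the PR:

```bash
lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-V2-Lite,base_url=http://localhost:8000/v1/completions \
  --tasks gsm8k
```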
Test Result