[DP] Create placement groups by ray_device_key #25026
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a subset of tests runs. You can ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request refactors the data parallel placement group creation logic to use a dynamic device key from current_platform.ray_device_key instead of a hardcoded "GPU" string. This is a good improvement for supporting various hardware backends like TPUs. My review identifies a potential KeyError if data parallelism is attempted on a platform where ray_device_key is empty. I've suggested adding assertions to provide a clear error message in such cases.
    "No nodes with resources found in Ray cluster.")
assert dp_master_ip_key in nodes[0], (
    "The DP master node (ip: %s) is missing or dead", dp_master_ip)
device_str = current_platform.ray_device_key
The ray_device_key can be an empty string for platforms that do not support Ray. If device_str is empty, it will cause a KeyError when used to access node_resources on line 345. It's better to add an assertion to ensure device_str is not empty, which provides an early and clear error message if data parallelism is attempted on an unsupported platform.
device_str = current_platform.ray_device_key
assert device_str, (
"current_platform.ray_device_key is empty, indicating that data "
"parallelism with Ray is not supported on this platform.")
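To illustrate the failure mode (with a hypothetical node-resource mapping for demonstration, not vLLM's actual data structures): an empty device_str produces a bare KeyError at the lookup, while the suggested assertion fails early with an actionable message.

```python
# Hypothetical Ray node-resource mapping of the kind reported per node;
# the keys and counts here are illustrative, not taken from vLLM.
node_resources = {"GPU": 8, "CPU": 64}

def devices_on_node(device_str: str) -> float:
    # Fail early with a clear message instead of a bare KeyError below.
    assert device_str, (
        "current_platform.ray_device_key is empty, indicating that data "
        "parallelism with Ray is not supported on this platform.")
    return node_resources[device_str]

print(devices_on_node("GPU"))  # prints 8
```

Calling devices_on_node("") raises an AssertionError naming the platform limitation, whereas without the assertion the same call would surface only as KeyError: ''.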
local_dp_ranks = []
num_pg_created = 0

device_str = current_platform.ray_device_key
Similar to the create_dp_placement_groups function, ray_device_key can be an empty string here. If device_str is empty, it will cause a KeyError when used to access available_resources and total_resources on lines 427 and 431. An assertion should be added to ensure device_str is not empty and fail early on unsupported platforms.
device_str = current_platform.ray_device_key
assert device_str, (
"current_platform.ray_device_key is empty, indicating that data "
"parallelism with Ray is not supported on this platform.")
LGTM.
Signed-off-by: Xinyu Chen <[email protected]> Co-authored-by: Kunshang Ji <[email protected]>
Signed-off-by: Xinyu Chen <[email protected]> Co-authored-by: Kunshang Ji <[email protected]> Signed-off-by: charlifu <[email protected]>
Signed-off-by: Xinyu Chen <[email protected]> Co-authored-by: Kunshang Ji <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>
Support creating placement groups for DP with other kinds of Ray devices.