
Conversation

xinyu-intel (Contributor)

Support creating placement groups for DP with other kinds of Ray devices.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

mergify bot added the v1 label Sep 17, 2025

gemini-code-assist bot left a comment


Code Review

This pull request refactors the data parallel placement group creation logic to use a dynamic device key from current_platform.ray_device_key instead of a hardcoded "GPU" string. This is a good improvement for supporting various hardware backends like TPUs. My review identifies a potential KeyError if data parallelism is attempted on a platform where ray_device_key is empty. I've suggested adding assertions to provide a clear error message in such cases.
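
For context, the refactor amounts to roughly the following (a minimal sketch; the bundle layout and the placeholder IP are assumptions for illustration, not the exact code from the PR):

    from vllm.platforms import current_platform

    dp_master_ip = "10.0.0.1"  # placeholder address for illustration

    # Before: the resource key was hardcoded, so placement groups could
    # only be scheduled against Ray's "GPU" resource.
    bundles = [{"GPU": 1.0, f"node:{dp_master_ip}": 0.001}]

    # After: the key comes from the platform abstraction, so backends that
    # register e.g. "TPU" or "HPU" resources with Ray work as well.
    device_str = current_platform.ray_device_key
    bundles = [{device_str: 1.0, f"node:{dp_master_ip}": 0.001}]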

"No nodes with resources found in Ray cluster.")
assert dp_master_ip_key in nodes[0], (
"The DP master node (ip: %s) is missing or dead", dp_master_ip)
device_str = current_platform.ray_device_key

Severity: high

The ray_device_key can be an empty string for platforms that do not support Ray. If device_str is empty, it will cause a KeyError when used to access node_resources on line 345. It's better to add an assertion to ensure device_str is not empty, which provides an early and clear error message if data parallelism is attempted on an unsupported platform.

        device_str = current_platform.ray_device_key
        assert device_str, (
            "current_platform.ray_device_key is empty, indicating that data "
            "parallelism with Ray is not supported on this platform.")

A second inline comment attaches to this hunk:

        local_dp_ranks = []
        num_pg_created = 0

        device_str = current_platform.ray_device_key

Severity: high

Similar to the create_dp_placement_groups function, ray_device_key can be an empty string here. If device_str is empty, it will cause a KeyError when used to access available_resources and total_resources on lines 427 and 431. An assertion should be added to ensure device_str is not empty and fail early on unsupported platforms.

        device_str = current_platform.ray_device_key
        assert device_str, (
            "current_platform.ray_device_key is empty, indicating that data "
            "parallelism with Ray is not supported on this platform.")

jikunshang (Collaborator) left a comment


LGTM.

jikunshang added the ready label Sep 17, 2025
jikunshang enabled auto-merge (squash) September 17, 2025 06:47
jikunshang merged commit bb58dc8 into vllm-project:main Sep 17, 2025
42 checks passed
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Xinyu Chen <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Signed-off-by: charlifu <[email protected]>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Xinyu Chen <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
