Multi-node GRPO training hangs #3695
Comments
Try to set the same parameters on the different nodes, for example along the lines of the sketch below.
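A minimal sketch of the launch-time variables to keep consistent across nodes, assuming a torchrun-style launcher that reads NNODES, NODE_RANK, MASTER_ADDR and MASTER_PORT; the variable names and values here are illustrative, so check the launcher your scripts actually use:

```bash
# Illustrative node-2 prologue; everything except NODE_RANK should be
# identical on both nodes (assumption: a torchrun-style launcher).
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7  # same on both nodes if both use 8 GPUs
export NNODES=2                              # same on both nodes
export NODE_RANK=1                           # 0 on node 1, 1 on node 2: the only differing value
export MASTER_ADDR=<node-1-ip>               # same on both nodes, must be reachable from node 2
export MASTER_PORT=29500                     # same on both nodes

# Optional: make NCCL log what it is doing instead of hanging silently.
export NCCL_DEBUG=INFO
```

A mismatch in world size, processes per node, or master address typically shows up as exactly this kind of silent hang at startup.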
I updated train_grpo_node2.sh as below, but the problem persists: export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
Could you check whether vLLM 0.7.3 has the same issue? By the way, you can also try SFT to rule out any communication problems.
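To rule out cross-node communication directly, independent of GRPO or SFT, a small NCCL all_reduce check can be run on both nodes. This is a hypothetical standalone test, not something taken from the scripts in this issue:

```bash
# Hypothetical NCCL connectivity check. Run on both nodes with the same
# MASTER_ADDR/MASTER_PORT; only NODE_RANK differs between them.
cat > /tmp/nccl_check.py <<'EOF'
import os
import torch
import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for each worker process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# A single all_reduce across all 16 GPUs is enough to confirm cross-node NCCL traffic.
t = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(t)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} ok, sum={t.item()}")
dist.destroy_process_group()
EOF

torchrun \
  --nnodes=2 \
  --node_rank="${NODE_RANK:?set to 0 or 1}" \
  --nproc_per_node=8 \
  --master_addr="${MASTER_ADDR:?set to node 1 IP}" \
  --master_port=29500 \
  /tmp/nccl_check.py
```

If this also hangs, the problem is in the network or NCCL configuration (firewall, wrong interface, unreachable MASTER_ADDR) rather than in the GRPO trainer.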
I'm running into the same problem. Has it been resolved?
It's resolved now. It turned out to be a network issue on my end; everything is working normally at the moment.
Describe the bug
I run two scripts on two nodes, each with 8 A100 GPUs. The script contents are below, and the processes appear to hang. Please help: how can I fix this?
First script: train_grpo_node1.sh
Second script: train_grpo_node2.sh
But it hangs, as shown in the attached logs:
script1_result.log
script2_result.log
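One way to see where the run is stuck is to dump the Python stacks of the training processes on each node; a sketch using the external py-spy tool (an assumption on my part, not part of the original scripts):

```bash
# Hypothetical hang-diagnosis step: dump Python stacks of the stuck trainer
# processes on this node. py-spy is an external tool.
pip install py-spy

# Find the training process IDs, then dump each one's stacks; the stack that
# all ranks share usually names the collective they are blocked in.
pgrep -f train_grpo | while read -r pid; do
    echo "=== stacks for pid ${pid} ==="
    py-spy dump --pid "${pid}"
done
```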
Your hardware and system info
Write your system info here, such as CUDA version, OS, GPU model, and torch version.
Additional context
Add any other context about the problem here.