I launched a GRPO training script on a multi-node cluster (2 nodes with 8xH100 each, connected with InfiniBand), but the training process always hangs on the 5th step without any warning or error message. I was using the colocate mode of GRPO, since async mode doesn't support tensor or pipeline parallelism for inference, which is crucial for large models like `meta-llama/Llama-3.3-70B-Instruct`. `nvidia-smi` shows that all GPUs have memory allocated, but actual compute utilization is 0%. Could it be that InfiniBand is not configured properly?
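To see where each rank is actually stuck during the hang, one option is to register Python's stdlib `faulthandler` at the top of the training script so every rank periodically dumps its thread stacks. This is a hedged sketch: the `enable_hang_dumps` helper name, the log directory, and the interval are my own assumptions, not part of the GRPO script.

```python
import faulthandler
import os


def enable_hang_dumps(interval_sec: float = 300.0,
                      log_dir: str = "/tmp/hang_dumps"):
    """Hypothetical helper: call once at the top of the training script,
    before the trainer starts, on every rank."""
    os.makedirs(log_dir, exist_ok=True)
    rank = os.environ.get("RANK", "0")  # set by torchrun per process
    f = open(os.path.join(log_dir, f"stacks_rank{rank}.log"), "w")
    # Dump all thread stacks every interval_sec seconds until cancelled.
    # If training hangs on step 5, the repeated dumps will show the frame
    # (e.g. a collective blocking inside NCCL) that every rank is stuck in.
    faulthandler.dump_traceback_later(interval_sec, repeat=True, file=f)
    return f
```

If all ranks are blocked in the same collective call, that points at a desynchronized collective (e.g. one rank taking a different code path on step 5) rather than a network fault.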
### What I've tried

- Because training hangs on the 5th step, I thought it might be a problem with the dataset. I tried different datasets and enabled dataset shuffling, but it still hung on the 5th step.
- I tried disabling vLLM prefix caching, since the "memory free" messages were not being printed from all GPUs.
- I tried different `--sleep-level` arguments.
- I was using the private IP of each node, and `ping private_ip` worked between nodes.
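Note that `ping` only exercises the TCP/IP stack, not the InfiniBand fabric that NCCL uses for collectives. A hedged next step for narrowing this down (the env vars are standard NCCL settings and the tools come from `infiniband-diags`/`perftest`, but the interface name below is an assumption for this cluster, not taken from the logs):

```shell
# Make NCCL log its transport selection and any init problems.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET

# If NCCL picks the wrong interface for bootstrap, pin it explicitly.
# "eth0" is an assumption here; substitute the interface from ifconfig.
export NCCL_SOCKET_IFNAME=eth0

# Experiment: if forcing NCCL off InfiniBand makes the hang disappear,
# the IB fabric (not the training code) is the likely culprit.
# export NCCL_IB_DISABLE=1

# Check IB link state on each node (infiniband-diags package).
ibstat

# Point-to-point IB bandwidth test (perftest package):
#   on node 1: ib_write_bw
#   on node 2: ib_write_bw <node1_private_ip>
```

If `ib_write_bw` stalls or reports far below line rate, the InfiniBand configuration is worth investigating before the training stack.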
### How to reproduce?
Launch command 1:
Launch command 2:
### Debugging info

- `pip freeze`
- `printenv`
- `ifconfig` on node 1
- `ifconfig` on node 2
- NCCL logs