
multi-node grpo training hangs #3695


Closed
phoenixbai opened this issue Mar 27, 2025 · 5 comments

Comments

@phoenixbai

Describe the bug
I run two scripts on two nodes, each with 8 A100 GPUs. The content of each script is shown below, and the process seems to hang. Please help: how can I fix it?

train_grpo_node1.sh

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=33.187.175.52
export MASTER_PORT=29500
export NPROC_PER_NODE=7

swift rlhf \
    --rlhf_type grpo \
    --model "/mnt/modelhub/Qwen2.5-Math-7B" \
    --reward_funcs accuracy format \
    --use_vllm true \
    --vllm_device "cuda:7" \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 4096 \
    --num_infer_workers 1 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset '/mnt/datasets/NuminaMath-TIR' \
    --max_completion_length 4096 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 24 \
    --per_device_eval_batch_size 24 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --max_length 4096 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --system '/mnt/prompt.txt' \
    --deepspeed zero2 \
    --log_completions true

second script train_grpo_node2.sh

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=1
export MASTER_ADDR=33.187.175.52
export MASTER_PORT=29500
export NPROC_PER_NODE=8

swift rlhf \
    --rlhf_type grpo \
    --model "/mnt/modelhub/Qwen2.5-Math-7B" \
    --reward_funcs accuracy format \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset '/mnt/datasets/NuminaMath-TIR' \
    --max_completion_length 4096 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 24 \
    --per_device_eval_batch_size 24 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --max_length 4096 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --system '/mnt/prompt.txt' \
    --deepspeed zero2 \
    --log_completions true

But it hangs, as shown in the attached logs:

script1_result.log
script2_result.log

(screenshots of the hung console output from both nodes attached)

Your hardware and system info

$ pip show torch
WARNING: Ignoring invalid distribution -umpy (/opt/conda/lib/python3.10/site-packages)
Name: torch
Version: 2.5.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /opt/conda/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, bitsandbytes, compressed-tensors, deepspeed, flash_attn, liger_kernel, lighteval, outlines, peft, torchaudio, torchpippy, torchvision, transformer_engine, vllm, xformers, xgrammar

NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 12.4

$ pip show ms-swift
WARNING: Ignoring invalid distribution -umpy (/opt/conda/lib/python3.10/site-packages)
Name: ms_swift
Version: 3.2.2
Summary: Swift: Scalable lightWeight Infrastructure for Fine-Tuning
Home-page: https://github.com/modelscope/swift
Author: DAMO ModelScope teams
Author-email: [email protected]
License: Apache License 2.0
Location: /opt/conda/lib/python3.10/site-packages
Requires: accelerate, addict, aiohttp, attrdict, binpacking, charset-normalizer, cpm-kernels, dacite, datasets, einops, fastapi, gradio, importlib-metadata, jieba, matplotlib, modelscope, nltk, numpy, openai, oss2, pandas, peft, pillow, requests, rouge, safetensors, sentencepiece, tensorboard, tiktoken, tqdm, transformers, transformers-stream-generator, trl, uvicorn, zstandard
Required-by:


@hjh0119
Collaborator

hjh0119 commented Mar 27, 2025

Try setting the same parameters on the different nodes, e.g. NPROC_PER_NODE and use_vllm.
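
For reference, a minimal sketch of aligned launcher environments for the two nodes (values taken from the scripts above; with 8 training processes on every node the vLLM engine would be colocated, e.g. --vllm_device auto as in the follow-up script below, rather than a dedicated cuda:7):

# node 1
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=33.187.175.52
export MASTER_PORT=29500
export NPROC_PER_NODE=8   # same value on both nodes

# node 2: identical except NODE_RANK=1; the swift rlhf command,
# including --use_vllm true and the vllm_* flags, should also match node 1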

@phoenixbai
Author

Try setting the same parameters on the different nodes, e.g. NPROC_PER_NODE and use_vllm.

I updated train_grpo_node2.sh as below, but the problem persists:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=1
export MASTER_ADDR=33.187.175.52
export MASTER_PORT=29500
export NPROC_PER_NODE=8  # if I set it to 7, it fails with an assertion that the GPU count must be divisible by local_world_size

swift rlhf \
    --rlhf_type grpo \
    --model "/mnt/modelhub/Qwen2.5-Math-7B" \
    --reward_funcs accuracy format \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset '/mnt/datasets/NuminaMath-TIR' \
    --max_completion_length 4096 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 24 \
    --per_device_eval_batch_size 24 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --max_length 4096 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --system '/mnt/prompt.txt' \
    --deepspeed zero2 \
    --log_completions true \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 4096 \
    --num_infer_workers 1

@hjh0119
Collaborator

hjh0119 commented Mar 28, 2025

Could you check whether vllm 0.7.3 has the same issue?

btw, you can also try SFT to rule out any communication problems.
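
A quicker way to isolate a pure inter-node communication problem, independent of swift, is a minimal torchrun all_reduce check. This is only a sketch; the file name nccl_check.py and port 29501 are arbitrary choices:

# run the same command on both nodes, changing only --node_rank
# (0 on the master node, 1 on the other)
cat > nccl_check.py <<'EOF'
import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")                       # reads RANK/WORLD_SIZE/MASTER_* set by torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
t = torch.ones(1, device="cuda")
dist.all_reduce(t)                                    # with 2 x 8 ranks, every rank should print 16.0
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {t.item()}")
dist.destroy_process_group()
EOF

torchrun --nnodes 2 --nproc_per_node 8 --node_rank $NODE_RANK \
    --master_addr 33.187.175.52 --master_port 29501 nccl_check.py

If this hangs as well, the problem is in the network/NCCL setup (firewall, NIC selection, NCCL_SOCKET_IFNAME) rather than in GRPO itself.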

@Yuccaaa

Yuccaaa commented Mar 29, 2025

I ran into the same problem. Has this been resolved?

@phoenixbai
Author

It's resolved now. It was a problem with my network; everything is working normally at the moment.
