
multi-node grpo training hangs #3695


Closed
phoenixbai opened this issue Mar 27, 2025 · 5 comments

Comments

@phoenixbai

Describe the bug
I run two scripts on two nodes, each with 8 A100 GPUs. The content of each script is shown below, and the process seems to hang. Please help: how can I fix it?

train_grpo_node1.sh

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=33.187.175.52
export MASTER_PORT=29500
export NPROC_PER_NODE=7

swift rlhf \
    --rlhf_type grpo \
    --model "/mnt/modelhub/Qwen2.5-Math-7B" \
    --reward_funcs accuracy format \
    --use_vllm true \
    --vllm_device "cuda:7" \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 4096 \
    --num_infer_workers 1 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset '/mnt/datasets/NuminaMath-TIR' \
    --max_completion_length 4096 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 24 \
    --per_device_eval_batch_size 24 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --max_length 4096 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --system '/mnt/prompt.txt' \
    --deepspeed zero2 \
    --log_completions true

second script train_grpo_node2.sh

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=1
export MASTER_ADDR=33.187.175.52
export MASTER_PORT=29500
export NPROC_PER_NODE=8

swift rlhf \
    --rlhf_type grpo \
    --model "/mnt/modelhub/Qwen2.5-Math-7B" \
    --reward_funcs accuracy format \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset '/mnt/datasets/NuminaMath-TIR' \
    --max_completion_length 4096 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 24 \
    --per_device_eval_batch_size 24 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --max_length 4096 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --system '/mnt/prompt.txt' \
    --deepspeed zero2 \
    --log_completions true

But it hangs, as shown in the attached logs:

script1_result.log
script2_result.log

(screenshots of the hung console output from both nodes attached)

Your hardware and system info

$ pip show torch
WARNING: Ignoring invalid distribution -umpy (/opt/conda/lib/python3.10/site-packages)
Name: torch
Version: 2.5.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /opt/conda/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, bitsandbytes, compressed-tensors, deepspeed, flash_attn, liger_kernel, lighteval, outlines, peft, torchaudio, torchpippy, torchvision, transformer_engine, vllm, xformers, xgrammar

NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 12.4

$ pip show ms-swift
WARNING: Ignoring invalid distribution -umpy (/opt/conda/lib/python3.10/site-packages)
Name: ms_swift
Version: 3.2.2
Summary: Swift: Scalable lightWeight Infrastructure for Fine-Tuning
Home-page: https://github.com/modelscope/swift
Author: DAMO ModelScope teams
Author-email: [email protected]
License: Apache License 2.0
Location: /opt/conda/lib/python3.10/site-packages
Requires: accelerate, addict, aiohttp, attrdict, binpacking, charset-normalizer, cpm-kernels, dacite, datasets, einops, fastapi, gradio, importlib-metadata, jieba, matplotlib, modelscope, nltk, numpy, openai, oss2, pandas, peft, pillow, requests, rouge, safetensors, sentencepiece, tensorboard, tiktoken, tqdm, transformers, transformers-stream-generator, trl, uvicorn, zstandard
Required-by:


@hjh0119
Collaborator

hjh0119 commented Mar 27, 2025

Try setting the same parameters on the different nodes, e.g. NPROC_PER_NODE and use_vllm.
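
For reference, a minimal sketch of aligned launcher environments for the two nodes (values taken from the scripts above; with 8 training processes on every node the vLLM engine would be colocated, e.g. --vllm_device auto as in the follow-up script below, rather than a dedicated cuda:7):

# node 1
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=0
export MASTER_ADDR=33.187.175.52
export MASTER_PORT=29500
export NPROC_PER_NODE=8   # same value on both nodes

# node 2: identical except NODE_RANK=1; the swift rlhf command,
# including --use_vllm true and the vllm_* flags, should also match node 1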

@phoenixbai
Author

Try setting the same parameters on the different nodes, e.g. NPROC_PER_NODE and use_vllm.

I updated train_grpo_node2.sh as below, but the problem persists:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export NNODES=2
export NODE_RANK=1
export MASTER_ADDR=33.187.175.52
export MASTER_PORT=29500
export NPROC_PER_NODE=8  # if I set it to 7, it fails with an assertion that the GPU count must be divisible by local_world_size

swift rlhf \
    --rlhf_type grpo \
    --model "/mnt/modelhub/Qwen2.5-Math-7B" \
    --reward_funcs accuracy format \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset '/mnt/datasets/NuminaMath-TIR' \
    --max_completion_length 4096 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 24 \
    --per_device_eval_batch_size 24 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --max_length 4096 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 8 \
    --temperature 0.9 \
    --system '/mnt/prompt.txt' \
    --deepspeed zero2 \
    --log_completions true \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 4096 \
    --num_infer_workers 1

@hjh0119
Collaborator

hjh0119 commented Mar 28, 2025

Could you check whether vllm 0.7.3 has the same issue?

btw, you can also try SFT to rule out any communication problems.
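
A quicker way to isolate a pure inter-node communication problem, independent of swift, is a minimal torchrun all_reduce check. This is only a sketch; the file name nccl_check.py and port 29501 are arbitrary choices:

# run the same command on both nodes, changing only --node_rank
# (0 on the master node, 1 on the other)
cat > nccl_check.py <<'EOF'
import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")                       # reads RANK/WORLD_SIZE/MASTER_* set by torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
t = torch.ones(1, device="cuda")
dist.all_reduce(t)                                    # with 2 x 8 ranks, every rank should print 16.0
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {t.item()}")
dist.destroy_process_group()
EOF

torchrun --nnodes 2 --nproc_per_node 8 --node_rank $NODE_RANK \
    --master_addr 33.187.175.52 --master_port 29501 nccl_check.py

If this hangs as well, the problem is in the network/NCCL setup (firewall, NIC selection, NCCL_SOCKET_IFNAME) rather than in GRPO itself.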

@Yuccaaa

Yuccaaa commented Mar 29, 2025

I ran into the same problem. Has this been resolved?

@phoenixbai
Author

It's resolved now. It was a problem with my network; everything is working normally at the moment.
