GRPO training of a 32B model hits OOM #3871


Open
zhilinwang1 opened this issue Apr 14, 2025 · 11 comments
@zhilinwang1

Using tensor parallel 8, optimizer offload, flash attention, and vLLM, I still get OOM on an 8*96G machine. The specific configuration and error are below:
nproc_per_node=8

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=$nnodes \
NODE_RANK=$RANK \
MASTER_ADDR=$MASTER_ADDR \
MASTER_PORT=$MASTER_PORT \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
    --rlhf_type grpo \
    --model xxxx/xxxxx \
    --model-type qwq \
    --attn_impl flash_attn \
    --gradient_checkpointing true \
    --reward_funcs reflection_q \
    --use_vllm false \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.8 \
    --vllm_max_model_len 8192 \
    --num_infer_workers 8 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'xxxxx.jsonl' \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 8 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --num_generations 8 \
    --temperature 0.9 \
    --deepspeed zero3 \
    --log_completions true \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --tensor_parallel_size 8

Error message:

[Screenshot of the OOM error traceback]
@hjh0119
Collaborator

hjh0119 commented Apr 14, 2025

You can try:

  1. --deepspeed zero3_offload
  2. --beta 0 to disable the ref model
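
A minimal sketch of how those two suggestions slot into the command above (only these flags change; everything else stays as in the original command):

    # Hedged sketch: apply the two suggestions to the existing launch command.
    # zero3_offload moves ZeRO-3 optimizer/param state to CPU, and beta 0 drops
    # the KL term so no reference model needs to be kept in memory.
    swift rlhf \
        --rlhf_type grpo \
        --deepspeed zero3_offload \
        --beta 0 \
        ...  # remaining arguments unchanged from the original command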

@zhilinwang1
Author

> You can try:
>
>   1. --deepspeed zero3_offload
>   2. --beta 0 to disable the ref model

  1. I've tried offloading the params as well, but still got OOM.
  2. Doing so would remove the KL constraint, though. Does the ref model follow the tensor parallel = 8 argument?

I also tried LoRA, but still got OOM. Any other suggestions? Is there anything I can provide that would help locate the error?

@hjh0119
Collaborator

hjh0119 commented Apr 14, 2025

decrease vllm_gpu_memory_utilization

btw

--sleep_level 1
--offload_model true
--offload_optimizer true
--gc_collect_after_offload true

These options are intended for the vLLM backend. Since you have set --use_vllm false, the above arguments will not take effect. Perhaps setting --use_vllm true will work
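
For reference, a hedged sketch of the combination being suggested here (the 0.5 value is illustrative, not a tested setting); these flags would replace the corresponding ones in the original command:

--use_vllm true
--vllm_gpu_memory_utilization 0.5
--sleep_level 1
--offload_model true
--offload_optimizer true
--gc_collect_after_offload true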

@AUFEfzx

AUFEfzx commented Apr 19, 2025

I ran into the same problem; like you, the OOM happens later in the vLLM inference service that gets deployed. I'm curious why the vLLM service here doesn't expose the tensor_parallel_size and pipeline_parallel_size parameters.

@miyeeee

miyeeee commented Apr 21, 2025

Met the same problem, have you solved it yet?

@hjh0119
Collaborator

hjh0119 commented Apr 21, 2025

The tensor parallelism for async mode and the 32B full GRPO training script are currently in development.

@hjh0119
Collaborator

hjh0119 commented Apr 23, 2025

32B GRPO full training script https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/Qwen2_5_32B_full.sh

@miyeeee

miyeeee commented Apr 24, 2025

> 32B GRPO full training script https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/Qwen2_5_32B_full.sh

What if I don't have 8*80G GPUs (1 node, 8 GPUs per node), but instead have 32*32G NPUs (4 nodes, 8 NPUs per node)? How should I rewrite the script to support multi-node training?
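
Not an answer from the maintainers, but a rough sketch of the multi-node launch side, reusing the environment variables from the command at the top of this issue (the Ascend device-selection variable and the port are assumptions; memory settings would still need tuning for 32G devices):

    # Run the same swift rlhf command on each of the 4 nodes; only NODE_RANK differs (0..3).
    NNODES=4 \
    NODE_RANK=0 \
    MASTER_ADDR=<ip-of-node-0> \
    MASTER_PORT=29500 \
    NPROC_PER_NODE=8 \
    ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    swift rlhf \
        --rlhf_type grpo \
        ...  # remaining GRPO arguments as in the linked script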

@heyubox

heyubox commented Apr 25, 2025

> decrease vllm_gpu_memory_utilization
>
> btw
>
> --sleep_level 1
> --offload_model true
> --offload_optimizer true
> --gc_collect_after_offload true
>
> These options are intended for the vLLM backend. Since you have set --use_vllm false, the above arguments will not take effect. Perhaps setting --use_vllm true will work

I'd like to know why decreasing vllm_gpu_memory_utilization works. Aren't sleep_level and the offload options supposed to free up all of the GPU memory occupied by the vLLM backend? I assumed the training would hit a memory peak during offloading, but I'd like to understand the details. Thanks!

@GuliGuli-Boom

> 32B GRPO full training script https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/Qwen2_5_32B_full.sh

KeyError: 'rollout' when running Qwen2_5_32B_full.sh

Traceback (most recent call last):
  File "/miniconda3/envs/SWIFT/bin/swift", line 33, in <module>
    sys.exit(load_entry_point('ms-swift', 'console_scripts', 'swift')())
  File "/ms-swift-main/swift/cli/main.py", line 61, in cli_main
    file_path = importlib.util.find_spec(route_mapping[method_name]).origin
KeyError: 'rollout'

@ViktorJiangC

> 32B GRPO full training script https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/Qwen2_5_32B_full.sh
>
> KeyError: 'rollout' when running Qwen2_5_32B_full.sh
>
> Traceback (most recent call last):
>   File "/miniconda3/envs/SWIFT/bin/swift", line 33, in <module>
>     sys.exit(load_entry_point('ms-swift', 'console_scripts', 'swift')())
>   File "/ms-swift-main/swift/cli/main.py", line 61, in cli_main
>     file_path = importlib.util.find_spec(route_mapping[method_name]).origin
> KeyError: 'rollout'

Same problem. I checked main.py and found that 'rollout' had been removed.
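
A quick way to check whether the installed ms-swift actually ships the rollout route (a hedged sketch; it only relies on the swift/cli/main.py module path shown in the traceback above):

    # Locate the installed swift/cli/main.py and look for the 'rollout' route.
    # No matches suggests this ms-swift version does not provide the rollout
    # subcommand the script invokes, i.e. the script and the install are mismatched.
    MAIN_PY="$(python -c 'import swift.cli.main as m; print(m.__file__)')"
    grep -n "rollout" "$MAIN_PY"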
