GRPO training of a 32B model OOM #3871
Comments
you can try LoRA
I tried LoRA, but still got OOM.
decrease … btw
These options are intended for the vLLM backend. Since you have set --use_vllm false, they do not take effect.
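For context, a minimal sketch of the pairing that comment describes, using the same flag names as the full script later in this thread (values are illustrative; the key point is that the vllm_* options are only read when --use_vllm true is set):

# Hedged sketch: enable the vLLM backend so the vllm_* options take effect.
swift rlhf \
    --rlhf_type grpo \
    --model xxxx/xxxxx \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.5 \
    --vllm_max_model_len 8192
# plus the usual dataset/training flags from the full script below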
I ran into the same problem: like you, the OOM happens later, in the vLLM-deployed inference service. I'm curious why the vLLM service here doesn't expose the tensor_parallel_size and pipeline_parallel_size parameters.
Met the same problem, have you solved it yet?
The tensor parallelism for async mode and the 32B full GRPO training script are currently in development.
32B GRPO full training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/multi_node/Qwen2_5_32B_full.sh
What if I don't have 8×80G GPUs (1 node, 8 GPUs per node), but instead have 32×32G NPUs (4 nodes, 8 NPUs per node)? How should I rewrite the script to support multi-node training?
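A rough sketch of how the launch variables might change for 4 nodes with 8 devices each, assuming the same torchrun-style environment-variable convention as the script below (ASCEND_RT_VISIBLE_DEVICES is the Ascend NPU counterpart of CUDA_VISIBLE_DEVICES, and NODE_RANK has to be provided per node by the scheduler):

# Hedged sketch: run once on each of the 4 nodes, with NODE_RANK set to 0..3.
NNODES=4 \
NODE_RANK=$RANK \
MASTER_ADDR=$MASTER_ADDR \
MASTER_PORT=$MASTER_PORT \
NPROC_PER_NODE=8 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift rlhf --rlhf_type grpo --deepspeed zero3  # plus the remaining flags from the full script

With 32G devices, ZeRO-3 sharding of parameters, gradients, and optimizer states across all 32 ranks is what makes a 32B full fine-tune plausible at all, so keeping --deepspeed zero3 matters even more here than on 80G cards.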
I want to know why decreasing …
KeyError: 'rollout' when running Qwen2_5_32B_full.sh

Traceback (most recent call last):
…
Same problem. I checked main.py and found that 'rollout' had been removed.
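If 'rollout' was removed from main.py between releases, the example script and the installed ms-swift have likely drifted apart; a plausible (unconfirmed) remedy is to make the two match:

# Hedged suggestion: upgrade to the latest release,
pip install -U ms-swift
# or install from source so that main-branch example scripts match the code:
pip install git+https://github.com/modelscope/ms-swift.git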
Using tensor parallel 8, optimizer offload, flash attention, and vLLM, I still hit OOM on an 8×96G machine. The exact configuration and error message are below:
nproc_per_node=8
nnodes=1  # single node with 8 GPUs

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=$nnodes \
NODE_RANK=$RANK \
MASTER_ADDR=$MASTER_ADDR \
MASTER_PORT=$MASTER_PORT \
NPROC_PER_NODE=$nproc_per_node \
swift rlhf \
    --rlhf_type grpo \
    --model xxxx/xxxxx \
    --model_type qwq \
    --attn_impl flash_attn \
    --gradient_checkpointing true \
    --reward_funcs reflection_q \
    --use_vllm false \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.8 \
    --vllm_max_model_len 8192 \
    --num_infer_workers 8 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'xxxxx.jsonl' \
    --max_completion_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 8 \
    --eval_steps 200 \
    --save_steps 200 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 8 \
    --dataset_num_proc 8 \
    --num_generations 8 \
    --temperature 0.9 \
    --deepspeed zero3 \
    --log_completions true \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --tensor_parallel_size 8
Error message:
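On the OOM itself, the usual levers in a configuration like the one above are the rollout and sequence budgets. A hedged sketch of the flags one might lower first, with illustrative values (same option names as the script above; the vllm_* values only matter once --use_vllm true is set):

# Hedged sketch: replace the corresponding lines in the script above.
--num_generations 4 \
--max_completion_length 1024 \
--gradient_accumulation_steps 16 \
--vllm_gpu_memory_utilization 0.5 \
--vllm_max_model_len 4096

Halving num_generations and the completion length shrinks both the KV cache during rollout and the per-step activation footprint, while doubling gradient_accumulation_steps keeps the effective batch size unchanged.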