Description
Describe the bug
What the bug is and how to reproduce it, preferably with screenshots
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=7 \
swift rlhf \
    --rlhf_type grpo \
    --model /models/Qwen2.5-7B-Instruct \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_countdown format \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.6 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'zouxuhong/Countdown-Tasks-3to4#50000' \
    --max_length 2048 \
    --max_completion_length 1500 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --learning_rate 5e-7 \
    --gradient_accumulation_steps 8 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 3 \
    --logging_steps 10 \
    --output_dir xx/ms-swift/saves/grpo/GRPO_COUNTDOWN \
    --warmup_ratio 0.01 \
    --dataloader_num_workers 4 \
    --num_generations 8 \
    --seed 42 \
    --data_seed 42 \
    --temperature 1.0 \
    --system 'You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.' \
    --deepspeed zero3 \
    --log_completions true \
    --vllm_max_model_len 1500 \
    --report_to wandb \
    --beta 0.001 \
    --num_iterations 1
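For anyone trying to reproduce this, here is a rough sketch of the arithmetic implied by the flags above. The numbers are copied from the command. The interpretation is an assumption, not confirmed behavior of swift: it assumes the one GPU not used by a training process is the one `--vllm_device auto` picks, and that the per-device batch size counts completions, as in TRL-style GRPO.

```python
# Values copied from the command above.
visible_gpus = 8                    # CUDA_VISIBLE_DEVICES=0,...,7
nproc_per_node = 7                  # NPROC_PER_NODE
per_device_train_batch_size = 8
gradient_accumulation_steps = 8
num_generations = 8

# Assumption: GPUs without a training process are left over for vLLM.
gpus_left_for_vllm = visible_gpus - nproc_per_node

# Assumption: the per-device batch size counts completions, not prompts.
completions_per_step = nproc_per_node * per_device_train_batch_size
completions_per_update = completions_per_step * gradient_accumulation_steps
prompts_per_update = completions_per_update // num_generations

# Each prompt's group of generations should fit evenly into one global step.
divides_evenly = completions_per_step % num_generations == 0

print("GPUs left for vLLM:", gpus_left_for_vllm)
print("completions per optimizer update:", completions_per_update)
print("prompts per optimizer update:", prompts_per_update)
print("global batch divisible by num_generations:", divides_evenly)
```

Under these assumptions the setup is 7 training GPUs plus 1 vLLM GPU, with 448 completions (56 prompts) per optimizer update.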
Your hardware and system info
Write your system info here, such as CUDA version, operating system, GPU model, and torch version
swift: 3.x
vllm: 0.7.2
Additional context
Add any other context about the problem here