Description
Describe the bug
What the bug is and how to reproduce it, preferably with screenshots


The bash command is as follows:
MAX_PIXELS=262144 \
MASTER_PORT=29600 \
NPROC_PER_NODE=6 \
swift rlhf \
    --rlhf_type grpo \
    --model /root/Qwen2.5vl-3B \
    --external_plugins /root/lml/wcq/ms-swift/examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_r1v_acc format \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.9 \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'lmms-lab/multimodal-open-r1-8k-verified' \
    --max_length 8192 \
    --max_completion_length 1024 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 2 \
    --save_strategy 'steps' \
    --eval_strategy 'steps' \
    --eval_steps 400 \
    --save_steps 400 \
    --save_total_limit 10 \
    --logging_steps 1 \
    --output_dir output/GRPO_GEOQA \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --num_generations 2 \
    --temperature 1.0 \
    --repetition_penalty 1.1 \
    --system '/root/lml/wcq/ms-swift/examples/train/grpo/prompt.txt' \
    --deepspeed zero3 \
    --log_completions true \
    --num_iterations 2 \
    --num_infer_workers 2 \
    --async_generate false \
    --beta 0.001 \
    --max_grad_norm 0.5
Additional context
Add any other context about the problem here
The system has eight 3090 GPUs.
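
For anyone triaging this, the resource arithmetic implied by the command can be sketched as below. This is only a rough sanity check, not something taken from the ms-swift documentation. It assumes that the training processes and the inference workers occupy separate GPUs, and that the number of completions per optimizer step is the per-device batch size times the number of training processes times the gradient accumulation steps. The upper-case variable names other than NPROC_PER_NODE are made up for illustration.

```shell
#!/usr/bin/env bash
# Values copied from the command in this report.
NPROC_PER_NODE=6        # training processes
NUM_INFER_WORKERS=2     # --num_infer_workers
TOTAL_GPUS=8            # eight 3090 cards
PER_DEVICE_BATCH=2      # --per_device_train_batch_size
GRAD_ACCUM=2            # --gradient_accumulation_steps
NUM_GENERATIONS=2       # --num_generations

# Assumption: training and inference workers use disjoint GPUs.
GPUS_NEEDED=$((NPROC_PER_NODE + NUM_INFER_WORKERS))

# Assumption: completions per optimizer step =
#   per-device batch x training processes x accumulation steps.
COMPLETIONS_PER_STEP=$((PER_DEVICE_BATCH * NPROC_PER_NODE * GRAD_ACCUM))

# Each prompt is sampled NUM_GENERATIONS times.
PROMPTS_PER_STEP=$((COMPLETIONS_PER_STEP / NUM_GENERATIONS))

echo "GPUs needed: ${GPUS_NEEDED} of ${TOTAL_GPUS}"
echo "Completions per step: ${COMPLETIONS_PER_STEP}"
echo "Prompts per step: ${PROMPTS_PER_STEP}"
```

Under these assumptions the 6 training processes plus 2 inference workers account for all 8 cards, so no GPU is left idle or shared.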