Qwen2.5-VL 72B GRPO training (LoRA) hangs for no reason. #3592
Comments
I use the ms-swift main branch as of Mar 17th.
+1
@Jintao-Huang @hjh0119 Hi guys, have you ever tried Qwen2.5-VL 72B GRPO training with LoRA? Could you please share any possible best practices? The hanging problem is quite weird; I haven't even modified the source code.
checking
The training script for VL 72B is on the way.
72B VL GRPO training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/grpo/lora_qwenvl72b.sh
I will try it ASAP! Thank you for this great job!
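For reference, running that example would look roughly like the sketch below; the clone location is just an assumption, and the script's actual contents live at the URL above.

# Rough sketch of trying the official example script (clone path assumed, not from the thread):
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
bash examples/train/grpo/lora_qwenvl72b.sh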
Hi guys,
I am using 8 × A100 80G GPUs to run Qwen2.5-VL 72B GRPO training, but the whole process hangs right at the beginning.
Do you have any ideas on how to solve this problem, or any best practices for using Qwen2.5-VL 72B in GRPO training?
Here is my shell command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
MAX_PIXELS=640000 \
swift rlhf \
    --rlhf_type grpo \
    --model /mnt2/models/Qwen__Qwen2.5-VL-72B-Instruct \
    --train_type lora \
    --dataset /ossfs/workspace/data_process/xx.json \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --max_length 2048 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --eval_steps 1000 \
    --save_steps 2000 \
    --learning_rate 1e-6 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --output_dir /mnt2/xx \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --max_completion_length 2048 \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_ui_acc uiformat \
    --num_generations 4 \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.3 \
    --vllm_max_model_len 2048 \
    --deepspeed zero3_offload \
    --temperature 1.1 \
    --top_p 1.0 \
    --top_k 80 \
    --log_completions true \
    --num_infer_workers 8 \
    --tensor_parallel_size 8 \
    --async_generate false \
    --offload_optimizer true \
    --offload_model true \
    --gc_collect_after_offload true \
    --move_model_batches 16 \
    --sleep_level 1 \
    --report_to swanlab
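As a general way to narrow down where a run like this hangs (a generic diagnostic sketch, not something suggested in the thread), verbose NCCL/torch.distributed logging plus a py-spy stack dump of a stuck rank usually shows which collective or worker is blocked:

# Generic hang diagnostics (not ms-swift specific; the PID placeholder is hypothetical):
export NCCL_DEBUG=INFO                 # log NCCL init and collective activity
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra torch.distributed debug checks
# rerun the training command above; once it hangs, dump the Python stacks of a stuck rank:
pip install py-spy
py-spy dump --pid <PID_OF_A_STUCK_RANK>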
Here are the relevant library versions:
vllm 0.7.3
trl 0.16.0.dev0
transformers 4.49.0
torch 2.5.1+cu121
peft 0.14.0
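For completeness, one way to capture these versions (assuming the packages were installed with pip) is:

# Print the versions of the packages listed above (assumes a pip-managed environment):
pip list | grep -Ei "vllm|trl|transformers|torch|peft|ms-swift"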