Hi all,
I am running GRPO RL training with LoRA using ModelScope MS-SWIFT.
I cannot resume directly from the checkpoint because the number of GPUs available to me has been reduced (ref: #3989). Instead, I merged the LoRA checkpoint into a full model and started a new training run from that merged model.
In the training script, I pass the path of the merged full model:
swift rlhf \
    --rlhf_type grpo \
    --model /myprojects/ms-swift/output/Qwen2.5-7B-32GPUs/v3-20250423-132415/checkpoint-400-mergedfull \
    --model_type qwen2_5 \
    --train_type lora \
Surprisingly, training hung after one step.
The whole program froze...
What's wrong?
Could anybody help?
Thanks!