Bug! Help! MS-SWIFT GRPO + LoRA training hung/stuck after training 1 step from full merged model merged from lora adapter #3990

tjoymeed · 2025-04-24T23:40:15Z

Hi all,

I am running GRPO RL training with LoRA using ModelScope MS-SWIFT.

I cannot resume training directly from the checkpoint because the number of GPUs available to me has been reduced (ref: #3989). As a workaround, I converted the checkpoint into a merged full model and started training from scratch from that merged model.

In the training script, I pass the path to the merged full model:

swift rlhf \
--rlhf_type grpo \
--model /myprojects/ms-swift/output/Qwen2.5-7B-32GPUs/v3-20250423-132415/checkpoint-400-mergedfull \
--model_type qwen2_5 \
--train_type lora \

Surprisingly, training hung after one step.

The whole program froze...

What's wrong?

Could anybody help?

Thanks!

slin000111 · 2025-04-27T09:20:58Z

pip install py-spy
py-spy dump --pid <pid>
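
A multi-GPU run normally has one Python process per rank, and a hang after the first step can mean the ranks are waiting on each other, so it may help to dump every process rather than a single PID. The sketch below is only an illustration: `dump_stacks` is a hypothetical helper, and the match pattern passed to `pgrep -f` is an assumption that you should adjust to whatever your training processes look like in `ps`. py-spy may need elevated privileges (for example `sudo`) to attach.

```shell
# Hypothetical helper: run `py-spy dump` on every process whose command line
# matches a pattern. PYSPY can be overridden, e.g. PYSPY="sudo py-spy".
dump_stacks() {
  pattern="$1"
  pyspy="${PYSPY:-py-spy}"
  for pid in $(pgrep -f "$pattern"); do
    echo "=== PID $pid ==="
    # Keep going if one process cannot be attached to.
    $pyspy dump --pid "$pid" || echo "dump failed for PID $pid"
  done
}
```

For example, `dump_stacks "rlhf" > stacks.txt` collects all dumps into one file (assuming "rlhf" appears in the command line of your training processes). Comparing where each rank is blocked is usually more informative than a single stack.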
