Bug! Checkpoint resume failure - DeepSpeed different DP size. Is there a quick checkpoint converter anywhere? #3989

tjoymeed opened this issue Apr 24, 2025 · 1 comment

tjoymeed commented Apr 24, 2025

Hi all,

I am using the latest MS-SWIFT GRPO LoRA training, and I run the training on 4x8 = 32 GPUs.

Now I need to resume training on 2x8 = 16 GPUs.

But simply adding --resume_from_checkpoint doesn't work: DeepSpeed complains about a different DP (data-parallel) size.

I also tried DeepSpeed's universal checkpoint converter, but it gave the errors below. How can I fix them?

Also, are the checkpoints saved by MS-SWIFT GRPO LoRA training full checkpoints, or only the LoRA adapter without the base model (i.e., not yet merged)?
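
For reference, DeepSpeed's documented path for resuming with a different DP size is the universal checkpoint: convert with ds_to_universal.py, then resume with `load_universal` enabled in the DeepSpeed config. A minimal sketch of the config side — only the `"checkpoint": {"load_universal": true}` key is DeepSpeed's universal-checkpoint switch; the other values are placeholders, not MS-SWIFT's actual settings:

```python
# Sketch: DeepSpeed config for resuming from a converted *universal*
# checkpoint. "checkpoint": {"load_universal": true} is the documented
# DeepSpeed key; the batch-size and ZeRO values are placeholders.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 1,    # placeholder
    "zero_optimization": {"stage": 2},      # placeholder
    "checkpoint": {"load_universal": True}, # load the universal checkpoint
}

print(json.dumps(ds_config, indent=2))
```

Without this key, DeepSpeed will try to load the sharded ZeRO checkpoint directly and fail when the DP size differs.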


If I run the following command:
python /myprojects/venv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py --input_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400 --output_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal

I got the following errors:
[2025-04-24 12:59:09,183] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-04-24 12:59:09,193] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cpu (auto detect)
args = Namespace(input_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400', output_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal', num_extract_workers=4, num_merge_workers=2, keep_temp_folder=False, strict=True, inject_missing_state=False)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400 to Universal checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal
Traceback (most recent call last):
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 549, in <module>
main(args)
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 474, in main
optim_files = _get_optim_files(args.input_folder)
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 432, in _get_optim_files
return _get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 443, in _get_checkpoint_files
raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
FileNotFoundError: can't find *_optim_states.pt files in directory '/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400'
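
This first error means the converter found no `*_optim_states.pt` shards directly in the folder it was given; with ZeRO checkpoints these shards normally live inside the `global_stepN/` subfolder rather than the top-level `checkpoint-400/` folder. A quick helper to see where (or whether) they exist — `find_optim_state_dirs` is a hypothetical name, not part of DeepSpeed:

```python
# Sketch: locate DeepSpeed optimizer-state shards under a checkpoint tree.
# ds_to_universal.py expects its --input_folder to contain files matching
# "*_optim_states.pt"; this walks the tree to show which directory (if any)
# actually holds them.
from pathlib import Path

def find_optim_state_dirs(checkpoint_dir):
    """Return sorted directories that contain *_optim_states.pt shards."""
    root = Path(checkpoint_dir)
    return sorted({p.parent for p in root.rglob("*_optim_states.pt")})
```

If this returns an empty list even for the whole checkpoint folder, the checkpoint contains no ZeRO optimizer shards at all — which would be the case if MS-SWIFT saved only the LoRA adapter.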

If I run the following command:
python /myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py --input_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/ --output_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal

I got the following errors:
[2025-04-24 13:02:05,001] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-04-24 13:02:05,014] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cpu (auto detect)
args = Namespace(input_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/', output_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal', num_extract_workers=4, num_merge_workers=2, keep_temp_folder=False, strict=True, inject_missing_state=False)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/ to Universal checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal
Traceback (most recent call last):
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 549, in <module>
main(args)
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 482, in main
_check_for_required_state(ds_checkpoint)
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 466, in _check_for_required_state
assert universal_checkpoint_info is not None, f'Required {UNIVERSAL_CHECKPOINT_INFO} state is missing in checkpoint. Verify that client creates this state.'
AssertionError: Required universal_checkpoint_info state is missing in checkpoint. Verify that client creates this state.
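
This second error means the trainer never wrote the `universal_checkpoint_info` block that the converter requires. The Namespace dump above shows the converter has an `inject_missing_state` option (default False), which asks it to synthesize that missing state. A sketch of the retry, reusing the paths from the command above — check the script's `--help` for the exact flag spelling, since this is inferred from the Namespace dump rather than confirmed:

```python
# Sketch: rebuild the converter command with the inject_missing_state option
# enabled, so ds_to_universal.py synthesizes the missing
# universal_checkpoint_info block. Flag spelling inferred from the Namespace
# dump above (inject_missing_state=False); verify with --help.
import shlex

converter = ("/myprojects/myvenv_msswift/lib/python3.10/site-packages/"
             "deepspeed/checkpoint/ds_to_universal.py")
ckpt = ("/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/"
        "v3-20250423-132415/checkpoint-400")

cmd = [
    "python", converter,
    "--input_folder", f"{ckpt}/global_step400",
    "--output_folder", f"{ckpt}/global_step400_universal",
    "--inject_missing_state",
]
print(shlex.join(cmd))
```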

@tjoymeed (Author)

I am now trying to convert the LoRA adapter part.

After converting the LoRA adapter, I used the "--peft-path" parameter in the training shell script to set the path to the new adapter, but the parameter doesn't seem to work.

And then I used the "adapters" parameter in the training shell script and got the following error message:

raise ValueError(f'Please set --model <model_id_or_path>`, model: {self.model}')

ValueError: Please set --model <model_id_or_path>`, model: None

But I have already set it.

My shell script config looks like the following:

--model Qwen/Qwen2.5-7B-Instruct \
--adapters /myprojects/ms-swift/output/Qwen2.5-7B-Instruct/v3-20250423-132415/checkpoint-400-converted-lora-adapter \

What's wrong?
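
One thing worth ruling out before debugging the CLI: that the `--adapters` path actually contains a standard PEFT adapter. This check assumes the usual PEFT layout (`adapter_config.json` plus `adapter_model.safetensors` or `adapter_model.bin`); `looks_like_peft_adapter` is a hypothetical helper, and if the files are missing the path may be a converted full/DeepSpeed checkpoint rather than an adapter:

```python
# Sketch: sanity-check that a path holds a standard PEFT LoRA adapter.
# Assumption: PEFT's usual on-disk layout of a config file plus a weights
# file; a converted full checkpoint would not have these names.
from pathlib import Path

def looks_like_peft_adapter(path):
    """True if path contains adapter_config.json and an adapter weights file."""
    p = Path(path)
    has_config = (p / "adapter_config.json").is_file()
    has_weights = any(
        (p / name).is_file()
        for name in ("adapter_model.safetensors", "adapter_model.bin")
    )
    return has_config and has_weights
```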
