Bug! Checkpoint resume failure - DeepSpeed different DP size. Is there a quick checkpoint converter anywhere? #3989

tjoymeed opened this issue Apr 24, 2025 · 1 comment

tjoymeed commented Apr 24, 2025

Hi all,

I am using the latest MS-SWIFT GRPO LoRA training, and I run the training on 4x8 = 32 GPUs.

Now I need to resume training on 2x8 = 16 GPUs.

But simply adding --resume_from_checkpoint doesn't work: DeepSpeed complains about a different DP (data-parallel) size.

I also tried DeepSpeed's universal checkpoint converter, but it gave the errors below. How can I fix them?

Also, are the checkpoints saved by MS-SWIFT GRPO LoRA training full checkpoints, or only the LoRA adapter without the base model (i.e., not yet merged)?
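
For reference, DeepSpeed's documented path for resuming with a different DP size is the universal checkpoint: convert with ds_to_universal.py, then resume with `load_universal` enabled in the DeepSpeed config. A minimal sketch of the config side — only the `"checkpoint": {"load_universal": true}` key is DeepSpeed's universal-checkpoint switch; the other values are placeholders, not MS-SWIFT's actual settings:

```python
# Sketch: DeepSpeed config for resuming from a converted *universal*
# checkpoint. "checkpoint": {"load_universal": true} is the documented
# DeepSpeed key; the batch-size and ZeRO values are placeholders.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 1,    # placeholder
    "zero_optimization": {"stage": 2},      # placeholder
    "checkpoint": {"load_universal": True}, # load the universal checkpoint
}

print(json.dumps(ds_config, indent=2))
```

Without this key, DeepSpeed will try to load the sharded ZeRO checkpoint directly and fail when the DP size differs.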


If I run the following command:
python /myprojects/venv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py --input_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400 --output_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal

I got the following errors:
[2025-04-24 12:59:09,183] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-04-24 12:59:09,193] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cpu (auto detect)
args = Namespace(input_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400', output_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal', num_extract_workers=4, num_merge_workers=2, keep_temp_folder=False, strict=True, inject_missing_state=False)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400 to Universal checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal
Traceback (most recent call last):
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 549, in <module>
main(args)
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 474, in main
optim_files = _get_optim_files(args.input_folder)
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 432, in _get_optim_files
return _get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 443, in _get_checkpoint_files
raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
FileNotFoundError: can't find *_optim_states.pt files in directory '/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400'
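
This first error means the converter found no `*_optim_states.pt` shards directly in the folder it was given; with ZeRO checkpoints these shards normally live inside the `global_stepN/` subfolder rather than the top-level `checkpoint-400/` folder. A quick helper to see where (or whether) they exist — `find_optim_state_dirs` is a hypothetical name, not part of DeepSpeed:

```python
# Sketch: locate DeepSpeed optimizer-state shards under a checkpoint tree.
# ds_to_universal.py expects its --input_folder to contain files matching
# "*_optim_states.pt"; this walks the tree to show which directory (if any)
# actually holds them.
from pathlib import Path

def find_optim_state_dirs(checkpoint_dir):
    """Return sorted directories that contain *_optim_states.pt shards."""
    root = Path(checkpoint_dir)
    return sorted({p.parent for p in root.rglob("*_optim_states.pt")})
```

If this returns an empty list even for the whole checkpoint folder, the checkpoint contains no ZeRO optimizer shards at all — which would be the case if MS-SWIFT saved only the LoRA adapter.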

If I run the following command:
python /myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py --input_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/ --output_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal

I got the following errors:
[2025-04-24 13:02:05,001] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-04-24 13:02:05,014] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cpu (auto detect)
args = Namespace(input_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/', output_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal', num_extract_workers=4, num_merge_workers=2, keep_temp_folder=False, strict=True, inject_missing_state=False)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/ to Universal checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal
Traceback (most recent call last):
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 549, in <module>
main(args)
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 482, in main
_check_for_required_state(ds_checkpoint)
File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 466, in _check_for_required_state
assert universal_checkpoint_info is not None, f'Required {UNIVERSAL_CHECKPOINT_INFO} state is missing in checkpoint. Verify that client creates this state.'
AssertionError: Required universal_checkpoint_info state is missing in checkpoint. Verify that client creates this state.
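
This second error means the trainer never wrote the `universal_checkpoint_info` block that the converter requires. The Namespace dump above shows the converter has an `inject_missing_state` option (default False), which asks it to synthesize that missing state. A sketch of the retry, reusing the paths from the command above — check the script's `--help` for the exact flag spelling, since this is inferred from the Namespace dump rather than confirmed:

```python
# Sketch: rebuild the converter command with the inject_missing_state option
# enabled, so ds_to_universal.py synthesizes the missing
# universal_checkpoint_info block. Flag spelling inferred from the Namespace
# dump above (inject_missing_state=False); verify with --help.
import shlex

converter = ("/myprojects/myvenv_msswift/lib/python3.10/site-packages/"
             "deepspeed/checkpoint/ds_to_universal.py")
ckpt = ("/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/"
        "v3-20250423-132415/checkpoint-400")

cmd = [
    "python", converter,
    "--input_folder", f"{ckpt}/global_step400",
    "--output_folder", f"{ckpt}/global_step400_universal",
    "--inject_missing_state",
]
print(shlex.join(cmd))
```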

@tjoymeed (Author)

I am now trying to convert the LoRA adapter part.

After converting the LoRA adapter, I used the "--peft-path" parameter in the training shell script to set the path to the new adapter, but the parameter doesn't seem to work.

And then I used the "adapters" parameter in the training shell script and got the following error message:

raise ValueError(f'Please set --model <model_id_or_path>`, model: {self.model}')

ValueError: Please set --model <model_id_or_path>`, model: None

But I have already set it.

My shell script config looks like the following:

--model Qwen/Qwen2.5-7B-Instruct \
--adapters /myprojects/ms-swift/output/Qwen2.5-7B-Instruct/v3-20250423-132415/checkpoint-400-converted-lora-adapter \

What's wrong?
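
One thing worth ruling out before debugging the CLI: that the `--adapters` path actually contains a standard PEFT adapter. This check assumes the usual PEFT layout (`adapter_config.json` plus `adapter_model.safetensors` or `adapter_model.bin`); `looks_like_peft_adapter` is a hypothetical helper, and if the files are missing the path may be a converted full/DeepSpeed checkpoint rather than an adapter:

```python
# Sketch: sanity-check that a path holds a standard PEFT LoRA adapter.
# Assumption: PEFT's usual on-disk layout of a config file plus a weights
# file; a converted full checkpoint would not have these names.
from pathlib import Path

def looks_like_peft_adapter(path):
    """True if path contains adapter_config.json and an adapter weights file."""
    p = Path(path)
    has_config = (p / "adapter_config.json").is_file()
    has_weights = any(
        (p / name).is_file()
        for name in ("adapter_model.safetensors", "adapter_model.bin")
    )
    return has_config and has_weights
```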
