I am using the latest MS-SWIFT GRPO LoRA training, and I ran the training on 4x8 = 32 GPUs.
Now I need to resume training on 2x8 = 16 GPUs.
Simply adding `--resume_from_checkpoint` doesn't work: DeepSpeed complains about a different DP (data-parallel) size.
I also tried the DeepSpeed universal checkpoint converter, but it gave the errors below. How can I fix them?
Also, are the checkpoints saved by MS-SWIFT GRPO LoRA training full checkpoints, or only the LoRA part without the base model (i.e., not merged yet)?
If I run the following command:

```shell
python /myprojects/venv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py --input_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400 --output_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal
```
I got the following errors:

```
[2025-04-24 12:59:09,183] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-04-24 12:59:09,193] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cpu (auto detect)
args = Namespace(input_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400', output_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal', num_extract_workers=4, num_merge_workers=2, keep_temp_folder=False, strict=True, inject_missing_state=False)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400 to Universal checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400-universal
Traceback (most recent call last):
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 549, in <module>
    main(args)
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 474, in main
    optim_files = _get_optim_files(args.input_folder)
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 432, in _get_optim_files
    return _get_checkpoint_files(checkpoint_dir, "*_optim_states.pt")
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 443, in _get_checkpoint_files
    raise FileNotFoundError(f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
FileNotFoundError: can't find *_optim_states.pt files in directory '/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400'
```
If I run the following command:

```shell
python /myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py --input_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/ --output_folder /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal
```
I got the following errors:

```
[2025-04-24 13:02:05,001] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2025-04-24 13:02:05,014] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cpu (auto detect)
args = Namespace(input_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/', output_folder='/myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal', num_extract_workers=4, num_merge_workers=2, keep_temp_folder=False, strict=True, inject_missing_state=False)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400/ to Universal checkpoint in /myprojects/ms-swift/output/Qwen2.5-7B-Instruct-GRPO-24000-4.17-32GPUs/v3-20250423-132415/checkpoint-400/global_step400_universal
Traceback (most recent call last):
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 549, in <module>
    main(args)
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 482, in main
    _check_for_required_state(ds_checkpoint)
  File "/myprojects/myvenv_msswift/lib/python3.10/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 466, in _check_for_required_state
    assert universal_checkpoint_info is not None, f'Required {UNIVERSAL_CHECKPOINT_INFO} state is missing in checkpoint. Verify that client creates this state.'
AssertionError: Required universal_checkpoint_info state is missing in checkpoint. Verify that client creates this state.
```