Fixed the bug related to saving DeepSpeed models. #6628


Merged

5 commits merged into huggingface:main on Jan 19, 2024

Conversation

HelloWorldBeginner
Contributor

@HelloWorldBeginner HelloWorldBeginner commented Jan 18, 2024

What does this PR do?

When training a model with DeepSpeed through the accelerate library, I hit an error while saving a checkpoint. The model passed to save_model_hook is a DeepSpeedEngine, which caused an "unexpected save model" error. To resolve this, the model needs to be unwrapped first so that isinstance checks against the underlying model type succeed. With this change, the model is saved correctly.
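The idea can be sketched as follows. This is a minimal illustration of the unwrapping pattern, not the actual diffusers patch; `FakeEngine` and `MyModel` are hypothetical stand-ins for DeepSpeedEngine and the real model class, and the only assumption is that the wrapper exposes the underlying model via a `.module` attribute, as DeepSpeedEngine does.

```python
# Minimal sketch of the unwrapping idea (illustrative only; FakeEngine and
# MyModel are hypothetical stand-ins, not real library classes).

class MyModel:
    """Stands in for the real model (e.g. the LoRA-wrapped UNet)."""
    pass

class FakeEngine:
    """Stands in for a wrapper such as DeepSpeedEngine, which exposes
    the wrapped model via its .module attribute."""
    def __init__(self, module):
        self.module = module

def unwrap_model(model):
    # Peel off wrapper layers until the underlying model is reached.
    while hasattr(model, "module"):
        model = model.module
    return model

wrapped = FakeEngine(MyModel())
# Without unwrapping, an isinstance check against the model class fails:
assert not isinstance(wrapped, MyModel)
# After unwrapping, it succeeds, so a type-dispatching save hook can work:
assert isinstance(unwrap_model(wrapped), MyModel)
```

The same pattern handles nested wrappers (e.g. a distributed wrapper around an engine), since the loop keeps descending through `.module` until none remains.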

Fixes # (issue)

After this fix, the checkpoint can be saved successfully; the training log below shows a checkpoint being written:

Steps:   4%|███▊                                                                                            | 80/2000 [03:38<53:16,  1.66s/it, lr=0.0001, step_loss=0.0303]
01/18/2024 20:54:31 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora-sdxl/checkpoint-80
01/18/2024 20:54:31 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[2024-01-18 20:54:31,401] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is about to be saved!
[2024-01-18 20:54:32,867] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: sd-pokemon-model-lora-sdxl/checkpoint-80/pytorch_model/mp_rank_00_model_states.pt
[2024-01-18 20:54:32,868] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sd-pokemon-model-lora-sdxl/checkpoint-80/pytorch_model/mp_rank_00_model_states.pt...
[2024-01-18 20:54:46,415] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved sd-pokemon-model-lora-sdxl/checkpoint-80/pytorch_model/mp_rank_00_model_states.pt.
[2024-01-18 20:54:46,472] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sd-pokemon-model-lora-sdxl/checkpoint-80/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-01-18 20:54:46,473] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sd-pokemon-model-lora-sdxl/checkpoint-80/pytorch_model/zero_pp_rank_1_mp_rank_00_optim_states.pt...
[2024-01-18 20:54:46,473] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sd-pokemon-model-lora-sdxl/checkpoint-80/pytorch_model/zero_pp_rank_3_mp_rank_00_optim_states.pt...
[2024-01-18 20:54:46,473] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sd-pokemon-model-lora-sdxl/checkpoint-80/pytorch_model/zero_pp_rank_2_mp_rank_00_optim_states.pt...
[2024-01-18 20:54:46,474] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sd-pokemon-model-lora-sdxl/checkpoint-80/pytorch_model/zero_pp_rank_5_mp_rank_00_optim_states.pt...
[2024-01-18 20:54:46,473] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sd-pokemon-model-lora-sdxl/checkpoint-80/pytorch_model/zero_pp_rank_4_mp_rank_00_optim_states.pt...
[2024-01-18 20:54:46,476] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sd-pokemon-model-lora-sdxl/checkpoint-80/pytorch_model/zero_pp_rank_6_mp_rank_00_optim_states.pt...
[2024-01-18 20:54:46,477] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sd-pokemon-model-lora-sdxl/checkpoint-80/pytorch_model/zero_pp_rank_7_mp_rank_00_optim_states.pt...
[2024-01-18 20:54:46,499] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved sd-pokemon-model-lora-sdxl/checkpoint-80/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-01-18 20:54:46,499] [INFO] [engine.py:3431:_save_zero_checkpoint] zero checkpoint saved sd-pokemon-model-lora-sdxl/checkpoint-80/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-01-18 20:54:46,499] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint pytorch_model is ready now!

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul @patrickvonplaten


@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul
Member

sayakpaul commented Jan 18, 2024

Thanks! Can you share an example training command to check with DeepSpeed?

I checked if the changes worked without DeepSpeed and they do: https://colab.research.google.com/gist/sayakpaul/6d60b261a42e0e9fb07c0e9505e7b82f/scratchpad.ipynb

@HelloWorldBeginner
Contributor Author

HelloWorldBeginner commented Jan 19, 2024

> Thanks! Can you share an example training command to check with DeepSpeed?
>
> I checked if the changes worked without DeepSpeed and they do: https://colab.research.google.com/gist/sayakpaul/6d60b261a42e0e9fb07c0e9505e7b82f/scratchpad.ipynb

Thank you for your reply. Here is my training script; the dataset is from the Hugging Face Hub, and I trained on a single A100.

Training shell script:

export MODEL_NAME="/home/mhh/sd_models/stable-diffusion-xl-base-1.0"
export VAE_NAME="/home/mhh/sd_models/sdxl-vae-fp16-fix"
export DATASET_NAME="/home/mhh/sd_datasets/pokemon-blip-captions"
export CUDA_VISIBLE_DEVICES="6"

DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
TEST_NAME="lora_fsdp_fp16"

accelerate launch  --config_file "./lora_dp_accelerate.yaml"  --main_process_port 12504 train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --dataset_name=$DATASET_NAME --caption_column="text" \
  --resolution=1024  \
  --train_batch_size=1 \
  --num_train_epochs=2 \
  --checkpointing_steps=2 \
  --learning_rate=1e-04 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --max_train_steps=2000 \
  --validation_epochs=2000 \
  --seed=1234 \
  --output_dir="sd-pokemon-model-lora-sdxl" \
  --validation_prompt="cute dragon creature" | tee dp_logs/${TEST_NAME}_${DATETIME}.log

Here's my accelerate config file, lora_dp_accelerate.yaml, for training with DeepSpeed.

compute_environment: LOCAL_MACHINE
debug: true
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

@sayakpaul
Member

Thanks! And did you observe any speedups?

Additionally, if you could also modify the README_sdxl.md file to include a note about DeepSpeed, that would be nice. We can then merge ;-)

@HelloWorldBeginner
Contributor Author

> Thanks! And did you observe any speedups?
>
> Additionally, if you could also modify the README_sdxl.md file to include a note about DeepSpeed, that would be nice. We can then merge ;-)

With DeepSpeed, GPU memory usage decreases significantly, which makes it possible to train the model on GPUs with less memory.

I can add this information about DeepSpeed to the README later.

@sayakpaul
Member

That's very good to know. Let's add this info to the README and we can then merge :)

@HelloWorldBeginner
Contributor Author

> That's very good to know. Let's add this info to the README and we can then merge :)

I have updated the README with instructions on how to train the SDXL model using DeepSpeed, please check :)

@sayakpaul
Member

Looking fantastic. Will merge once the CI is green.

@HelloWorldBeginner
Contributor Author

> Looking fantastic. Will merge once the CI is green.

Thank you for your review. I hope this helps diffusers.

@sayakpaul sayakpaul merged commit f95615b into huggingface:main Jan 19, 2024
@sayakpaul
Member

Of course it will!

AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
* Fixed the bug related to saving DeepSpeed models.

* Add information about training SD models using DeepSpeed to the README.

* Apply suggestions from code review

---------

Co-authored-by: mhh001 <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>
3 participants