Commit ddd8bd5

[docs] LCM training (huggingface#5796)
* first draft * feedback
1 parent 9f7b2cf commit ddd8bd5

2 files changed (+257, -0 lines)

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -129,6 +129,8 @@
       title: LoRA
     - local: training/custom_diffusion
       title: Custom Diffusion
+    - local: training/lcm_distill
+      title: Latent Consistency Distillation
     - local: training/ddpo
       title: Reinforcement learning training with DDPO
     title: Methods
Lines changed: 255 additions & 0 deletions
@@ -0,0 +1,255 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Latent Consistency Distillation

[Latent Consistency Models (LCMs)](https://hf.co/papers/2310.04378) can generate high-quality images in just a few steps, a big leap forward given that many pipelines require 25 or more steps. LCMs are produced by applying the latent consistency distillation method to any Stable Diffusion model. This method works by applying *one-stage guided distillation* to the latent space and incorporating a *skipping-step* method that consistently skips timesteps to accelerate the distillation process (refer to sections 4.1, 4.2, and 4.3 of the paper for more details).

If you're training on a GPU with limited VRAM, try enabling `gradient_checkpointing`, `gradient_accumulation_steps`, and `mixed_precision` to reduce memory usage and speed up training. You can reduce your memory usage even more by enabling memory-efficient attention with [xFormers](../optimization/xformers) and [bitsandbytes'](https://github.com/TimDettmers/bitsandbytes) 8-bit optimizer.

This guide will explore the [train_lcm_distill_sd_wds.py](https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_sd_wds.py) script to help you become more familiar with it, and how you can adapt it for your own use case.

Before running the script, make sure you install the library from source:

```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

Then navigate to the example folder containing the training script and install the required dependencies for the script you're using:

```bash
cd examples/consistency_distillation
pip install -r requirements.txt
```

<Tip>

🤗 Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the 🤗 Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more.

</Tip>

Initialize an 🤗 Accelerate environment (try enabling `torch.compile` to significantly speed up training):

```bash
accelerate config
```

To set up a default 🤗 Accelerate environment without choosing any configurations:

```bash
accelerate config default
```

Or if your environment doesn't support an interactive shell, like a notebook, you can use:

```py
from accelerate.utils import write_basic_config

write_basic_config()
```

Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.

## Script parameters

<Tip>

The following sections highlight parts of the training script that are important for understanding how to modify it, but they don't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_sd_wds.py) and let us know if you have any questions or concerns.

</Tip>

The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L419) function. This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like.

For example, to speed up training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command:

```bash
accelerate launch train_lcm_distill_sd_wds.py \
  --mixed_precision="fp16"
```
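
As a rough illustration of how defaults and command-line overrides interact, here is a minimal `argparse` sketch. It is not the script's actual `parse_args()`: the function name `parse_args_sketch`, the allowed choices, and the default values are assumptions made for illustration, so check the script for the real ones.

```python
import argparse


def parse_args_sketch(argv=None):
    # Hypothetical, trimmed-down parser. The flag names mirror ones mentioned in
    # this guide, but the defaults and choices are illustrative assumptions.
    parser = argparse.ArgumentParser(description="Toy LCM distillation argument parser")
    parser.add_argument("--mixed_precision", type=str, default=None, choices=["no", "fp16", "bf16"])
    parser.add_argument("--train_batch_size", type=int, default=16)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    return parser.parse_args(argv)


# No flags: every parameter falls back to its default.
defaults = parse_args_sketch([])
# A flag passed on the command line overrides the default; the rest stay untouched.
overridden = parse_args_sketch(["--mixed_precision=fp16", "--learning_rate=1e-6"])
print(defaults.mixed_precision, overridden.mixed_precision, overridden.learning_rate)
```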

Most of the parameters are identical to the parameters in the [Text-to-image](text2image#script-parameters) training guide, so this guide focuses on the parameters that are relevant to latent consistency distillation.

- `--pretrained_teacher_model`: the path to a pretrained latent diffusion model to use as the teacher model
- `--pretrained_vae_model_name_or_path`: path to a pretrained VAE; the SDXL VAE is known to suffer from numerical instability, so this parameter allows you to specify an alternative VAE (like this [VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) by madebyollin which works in fp16)
- `--w_min` and `--w_max`: the minimum and maximum guidance scale values for guidance scale sampling
- `--num_ddim_timesteps`: the number of timesteps for DDIM sampling
- `--loss_type`: the type of loss (L2 or Huber) to calculate for latent consistency distillation; Huber loss is generally preferred because it's more robust to outliers
- `--huber_c`: the Huber loss parameter
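
To make `--w_min`, `--w_max`, and `--num_ddim_timesteps` more concrete, the sketch below draws a guidance scale uniformly from `[w_min, w_max]` and builds an evenly spaced DDIM timestep grid. It is a plain-Python approximation rather than the script's code: the helper names, the 1000-step training schedule, and the example values are all assumptions.

```python
import random


def sample_guidance_scale(w_min, w_max, rng):
    # Uniform draw in [w_min, w_max].
    return w_min + (w_max - w_min) * rng.random()


def ddim_timestep_grid(num_train_timesteps, num_ddim_timesteps):
    # Keep every `step_ratio`-th timestep of the training schedule.
    step_ratio = num_train_timesteps // num_ddim_timesteps
    return [i * step_ratio - 1 for i in range(1, num_ddim_timesteps + 1)]


rng = random.Random(0)
w = sample_guidance_scale(5.0, 15.0, rng)
grid = ddim_timestep_grid(1000, 50)

# Skipping-step idea: pair a start timestep with the previous grid point (clamped at 0).
start = grid[10]
end = max(start - 1000 // 50, 0)
print(round(w, 2), grid[:3], grid[-1], (start, end))
```

With fewer DDIM timesteps the gap between `start` and `end` grows, so each distillation step skips over more of the original schedule.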

## Training script

The training script starts by creating a dataset class - [`Text2ImageDataset`](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L141) - for preprocessing the images and creating a training dataset.

```py
def transform(example):
    # resize the image, then take a random square crop of size `resolution`
    image = example["image"]
    image = TF.resize(image, resolution, interpolation=transforms.InterpolationMode.BILINEAR)

    c_top, c_left, _, _ = transforms.RandomCrop.get_params(image, output_size=(resolution, resolution))
    image = TF.crop(image, c_top, c_left, resolution, resolution)
    # convert to a tensor and normalize pixel values from [0, 1] to [-1, 1]
    image = TF.to_tensor(image)
    image = TF.normalize(image, [0.5], [0.5])

    example["image"] = image
    return example
```
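
The arithmetic behind the random crop and the `normalize` call can be sketched without torchvision. The helpers below are hypothetical stand-ins written for illustration (they are not torchvision functions), and the image size is an arbitrary example.

```python
import random


def random_crop_offsets(height, width, crop_size, rng):
    # Pick a top-left corner so the crop stays fully inside the image.
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)
    return top, left


def normalize_pixel(value, mean=0.5, std=0.5):
    # Maps a pixel value in [0, 1] to [-1, 1] when mean = std = 0.5.
    return (value - mean) / std


rng = random.Random(0)
# Example: an image whose shorter side was already resized to 512.
top, left = random_crop_offsets(height=512, width=768, crop_size=512, rng=rng)
print(top, left, normalize_pixel(0.0), normalize_pixel(1.0))
```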

For improved performance on reading and writing large datasets stored in the cloud, this script uses the [WebDataset](https://github.com/webdataset/webdataset) format to create a preprocessing pipeline to apply transforms and create a dataset and dataloader for training. Images are processed and fed to the training loop without having to download the full dataset first.

```py
processing_pipeline = [
    wds.decode("pil", handler=wds.ignore_and_continue),
    wds.rename(image="jpg;png;jpeg;webp", text="text;txt;caption", handler=wds.warn_and_continue),
    wds.map(filter_keys({"image", "text"})),
    wds.map(transform),
    wds.to_tuple("image", "text"),
]
```
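
The pipeline above refers to a `filter_keys` helper. A minimal version, assuming it only needs to drop every field except the requested ones, could look like the following. This is a sketch and the script's own helper may differ; the sample dictionary is made up for the example.

```python
def filter_keys(key_set):
    # Return a function that keeps only the entries whose key is in `key_set`.
    def _filter(sample):
        return {key: value for key, value in sample.items() if key in key_set}

    return _filter


# A made-up sample with extra fields that the training loop does not need.
sample = {"__key__": "000001", "image": "<PIL image>", "text": "a caption", "json": {}}
kept = filter_keys({"image", "text"})(sample)
print(sorted(kept))
```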

In the [`main()`](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L768) function, all the necessary components like the noise scheduler, tokenizers, text encoders, and VAE are loaded. The teacher UNet is also loaded here, and a student UNet is then created from the teacher UNet's configuration and weights. The student UNet is updated by the optimizer during training.

```py
teacher_unet = UNet2DConditionModel.from_pretrained(
    args.pretrained_teacher_model, subfolder="unet", revision=args.teacher_revision
)

unet = UNet2DConditionModel(**teacher_unet.config)
unet.load_state_dict(teacher_unet.state_dict(), strict=False)
unet.train()
```

Now you can create the [optimizer](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L979) to update the UNet parameters:

```py
optimizer = optimizer_class(
    unet.parameters(),
    lr=args.learning_rate,
    betas=(args.adam_beta1, args.adam_beta2),
    weight_decay=args.adam_weight_decay,
    eps=args.adam_epsilon,
)
```

Create the [dataset](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L994):

```py
dataset = Text2ImageDataset(
    train_shards_path_or_url=args.train_shards_path_or_url,
    num_train_examples=args.max_train_samples,
    per_gpu_batch_size=args.train_batch_size,
    global_batch_size=args.train_batch_size * accelerator.num_processes,
    num_workers=args.dataloader_num_workers,
    resolution=args.resolution,
    shuffle_buffer_size=1000,
    pin_memory=True,
    persistent_workers=True,
)
train_dataloader = dataset.train_dataloader
```

Next, you're ready to set up the [training loop](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L1049) and implement the latent consistency distillation method (see Algorithm 1 in the paper for more details). This section of the script takes care of adding noise to the latents, sampling and creating a guidance scale embedding, and predicting the original image from the noise.

```py
pred_x_0 = predicted_origin(
    noise_pred,
    start_timesteps,
    noisy_model_input,
    noise_scheduler.config.prediction_type,
    alpha_schedule,
    sigma_schedule,
)

model_pred = c_skip_start * noisy_model_input + c_out_start * pred_x_0
```
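
For intuition, the two steps above can be reproduced with scalars. The sketch assumes an epsilon-prediction model, where the clean sample is recovered as `x_0 = (x_t - sigma_t * eps) / alpha_t`, and uses consistency-model-style boundary scalings so that the output equals the input at timestep 0. The function names, `sigma_data=0.5`, and `timestep_scaling=10.0` are assumptions for illustration and are not necessarily what the script uses.

```python
import math


def predicted_origin_eps(noise_pred, noisy_input, alpha_t, sigma_t):
    # For epsilon prediction: x_t = alpha_t * x_0 + sigma_t * eps, solved for x_0.
    return (noisy_input - sigma_t * noise_pred) / alpha_t


def boundary_scalings(timestep, sigma_data=0.5, timestep_scaling=10.0):
    # c_skip -> 1 and c_out -> 0 as timestep -> 0, so the model is the identity there.
    scaled_t = timestep * timestep_scaling
    c_skip = sigma_data**2 / (scaled_t**2 + sigma_data**2)
    c_out = scaled_t / math.sqrt(scaled_t**2 + sigma_data**2)
    return c_skip, c_out


# Build a noisy sample from a known x_0 and check that it is recovered.
x_0, eps = 0.8, -0.3
alpha_t, sigma_t = 0.6, 0.8  # chosen so that alpha_t^2 + sigma_t^2 = 1
x_t = alpha_t * x_0 + sigma_t * eps
pred_x_0 = predicted_origin_eps(eps, x_t, alpha_t, sigma_t)

c_skip, c_out = boundary_scalings(timestep=0)
model_pred = c_skip * x_t + c_out * pred_x_0  # reduces to x_t at timestep 0
print(pred_x_0, c_skip, c_out)
```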

Next, it gets the [teacher model predictions](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L1172) and the [LCM predictions](https://github.com/huggingface/diffusers/blob/3b37488fa3280aed6a95de044d7a42ffdcb565ef/examples/consistency_distillation/train_lcm_distill_sd_wds.py#L1209), calculates the loss, and then backpropagates it to the LCM.

```py
if args.loss_type == "l2":
    loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
elif args.loss_type == "huber":
    loss = torch.mean(
        torch.sqrt((model_pred.float() - target.float()) ** 2 + args.huber_c**2) - args.huber_c
    )
```
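
To see why the `huber` branch is less sensitive to outliers, compare both losses on a single residual. The `huber` formula above has the form often called pseudo-Huber. This scalar sketch mirrors the two formulas; the function names and the value of `c` are chosen for illustration only.

```python
import math


def l2_loss(residual):
    return residual**2


def pseudo_huber_loss(residual, c):
    # Roughly residual**2 / (2 * c) near zero and |residual| - c for large residuals.
    return math.sqrt(residual**2 + c**2) - c


c = 0.001
for residual in (0.0, 0.0001, 1.0, 10.0):
    print(residual, l2_loss(residual), pseudo_huber_loss(residual, c))
```

The L2 loss grows quadratically with the residual, while the pseudo-Huber loss grows only linearly, so a few very large errors contribute less to the gradient.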

If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers tutorial](../using-diffusers/write_own_pipeline) which breaks down the basic pattern of the denoising process.

## Launch the script

Now you're ready to launch the training script and start distilling!

For this guide, you'll use the `--train_shards_path_or_url` parameter to specify the path to the [Conceptual Captions 12M](https://github.com/google-research-datasets/conceptual-12m) dataset stored on the Hub [here](https://huggingface.co/datasets/laion/conceptual-captions-12m-webdataset). Set the `MODEL_DIR` environment variable to the name of the teacher model and `OUTPUT_DIR` to where you want to save the model.

```bash
export MODEL_DIR="runwayml/stable-diffusion-v1-5"
export OUTPUT_DIR="path/to/saved/model"

accelerate launch train_lcm_distill_sd_wds.py \
    --pretrained_teacher_model=$MODEL_DIR \
    --output_dir=$OUTPUT_DIR \
    --mixed_precision=fp16 \
    --resolution=512 \
    --learning_rate=1e-6 --loss_type="huber" --ema_decay=0.95 --adam_weight_decay=0.0 \
    --max_train_steps=1000 \
    --max_train_samples=4000000 \
    --dataloader_num_workers=8 \
    --train_shards_path_or_url="pipe:curl -L -s https://huggingface.co/datasets/laion/conceptual-captions-12m-webdataset/resolve/main/data/{00000..01099}.tar?download=true" \
    --validation_steps=200 \
    --checkpointing_steps=200 --checkpoints_total_limit=10 \
    --train_batch_size=12 \
    --gradient_checkpointing --enable_xformers_memory_efficient_attention \
    --gradient_accumulation_steps=1 \
    --use_8bit_adam \
    --resume_from_checkpoint=latest \
    --report_to=wandb \
    --seed=453645634 \
    --push_to_hub
```
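
The `{00000..01099}` pattern in `--train_shards_path_or_url` is brace notation for a numbered range of shards. The sketch below expands such a range by hand to show what the pattern stands for. `expand_numeric_range` is a hypothetical helper, not a WebDataset API, and it only handles a single zero-padded numeric range.

```python
import re


def expand_numeric_range(pattern):
    # Expand one "{start..end}" group, preserving the zero padding of `start`.
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)
    prefix, suffix = pattern[: match.start()], pattern[match.end() :]
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(int(start), int(end) + 1)]


shards = expand_numeric_range("data/{00000..01099}.tar")
print(len(shards), shards[0], shards[-1])
```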

Once training is complete, you can use your new LCM for inference.

```py
from diffusers import UNet2DConditionModel, DiffusionPipeline, LCMScheduler
import torch

unet = UNet2DConditionModel.from_pretrained("your-username/your-model", torch_dtype=torch.float16, variant="fp16")
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", unet=unet, torch_dtype=torch.float16, variant="fp16")

# swap in the LCM scheduler, reusing the pipeline's existing scheduler config
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
pipeline.to("cuda")

prompt = "sushi rolls in the form of panda heads, sushi platter"

image = pipeline(prompt, num_inference_steps=4, guidance_scale=1.0).images[0]
```

## LoRA

LoRA is a training technique for significantly reducing the number of trainable parameters. As a result, training is faster and it is easier to store the resulting weights because they are a lot smaller (~100 MB). Use the [train_lcm_distill_lora_sd_wds.py](https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_lora_sd_wds.py) or [train_lcm_distill_lora_sdxl_wds.py](https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_lora_sdxl_wds.py) script to train with LoRA.

The LoRA training script is discussed in more detail in the [LoRA training](lora) guide.

## Stable Diffusion XL

Stable Diffusion XL (SDXL) is a powerful text-to-image model that generates high-resolution images, and it adds a second text encoder to its architecture. Use the [train_lcm_distill_sdxl_wds.py](https://github.com/huggingface/diffusers/blob/main/examples/consistency_distillation/train_lcm_distill_sdxl_wds.py) script to distill an SDXL model.

The SDXL training script is discussed in more detail in the [SDXL training](sdxl) guide.

## Next steps

Congratulations on distilling an LCM! To learn more about LCMs, the following may be helpful:

- Learn how to use [LCMs for inference](../using-diffusers/lcm) for text-to-image, image-to-image, and with LoRA checkpoints.
- Read the [SDXL in 4 steps with Latent Consistency LoRAs](https://huggingface.co/blog/lcm_lora) blog post to learn more about SDXL LCM-LoRAs for super fast inference, quality comparisons, benchmarks, and more.
