Commit d0d3e24

Textual inversion (huggingface#266)
* add textual inversion script * make the loop work * make coarse_loss optional * save pipeline after training * add arg pretrained_model_name_or_path * fix saving * fix gradient_accumulation_steps * style * fix progress bar steps * scale lr * add argument to accept style * remove unused args * scale lr using num gpus * load tokenizer using args * add checks when converting init token to id * improve commnets and style * document args * more cleanup * fix default adamw arsg * TextualInversionWrapper -> CLIPTextualInversionWrapper * fix tokenizer loading * Use the CLIPTextModel instead of wrapper * clean dataset * remove commented code * fix accessing grads for multi-gpu * more cleanup * fix saving on multi-GPU * init_placeholder_token_embeds * add seed * fix flip * fix multi-gpu * add utility methods in wrapper * remove ipynb * don't use wrapper * dont pass vae an dunet to accelerate prepare * bring back accelerator.accumulate * scale latents * use only one progress bar for steps * push_to_hub at the end of training * remove unused args * log some important stats * store args in tensorboard * pretty comments * save the trained embeddings * mobe the script up * add requirements file * more cleanup * fux typo * begin readme * style -> learnable_property * keep vae and unet in eval mode * address review comments * address more comments * removed unused args * add train command in readme * update readme
1 parent 5164c9f commit d0d3e24

File tree

3 files changed: +661 -0 lines changed

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
## Textual Inversion fine-tuning example

[Textual inversion](https://arxiv.org/abs/2208.01618) is a method to personalize text2image models like Stable Diffusion on your own images using just 3-5 examples.

The `textual_inversion.py` script shows how to implement the training procedure and adapt it for Stable Diffusion.
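
Under the hood, the script registers a new placeholder token (e.g. `<cat-toy>`) in the CLIP tokenizer, initializes its embedding from an existing initializer token (e.g. `toy`), and then optimizes only that single embedding row while the VAE and UNet stay frozen in eval mode. The snippet below is a minimal, illustrative sketch of that setup; the names and exact calls are assumptions, and the authoritative implementation is `textual_inversion.py`.

```python
# Rough sketch of the token/embedding setup performed before training (not the full script).
from transformers import CLIPTextModel, CLIPTokenizer

pretrained_model = "CompVis/stable-diffusion-v1-4"
placeholder_token = "<cat-toy>"   # new "word" we want to teach the model
initializer_token = "toy"         # existing word used to seed the new embedding

tokenizer = CLIPTokenizer.from_pretrained(pretrained_model, subfolder="tokenizer", use_auth_token=True)
text_encoder = CLIPTextModel.from_pretrained(pretrained_model, subfolder="text_encoder", use_auth_token=True)

# Register the placeholder token and grow the embedding matrix to make room for it.
num_added_tokens = tokenizer.add_tokens(placeholder_token)
assert num_added_tokens > 0, "The placeholder token already exists in the tokenizer."
text_encoder.resize_token_embeddings(len(tokenizer))

# Start the new embedding from the initializer token so training begins near a related concept.
placeholder_token_id = tokenizer.convert_tokens_to_ids(placeholder_token)
initializer_token_id = tokenizer.convert_tokens_to_ids(initializer_token)
token_embeds = text_encoder.get_input_embeddings().weight.data
token_embeds[placeholder_token_id] = token_embeds[initializer_token_id].clone()

# During training, only this one embedding row receives gradient updates;
# the VAE and UNet are kept frozen.
```
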
### Installing the dependencies

Before running the scripts, make sure to install the library's training dependencies:

```bash
pip install diffusers[training] accelerate transformers
```

And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:

```bash
accelerate config
```
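
If you prefer to skip the interactive prompts and start from a sensible single-machine default, recent versions of 🤗 Accelerate can also write a basic config from Python (availability depends on your `accelerate` version):

```python
# Writes a default accelerate config file for this machine; roughly equivalent to
# accepting the defaults in `accelerate config`. Requires a recent accelerate release.
from accelerate.utils import write_basic_config

write_basic_config()
```
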

### Cat toy example

You need to accept the model license before downloading or using the weights. In this example we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license and tick the checkbox if you agree.

You have to be a registered user on the 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).

Run the following command to authenticate your token:

```bash
huggingface-cli login
```
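
If you are working from a notebook rather than a terminal, the same login can be done in Python via `huggingface_hub` (an equivalent alternative, not something the training script requires):

```python
# Opens an interactive login prompt and stores the token locally,
# just like `huggingface-cli login`.
from huggingface_hub import notebook_login

notebook_login()
```
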

If you have already cloned the repo, then you won't need to go through these steps. You can simply remove the `--use_auth_token` arg from the following command.

<br>

Now let's get our dataset. Download 3-4 images from [here](https://drive.google.com/drive/folders/1fmJMs25nxS_rSNqS5hTcRdLem_YQXbq5) and save them in a directory. This will be our training data.
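
As an optional sanity check, you can make sure every file in that directory opens as an image before launching training (the directory name below is just the placeholder used further down):

```python
# Verify the training images load as RGB, which is how the script will consume them.
from pathlib import Path

from PIL import Image

data_dir = Path("path-to-dir-containing-images")
for image_path in sorted(data_dir.iterdir()):
    image = Image.open(image_path).convert("RGB")
    print(image_path.name, image.size)
```
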

And launch the training using:

```bash
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="path-to-dir-containing-images"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME --use_auth_token \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=2 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat"
```
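
Note that `--scale_lr` rescales the learning rate passed on the command line by the gradient accumulation steps, the batch size, and the number of processes, so the value above is not the rate the optimizer actually sees. A rough back-of-the-envelope check for the command above (the exact formula lives in `textual_inversion.py`):

```python
# Effective learning rate with --scale_lr for the command above (assumed scaling; see the script).
learning_rate = 5.0e-04
gradient_accumulation_steps = 2
train_batch_size = 1
num_processes = 1  # number of GPUs configured via `accelerate config`

effective_lr = learning_rate * gradient_accumulation_steps * train_batch_size * num_processes
print(effective_lr)  # 0.001 on a single GPU
```
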
A full training run takes ~1 hour on one V100 GPU.

### Inference

Once you have trained a model using the above command, inference can be done simply using the `StableDiffusionPipeline`. Make sure to include the `placeholder_token` in your prompt.

```python
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

model_id = "path-to-your-trained-model"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A <cat-toy> backpack"

with autocast("cuda"):
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5)["sample"][0]

image.save("cat-backpack.png")
```
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
accelerate
torchvision
transformers

0 commit comments
