
Commit 867a217

add: inversion to pix2pix zero docs. (huggingface#2398)
* add: inversion to pix2pix zero docs.
* add: comment to emphasize the use of flan to generate.
* more nits.
1 parent 0c0bb08 commit 867a217

1 file changed (+72, -3)

docs/source/en/api/pipelines/stable_diffusion/pix2pix_zero.mdx

Lines changed: 72 additions & 3 deletions
@@ -28,6 +28,7 @@ Resources:

## Tips

* The pipeline can be conditioned on real input images. Check out the code examples below to learn more.
* The pipeline exposes two arguments, namely `source_embeds` and `target_embeds`,
that let you control the direction of the semantic edits in the final image to be generated. Let's say
you wanted to translate from "cat" to "dog". In this case, the edit direction will be "cat -> dog". To reflect
@@ -51,7 +52,7 @@ paper](https://arxiv.org/abs/2302.03027). Below, we also provide some directions

## Usage example

-**Based on an image generated with the input prompt**
+### Based on an image generated with the input prompt

```python
import requests
@@ -93,9 +94,77 @@ images = pipeline(
images[0].save("edited_image_dog.png")
```

-**Based on an input image**
+### Based on an input image

-_Coming soon_
When the pipeline is conditioned on an input image, we first obtain an inverted
noise from it using a `DDIMInverseScheduler` with the help of a generated caption. Then
the inverted noise is used to start the generation process.

First, let's load our pipeline:

```py
import torch
from transformers import BlipForConditionalGeneration, BlipProcessor
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionPix2PixZeroPipeline

captioner_id = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(captioner_id)
model = BlipForConditionalGeneration.from_pretrained(captioner_id, torch_dtype=torch.float16, low_cpu_mem_usage=True)

sd_model_ckpt = "CompVis/stable-diffusion-v1-4"
pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained(
    sd_model_ckpt,
    caption_generator=model,
    caption_processor=processor,
    torch_dtype=torch.float16,
    safety_checker=None,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()
```
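
`enable_model_cpu_offload()` keeps the sub-models on the CPU and moves each one to the GPU only while it is being used, which lowers peak memory at some cost in speed. If memory is not a concern, moving the whole pipeline to the GPU should work as well (a small variation on the snippet above, not part of the original example):

```py
# Alternative to CPU offloading when enough GPU memory is available:
# pipeline = pipeline.to("cuda")
```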

Then, we load an input image for conditioning and obtain a suitable caption for it:

```py
import requests
from PIL import Image

img_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/test_images/cats/cat_6.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB").resize((512, 512))
caption = pipeline.generate_caption(raw_image)
```
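
The caption is an ordinary string, so it is worth printing before moving on; if it misses something important about the image, you can also hand `invert` your own description instead. A minimal optional check (the hand-written caption below is only a hypothetical example):

```py
# Optional: inspect the caption BLIP produced for the image.
print(caption)

# If it is off the mark, a manually written description can be used instead, e.g.:
# caption = "a photo of a cat sitting in front of a window"
```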

Then we employ the generated caption and the input image to get the inverted noise:

```py
inv_latents = pipeline.invert(caption, image=raw_image).latents
```
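
To sanity-check the inversion, note that the returned latents live in Stable Diffusion's latent space, so their shape follows the usual `(batch, 4, height / 8, width / 8)` pattern; a quick check under the 512x512 input used above:

```py
# For the 512x512 input image above, the inverted noise should have shape (1, 4, 64, 64).
print(inv_latents.shape)
```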

Now, generate the image with edit directions:

```py
# See the "Generating source and target embeddings" section below to
# automate the generation of these captions with a pre-trained model like Flan-T5.
source_prompts = ["a cat sitting on the street", "a cat playing in the field", "a face of a cat"]
target_prompts = ["a dog sitting on the street", "a dog playing in the field", "a face of a dog"]

source_embeds = pipeline.get_embeds(source_prompts, batch_size=2)
target_embeds = pipeline.get_embeds(target_prompts, batch_size=2)

# Fix the random seed so the edit is reproducible.
generator = torch.manual_seed(0)

image = pipeline(
    caption,
    source_embeds=source_embeds,
    target_embeds=target_embeds,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
    generator=generator,
    latents=inv_latents,
    negative_prompt=caption,
).images[0]
image.save("edited_image.png")
```
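
Since the inverted latents depend only on the input image and its caption, they can be reused for further edit directions without re-running the inversion. A rough sketch under the same setup (the "tiger" prompts below are made up for illustration and are not part of the original docs):

```py
# Hypothetical second edit that reuses the same inverted latents: "cat -> tiger".
tiger_prompts = ["a tiger sitting on the street", "a tiger playing in the field", "a face of a tiger"]
tiger_embeds = pipeline.get_embeds(tiger_prompts, batch_size=2)

tiger_image = pipeline(
    caption,
    source_embeds=source_embeds,
    target_embeds=tiger_embeds,
    num_inference_steps=50,
    cross_attention_guidance_amount=0.15,
    generator=generator,
    latents=inv_latents,
    negative_prompt=caption,
).images[0]
tiger_image.save("edited_image_tiger.png")
```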

## Generating source and target embeddings
