@@ -28,6 +28,7 @@ Resources:

## Tips

+* The pipeline can be conditioned on real input images. Check out the code examples below to learn more.
* The pipeline exposes two arguments, `source_embeds` and `target_embeds`,
that let you control the direction of the semantic edits in the final generated image. Let's say
you wanted to translate from "cat" to "dog". In this case, the edit direction will be "cat -> dog". To reflect
@@ -51,7 +52,7 @@ paper](https://arxiv.org/abs/2302.03027). Below, we also provide some directions

## Usage example

-**Based on an image generated with the input prompt**
+### Based on an image generated with the input prompt

```python
import requests
@@ -93,9 +94,77 @@ images = pipeline(
images[0].save("edited_image_dog.png")
```

-**Based on an input image**
+### Based on an input image

-_Coming soon_
+When the pipeline is conditioned on an input image, we first obtain inverted
+noise from it using a `DDIMInverseScheduler`, guided by a caption generated for the image.
+The inverted noise is then used to start the generation process.
+
+First, let's load our pipeline:
+
+```py
+import torch
+from transformers import BlipForConditionalGeneration, BlipProcessor
+from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionPix2PixZeroPipeline
+
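+# The BLIP captioning model below describes the input image; the pipeline uses this
+# caption as the prompt when inverting the image.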
+captioner_id = "Salesforce/blip-image-captioning-base"
+processor = BlipProcessor.from_pretrained(captioner_id)
+model = BlipForConditionalGeneration.from_pretrained(captioner_id, torch_dtype=torch.float16, low_cpu_mem_usage=True)
+
+sd_model_ckpt = "CompVis/stable-diffusion-v1-4"
+pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained(
+    sd_model_ckpt,
+    caption_generator=model,
+    caption_processor=processor,
+    torch_dtype=torch.float16,
+    safety_checker=None,
+)
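+
+# DDIMScheduler drives the denoising during generation; DDIMInverseScheduler, built from the
+# same config, maps the real input image back to the noise that generation starts from.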
+pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
+pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
+pipeline.enable_model_cpu_offload()
+```
+
+Then, we load an input image for conditioning and obtain a suitable caption for it:
+
+```py
+import requests
+from PIL import Image
+
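+# Download a real input image and resize it to 512x512, the resolution Stable Diffusion v1 works at.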
+img_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/test_images/cats/cat_6.png"
+raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB").resize((512, 512))
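+
+# Let BLIP caption the image; the caption is reused as the prompt for inversion and generation.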
+caption = pipeline.generate_caption(raw_image)
+```
+
+Then we employ the generated caption and the input image to get the inverted noise:
+
+```py
+inv_latents = pipeline.invert(caption, image=raw_image).latents
+```
+
+Now, generate the image with the edit directions:
+
+```py
+# See the "Generating source and target embeddings" section below to
+# automate the generation of these captions with a pre-trained model like Flan-T5.
+source_prompts = ["a cat sitting on the street", "a cat playing in the field", "a face of a cat"]
+target_prompts = ["a dog sitting on the street", "a dog playing in the field", "a face of a dog"]
+
+source_embeds = pipeline.get_embeds(source_prompts, batch_size=2)
+target_embeds = pipeline.get_embeds(target_prompts, batch_size=2)
+
+# Fix the random seed so the edit is reproducible.
+generator = torch.manual_seed(0)
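+
+# Start generation from the inverted latents; the caption serves as both the prompt and the
+# negative prompt, while the source/target embeddings define the "cat -> dog" edit direction.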
+image = pipeline(
+    caption,
+    source_embeds=source_embeds,
+    target_embeds=target_embeds,
+    num_inference_steps=50,
+    cross_attention_guidance_amount=0.15,
+    generator=generator,
+    latents=inv_latents,
+    negative_prompt=caption,
+).images[0]
+image.save("edited_image.png")
+```
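+
+Since the inverted latents depend only on the input image and its caption, they can be reused for
+additional edit directions without re-running the inversion. Here is a minimal sketch of that idea;
+the "cat -> tiger" prompts are purely illustrative:
+
+```py
+# Reuse the inverted latents with a different, illustrative edit direction.
+tiger_prompts = ["a tiger sitting on the street", "a tiger playing in the field", "a face of a tiger"]
+tiger_embeds = pipeline.get_embeds(tiger_prompts, batch_size=2)
+
+image_tiger = pipeline(
+    caption,
+    source_embeds=source_embeds,  # the same "cat" source embeddings as above
+    target_embeds=tiger_embeds,
+    num_inference_steps=50,
+    cross_attention_guidance_amount=0.15,
+    generator=torch.manual_seed(0),
+    latents=inv_latents,
+    negative_prompt=caption,
+).images[0]
+image_tiger.save("edited_image_tiger.png")
+```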

## Generating source and target embeddings
