
Commit 1f22c98

stevhliu and sayakpaul authored

[docs] IP-Adapter image embedding (huggingface#7226)

* update
* fix parameter name
* feedback
* add no mask version

Co-authored-by: Sayak Paul <[email protected]>

1 parent b4226bd commit 1f22c98

File tree

3 files changed: +103 -88 lines changed

docs/source/en/api/loaders/ip_adapter.md

Lines changed: 4 additions & 0 deletions
@@ -23,3 +23,7 @@ Learn how to load an IP-Adapter checkpoint and image in the IP-Adapter [loading]
 ## IPAdapterMixin
 
 [[autodoc]] loaders.ip_adapter.IPAdapterMixin
+
+## IPAdapterMaskProcessor
+
+[[autodoc]] image_processor.IPAdapterMaskProcessor

docs/source/en/using-diffusers/ip_adapter.md

Lines changed: 98 additions & 87 deletions
@@ -25,6 +25,9 @@ Let's take a look at how to use IP-Adapter's image prompting capabilities with t
 
 In all the following examples, you'll see the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] method. This method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results.
 
+> [!TIP]
+> In the examples below, try adding `low_cpu_mem_usage=True` to the [`~loaders.IPAdapterMixin.load_ip_adapter`] method to speed up the loading time.
+
 <hfoptions id="tasks">
 <hfoption id="Text-to-image">
 
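To make the scale setting concrete, here's a minimal sketch (not part of this commit) of loading an IP-Adapter and setting its scale; the SDXL base checkpoint and weight name are assumptions based on the h94/IP-Adapter repository used elsewhere in this guide.

```py
# A minimal sketch: load an IP-Adapter and set its conditioning scale.
# The checkpoint and weight names are assumptions, not part of this commit.
import torch
from diffusers import AutoPipelineForText2Image

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter_sdxl.safetensors",
    low_cpu_mem_usage=True,  # per the tip above: speeds up adapter loading
)
pipeline.set_ip_adapter_scale(0.5)  # 1.0 = image prompt only; 0.5 balances text and image
```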
@@ -231,10 +234,21 @@ export_to_gif(frames, "gummy_bear.gif")
 </hfoption>
 </hfoptions>
 
+## Configure parameters
+
+There are a couple of IP-Adapter parameters that are useful to know about and can help you with your image generation tasks. These parameters can make your workflow more efficient or give you more control over image generation.
+
+### Image embeddings
+
+IP-Adapter-enabled pipelines provide the `ip_adapter_image_embeds` parameter to accept precomputed image embeddings. This is particularly useful when you need to run the IP-Adapter pipeline multiple times because you have more than one image. For example, [multi IP-Adapter](#multi-ip-adapter) is a use case where you provide multiple styling images to generate an image in a specific style. Loading and encoding multiple images each time you use the pipeline would be inefficient. Instead, you can precompute the image embeddings, save them to disk (which can also save a lot of space if you're using high-quality images), and load them when you need them.
+
 > [!TIP]
-> While calling `load_ip_adapter()`, pass `low_cpu_mem_usage=True` to speed up the loading time.
+> This parameter also gives you the flexibility to load embeddings from other sources. For example, ComfyUI image embeddings for IP-Adapters are compatible with Diffusers and should work out-of-the-box!
+
+Call the [`~StableDiffusionPipeline.prepare_ip_adapter_image_embeds`] method to encode and generate the image embeddings. Then you can save them to disk with `torch.save`.
 
-All the pipelines supporting IP-Adapter accept a `ip_adapter_image_embeds` argument. If you need to run the IP-Adapter multiple times with the same image, you can encode the image once and save the embedding to the disk.
+> [!TIP]
+> If you're using IP-Adapter with `ip_adapter_image_embeds` instead of `ip_adapter_image`, you can set `load_ip_adapter(image_encoder_folder=None, ...)` because the image encoder isn't needed once the embeddings are generated.
 
 ```py
 image_embeds = pipeline.prepare_ip_adapter_image_embeds(
@@ -248,10 +262,7 @@ image_embeds = pipeline.prepare_ip_adapter_image_embeds(
 torch.save(image_embeds, "image_embeds.ipadpt")
 ```
 
-Load the image embedding and pass it to the pipeline as `ip_adapter_image_embeds`
-
-> [!TIP]
-> ComfyUI image embeddings for IP-Adapters are fully compatible in Diffusers and should work out-of-box.
+Now load the image embeddings by passing them to the `ip_adapter_image_embeds` parameter.
 
 ```py
 image_embeds = torch.load("image_embeds.ipadpt")
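For context, the full precompute-save-reload cycle might look like the sketch below. The argument list of `prepare_ip_adapter_image_embeds` is elided by the hunk above, so the keyword values shown here, along with the hypothetical `image` input and prompt, are assumptions rather than part of the commit.

```py
# Sketch of precomputing, saving, and reloading IP-Adapter image embeddings.
# `image` is a hypothetical PIL image; all keyword values are assumptions.
image_embeds = pipeline.prepare_ip_adapter_image_embeds(
    ip_adapter_image=image,
    ip_adapter_image_embeds=None,
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)
torch.save(image_embeds, "image_embeds.ipadpt")

# Later: reload the embeddings and generate without re-encoding the image.
image_embeds = torch.load("image_embeds.ipadpt")
images = pipeline(
    prompt="a bear sitting in a chair drinking a milkshake",
    ip_adapter_image_embeds=image_embeds,
    num_inference_steps=50,
).images
```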
@@ -264,8 +275,86 @@ images = pipeline(
 ).images
 ```
 
-> [!TIP]
-> If you use IP-Adapter with `ip_adapter_image_embedding` instead of `ip_adapter_image`, you can choose not to load an image encoder by passing `image_encoder_folder=None` to `load_ip_adapter()`.
+### IP-Adapter masking
+
+Binary masks specify which portion of the output image should be assigned to an IP-Adapter. This is useful for composing more than one IP-Adapter image. For each input IP-Adapter image, you must provide a binary mask and an IP-Adapter.
+
+To start, preprocess the input masks with [`~image_processor.IPAdapterMaskProcessor.preprocess()`] to generate the mask tensors. For optimal results, provide the output height and width to [`~image_processor.IPAdapterMaskProcessor.preprocess()`]. This ensures masks with different aspect ratios are appropriately stretched. If the input masks already match the aspect ratio of the generated image, you don't have to set the `height` and `width`.
+
+```py
+from diffusers.image_processor import IPAdapterMaskProcessor
+
+mask1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png")
+mask2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask2.png")
+
+output_height = 1024
+output_width = 1024
+
+processor = IPAdapterMaskProcessor()
+masks = processor.preprocess([mask1, mask2], height=output_height, width=output_width)
+```
+
+<div class="flex flex-row gap-4">
+  <div class="flex-1">
+    <img class="rounded-xl" src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_mask1.png"/>
+    <figcaption class="mt-2 text-center text-sm text-gray-500">mask one</figcaption>
+  </div>
+  <div class="flex-1">
+    <img class="rounded-xl" src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_mask2.png"/>
+    <figcaption class="mt-2 text-center text-sm text-gray-500">mask two</figcaption>
+  </div>
+</div>
+
+When there is more than one input IP-Adapter image, load them as a list to ensure each image is assigned to a different IP-Adapter. Each of the input IP-Adapter images here corresponds to one of the masks generated above.
+
+```py
+face_image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")
+face_image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl2.png")
+
+ip_images = [[face_image1], [face_image2]]
+```
+
+<div class="flex flex-row gap-4">
+  <div class="flex-1">
+    <img class="rounded-xl" src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_girl1.png"/>
+    <figcaption class="mt-2 text-center text-sm text-gray-500">IP-Adapter image one</figcaption>
+  </div>
+  <div class="flex-1">
+    <img class="rounded-xl" src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_girl2.png"/>
+    <figcaption class="mt-2 text-center text-sm text-gray-500">IP-Adapter image two</figcaption>
+  </div>
+</div>
+
+Now pass the preprocessed masks to `cross_attention_kwargs` in the pipeline call.
+
+```py
+pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"] * 2)
+pipeline.set_ip_adapter_scale([0.7] * 2)
+generator = torch.Generator(device="cpu").manual_seed(0)
+num_images = 1
+
+image = pipeline(
+    prompt="2 girls",
+    ip_adapter_image=ip_images,
+    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
+    num_inference_steps=20,
+    num_images_per_prompt=num_images,
+    generator=generator,
+    cross_attention_kwargs={"ip_adapter_masks": masks}
+).images[0]
+image
+```
+
+<div class="flex flex-row gap-4">
+  <div class="flex-1">
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_attention_mask_result_seed_0.png"/>
+    <figcaption class="mt-2 text-center text-sm text-gray-500">IP-Adapter masking applied</figcaption>
+  </div>
+  <div class="flex-1">
+    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_no_attention_mask_result_seed_0.png"/>
+    <figcaption class="mt-2 text-center text-sm text-gray-500">no IP-Adapter masking applied</figcaption>
+  </div>
+</div>
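The masking snippets above assume a pipeline and the `load_image` helper from earlier sections of the guide; for a self-contained run, a setup sketch might look like this (the SDXL base checkpoint is an assumption):

```py
# Setup sketch for the masking example above; the base checkpoint is an
# assumption, and `load_image` comes from diffusers.utils.
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
```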
 
 ## Specific use cases
 
@@ -279,6 +368,7 @@ Generating accurate faces is challenging because they are complex and nuanced. D
 * [ip-adapter-plus-face_sd15.safetensors](https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter-plus-face_sd15.safetensors) uses patch embeddings and is conditioned with images of cropped faces
 
 > [!TIP]
+>
 > [IP-Adapter-FaceID](https://huggingface.co/h94/IP-Adapter-FaceID) is a face-specific IP-Adapter trained with face ID embeddings instead of CLIP image embeddings, allowing you to generate more consistent faces in different contexts and styles. Try out this popular [community pipeline](https://github.com/huggingface/diffusers/tree/main/examples/community#ip-adapter-face-id) and see how it compares to the other face IP-Adapters.
 
 For face models, use the [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter) checkpoint. It is also recommended to use [`DDIMScheduler`] or [`EulerDiscreteScheduler`] for face models.
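Since the line above recommends [`DDIMScheduler`] or [`EulerDiscreteScheduler`] for face models, a scheduler swap might look like this short sketch (reusing the existing scheduler config is the usual Diffusers idiom):

```py
# Sketch: swap in DDIMScheduler for a face model, keeping the pipeline's
# existing scheduler configuration.
from diffusers import DDIMScheduler

pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
```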
@@ -502,82 +592,3 @@ image
 <div class="flex justify-center">
     <img src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ipa-controlnet-out.png" />
 </div>
-
-### IP-Adapter masking
-
-Binary masks can be used to specify which portion of the output image should be assigned to an IP-Adapter.
-For each input IP-Adapter image, a binary mask and an IP-Adapter must be provided.
-
-Before passing the masks to the pipeline, it's essential to preprocess them using [`IPAdapterMaskProcessor.preprocess()`].
-
-> [!TIP]
-> For optimal results, provide the output height and width to [`IPAdapterMaskProcessor.preprocess()`]. This ensures that masks with differing aspect ratios are appropriately stretched. If the input masks already match the aspect ratio of the generated image, specifying height and width can be omitted.
-
-Here an example with two masks:
-
-```py
-from diffusers.image_processor import IPAdapterMaskProcessor
-
-mask1 = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_mask1.png")
-mask2 = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_mask2.png")
-
-output_height = 1024
-output_width = 1024
-
-processor = IPAdapterMaskProcessor()
-masks = processor.preprocess([mask1, mask2], height=output_height, width=output_width)
-```
-
-<div class="flex flex-row gap-4">
-  <div class="flex-1">
-    <img class="rounded-xl" src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_mask1.png"/>
-    <figcaption class="mt-2 text-center text-sm text-gray-500">mask one</figcaption>
-  </div>
-  <div class="flex-1">
-    <img class="rounded-xl" src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_mask2.png"/>
-    <figcaption class="mt-2 text-center text-sm text-gray-500">mask two</figcaption>
-  </div>
-</div>
-
-If you have more than one IP-Adapter image, load them into a list, ensuring each image is assigned to a different IP-Adapter.
-
-```py
-face_image1 = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_girl1.png")
-face_image2 = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_girl2.png")
-
-ip_images = [[face_image1], [face_image2]]
-```
-
-<div class="flex flex-row gap-4">
-  <div class="flex-1">
-    <img class="rounded-xl" src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_girl1.png"/>
-    <figcaption class="mt-2 text-center text-sm text-gray-500">ip adapter image one</figcaption>
-  </div>
-  <div class="flex-1">
-    <img class="rounded-xl" src="https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ip_mask_girl2.png"/>
-    <figcaption class="mt-2 text-center text-sm text-gray-500">ip adapter image two</figcaption>
-  </div>
-</div>
-
-Pass preprocessed masks to the pipeline using `cross_attention_kwargs` as shown below:
-
-```py
-pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"] * 2)
-pipeline.set_ip_adapter_scale([0.7] * 2)
-generator = torch.Generator(device="cpu").manual_seed(0)
-num_images = 1
-
-image = pipeline(
-    prompt="2 girls",
-    ip_adapter_image=ip_images,
-    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
-    num_inference_steps=20, num_images_per_prompt=num_images,
-    generator=generator, cross_attention_kwargs={"ip_adapter_masks": masks}
-).images[0]
-image
-```
-
-<div class="flex justify-center">
-    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_attention_mask_result_seed_0.png" />
-    <figcaption class="mt-2 text-center text-sm text-gray-500">output image</figcaption>
-</div>

src/diffusers/loaders/ip_adapter.py

Lines changed: 1 addition & 1 deletion
@@ -215,7 +215,7 @@ def load_ip_adapter(
         else:
             logger.warning(
                 "image_encoder is not loaded since `image_encoder_folder=None` passed. You will not be able to use `ip_adapter_image` when calling the pipeline with IP-Adapter."
-                "Use `ip_adapter_image_embedding` to pass pre-geneated image embedding instead."
+                "Use `ip_adapter_image_embeds` to pass pre-generated image embedding instead."
             )
 
     # create feature extractor if it has not been registered to the pipeline yet
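The corrected warning fires when IP-Adapter weights are loaded without an image encoder; a sketch of the call that triggers it (the SD 1.5 weight name is an assumption based on the h94/IP-Adapter repository):

```py
# Sketch: loading IP-Adapter weights without an image encoder logs the
# warning fixed above; generation must then use `ip_adapter_image_embeds`.
pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="models",
    weight_name="ip-adapter_sd15.safetensors",
    image_encoder_folder=None,  # skip the encoder; rely on precomputed embeds
)
```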
