[ip-adapter] refactor prepare_ip_adapter_image_embeds and skip load image_encoder (#7016)
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
cc @asomoza feel free to give a review :)
If you could also describe how "diffusers_style_test.ipadpt" was generated, I think that'd be a useful reference for the community.
```py
if image_encoder_folder is not None:
    if not isinstance(pretrained_model_name_or_path_or_dict, dict):
        logger.info(f"loading image_encoder from {pretrained_model_name_or_path_or_dict}")
        if image_encoder_folder.count("/") == 0:
```
sayakpaul
left a comment
Looking nice!
I think it'd be nice to add:
- A test
- A section in https://huggingface.co/docs/diffusers/main/en/using-diffusers/ip_adapter about this and about the support for passing `ip_adapter_image_embeds`
Nice that this also solves the problem with the different image encoders for SDXL. I'm still testing and everything works, except when I try to load the embeddings from comfyui I get a shape error. I'm trying to figure out where the difference is between the image embeddings from diffusers and the ones from comfyui.

edit: This is not a problem with this PR though; it probably needs some kind of conversion.
Just did a:

```py
image_embeds = prepare_ip_adapter_image_embeds(
    unet=pipeline.unet,
    image_encoder=pipeline.image_encoder,
    feature_extractor=pipeline.feature_extractor,
    ip_adapter_image=[[image_one, image_two]],
    do_classifier_free_guidance=True,
    device="cuda",
    num_images_per_prompt=1,
)
torch.save(image_embeds, "diffusers_style_test.ipadpt")
```
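For completeness, a saved file like that can be loaded back and reused without the image encoder. This is a minimal sketch of the round trip; the shapes here (one tensor of `(2 * batch, num_images, embed_dim)`, negative embeds concatenated for classifier-free guidance) are illustrative assumptions, not taken from the PR:

```python
import os
import tempfile

import torch

# Stand-in for the list returned by prepare_ip_adapter_image_embeds
# (shape is an assumption for illustration).
image_embeds = [torch.randn(2, 2, 1280)]

path = os.path.join(tempfile.mkdtemp(), "diffusers_style_test.ipadpt")
torch.save(image_embeds, path)

# Later, or in another process: load and pass via ip_adapter_image_embeds,
# e.g. pipeline(..., ip_adapter_image_embeds=loaded)
loaded = torch.load(path)
assert all(torch.equal(a, b) for a, b in zip(image_embeds, loaded))
```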
I found another issue while doing some tests. I didn't notice it before since I don't use multiple images per prompt, but the code right now expects the embeddings to match this argument, for example in the SDXL pipeline (src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py, line 573 at 12004bf).

This ties the embeddings to the number of images per prompt, which is not ideal unless you keep track of it. I see two solutions:

1. Document this and leave it to the users.
2. Handle the `num_images_per_prompt` repetition inside the pipeline.

I'll wait to see your resolution, since this makes it hard to reuse the embeddings or make them compatible with other apps. As an example, if I use the same code that generates the horse but with the saved embeddings and a different number of images per prompt, the result is different.
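The second option discussed in this thread (repeating the embeddings inside the pipeline, so a saved file is independent of `num_images_per_prompt`) could look roughly like this. A minimal sketch, not the PR's actual implementation; the function name and shapes are assumptions:

```python
import torch

def expand_image_embeds(single_image_embeds: torch.Tensor, num_images_per_prompt: int) -> torch.Tensor:
    # single_image_embeds: (batch, seq_len, dim), saved once to disk.
    # The pipeline repeats each prompt's embedding at call time, so the
    # saved file does not need to encode num_images_per_prompt.
    return single_image_embeds.repeat_interleave(num_images_per_prompt, dim=0)

saved = torch.randn(1, 4, 1280)           # embedding saved once
expanded = expand_image_embeds(saved, 3)  # reused with any num_images_per_prompt
assert expanded.shape == (3, 4, 1280)
```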
Thanks for reporting!
I think we should catch these things early and report them to the users so that they can call the pipeline with the appropriate values.
What is the expected result?
AFAIK there's no way to catch this. It doesn't produce any errors; it's just that the final result is different and kind of worse than the original. The only way to make it work as intended is to match the number of images per prompt with the embeddings, so you'd also need to remember this value, put it in the filename, or save it as metadata in the file, but that also makes it only usable for that specific case. This is not a critical issue, especially for people who just use one image per prompt, but with it, I don't see the need to put any more effort into making it compatible with saved embeddings from other apps or libraries.

The expected result is the same as the first image in this PR.
I think it does warrant a deeper investigation then, into why that's the case. We should fix the root cause here IMO.
About the other issue: I think solution 2 is an easy answer, no? Any downside to it that I missed?
Yeah, for me that's the solution I would choose, but sometimes there's an underlying reason in diffusers that I don't know about which prevents it.
Ok, I will make changes based on solution 2. I think it's fine because it's consistent with what's already done in src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py (line 492 at d8d208a), but we will wait for @sayakpaul to be back next week to make a final decision on what we do.
There's another parameter that gets saved with the image embeddings and ties them up: `do_classifier_free_guidance` (src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py, line 578 at 3067da1). So if we're going to move the `num_images_per_prompt` handling inside the pipeline, this one should be considered too.
I can now make my saved embeddings work with multiple images per prompt. I was wrong about the image projection being done before the saving though, so the comfyui embeds work when passed with `ip_adapter_image_embeds`.

I can load and run the comfyui embeds like this:

```py
comfy_image_embeds = torch.load("comfyui_style_test.ipadpt")
embeds = torch.unbind(comfy_image_embeds)
single_image_embeds = embeds[0]
single_negative_image_embeds = embeds[1]
single_image_embeds = torch.stack([single_image_embeds] * num_images_per_prompt, dim=0)
single_negative_image_embeds = torch.stack([single_negative_image_embeds] * num_images_per_prompt, dim=0)
image_embeds = torch.cat([single_negative_image_embeds, single_image_embeds])
```

but there's no way to make the […]
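The loading steps above could be wrapped into a small helper. A sketch under the assumption (taken from the snippet) that the ComfyUI file stores the positive and negative embeddings stacked along dim 0, positive first; the helper name is hypothetical:

```python
import torch

def comfy_to_diffusers(comfy_embeds: torch.Tensor, num_images_per_prompt: int = 1) -> torch.Tensor:
    # comfy_embeds: (2, seq_len, dim) -- positive embeds first, negative second
    # (an assumption about the ComfyUI layout, mirroring the snippet above).
    positive, negative = torch.unbind(comfy_embeds)
    positive = torch.stack([positive] * num_images_per_prompt, dim=0)
    negative = torch.stack([negative] * num_images_per_prompt, dim=0)
    # Negative embeddings come first in the concatenated tensor when
    # classifier-free guidance is enabled.
    return torch.cat([negative, positive])

embeds = comfy_to_diffusers(torch.randn(2, 4, 1280), num_images_per_prompt=2)
assert embeds.shape == (4, 4, 1280)
```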
@asomoza it is a […]
```diff
         )
-        if self.do_classifier_free_guidance:
+        if do_classifier_free_guidance:
```
Adding `do_classifier_free_guidance` as an argument to `prepare_ip_adapter_image_embeds` so we can use this pipeline method to save image embeddings:

```py
image_embeds = pipeline.prepare_ip_adapter_image_embeds(
    ip_adapter_image=image,
    ip_adapter_image_embeds=None,
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)
torch.save(image_embeds, "image_embeds.ipadpt")
```
> [!TIP]
> While calling `load_ip_adapter()`, pass `low_cpu_mem_usage=True` to speed up the loading time.

All the pipelines supporting IP-Adapter accept an `ip_adapter_image_embeds` argument. If you need to run the IP-Adapter multiple times with the same image, you can encode the image once and save the embedding to disk.
cc @stevhliu here for awareness
I added a section to the ip-adapter guide here. Let me know if you have any comments. If editing in a separate PR is easier, feel free to do so!
Another very good use case for `ip_adapter_image_embeds` is probably the multi-ip-adapter: https://huggingface.co/docs/diffusers/main/en/using-diffusers/ip_adapter#multi-ip-adapter
A common practice is to use a folder of 10+ images for styling, and you would use the same styling images everywhere to create a consistent style. So it would be nice to create an image embedding for these style images once, so you don't have to load a bunch of the same images from a folder and encode them each time.
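That encode-once workflow can be sketched as a small cache around the encoding step. A sketch only: the `encode` function here is a placeholder standing in for `pipeline.prepare_ip_adapter_image_embeds(...)`, and the file name is made up:

```python
import os
import tempfile

import torch

calls = {"n": 0}

def encode(image_paths):
    # Placeholder for pipeline.prepare_ip_adapter_image_embeds(...);
    # returns a fake embedding list just to illustrate the caching.
    calls["n"] += 1
    return [torch.randn(2, len(image_paths), 1280)]

def load_or_encode(image_paths, cache_path):
    # Encode the style images once; every later run loads the cached file.
    if os.path.exists(cache_path):
        return torch.load(cache_path)
    embeds = encode(image_paths)
    torch.save(embeds, cache_path)
    return embeds

cache = os.path.join(tempfile.mkdtemp(), "style_embeds.ipadpt")
first = load_or_encode(["style_01.png", "style_02.png"], cache)
second = load_or_encode(["style_01.png", "style_02.png"], cache)
assert calls["n"] == 1  # the encoder ran only once
assert all(torch.equal(a, b) for a, b in zip(first, second))
```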
I think we should definitely add that example motivating the use case. WDYT @asomoza?
I'll edit it in a separate PR, and I can also make a mention of ip_adapter_image_embeds in the multi IP-Adapter section 🙂
> I think we should definitely add that example motivating the use case. WDYT @asomoza?
Yeah, this is especially helpful when you use a lot of images and multiple IP-Adapters; you just need to save the embeddings, which makes it a lot easier to replicate and saves a lot of space if you use high-quality images.
I'll try to do one with a style and a character and see how it goes, but to see the real potential of this we'll also need controlnet and IP-Adapter masking, so the best use case would be a full demo with all of this incorporated.
Finally finishing up this PR now. I refactored some more; feel free to give a final review.
> ComfyUI image embeddings are fully compatible with IP-Adapter in diffusers and will work out-of-the-box.

```py
image_embeds = torch.load("image_embeds.ipadpt")
```
We don't know where this file is coming from. Let's include a snippet to download it and explicitly mention that it's coming from ComfyUI.
sayakpaul
left a comment
Looking pretty solid. I left a couple of suggestions on the docs. I reviewed the changes made to ip_adapter.py and the changes in prepare_ip_adapter_image_embeds() and check_inputs from the SDXL pipeline script. I think the rest of the pipelines share these changes?
I thought we were also supporting passing the image embedding projection as well. Are we not doing so?
```py
if ip_adapter_image_embeds is not None:
    if not isinstance(ip_adapter_image_embeds, list):
        raise ValueError(
            f"`ip_adapter_image_embeds` has to be of type `list` but is {type(ip_adapter_image_embeds)}"
        )
    elif ip_adapter_image_embeds[0].ndim != 3:
        raise ValueError(
            f"`ip_adapter_image_embeds` has to be a list of 3D tensors but is {ip_adapter_image_embeds[0].ndim}D"
        )
```
Do we need any checks on the shapes to conform to what's needed for classifier-free guidance?
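One possible answer, as a sketch rather than the pipeline's actual validation: when `do_classifier_free_guidance` is enabled, the negative and positive embeddings are concatenated along the batch dimension, so each tensor's batch size must be even. The function name and message are hypothetical:

```python
import torch

def check_image_embeds(ip_adapter_image_embeds, do_classifier_free_guidance):
    # Sketch of an extra check_inputs-style validation (not the actual code).
    for embeds in ip_adapter_image_embeds:
        if embeds.ndim != 3:
            raise ValueError(f"expected a list of 3D tensors but got {embeds.ndim}D")
        if do_classifier_free_guidance and embeds.shape[0] % 2 != 0:
            raise ValueError(
                "with classifier-free guidance the batch dimension must be even "
                "(negative embeds concatenated with positive embeds)"
            )

check_image_embeds([torch.randn(2, 4, 1280)], do_classifier_free_guidance=True)  # passes
try:
    check_image_embeds([torch.randn(3, 4, 1280)], do_classifier_free_guidance=True)
    raised = False
except ValueError:
    raised = True
assert raised  # odd batch dimension is rejected under CFG
```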
Co-authored-by: Sayak Paul <[email protected]>
So it turns out the comfyUI embedding is created before the image projection layer, so we don't need to support passing the projection output directly anymore since it's too small a use case.


fix #6925

- allow passing `ip_adapter_image_embeds` directly and skip loading `image_encoder`
- we also allow specifying a different subfolder for `image_encoder_folder`: e.g. if the ip-adapter checkpoint needs to use an image_encoder that's not from the same subfolder, you do not need to load it explicitly as described here in the doc https://huggingface.co/docs/diffusers/main/en/using-diffusers/loading_adapters#ip-adapter-plus; you can specify the `image_encoder_folder` instead