[SD-XL] Ability to easily split prompt over the two text encoders #4004


Closed
bghira opened this issue Jul 9, 2023 · 10 comments · Fixed by #4156

Comments

@bghira
Contributor

bghira commented Jul 9, 2023

Is your feature request related to a problem? Please describe.
SDXL 0.9 comes with a new dual text encoder pipeline.

OpenCLIP ViT-bigG/14 and CLIP-L are both paired up in this pipeline. When running through ComfyUI, the CLIP nodes allow for inputting different pieces of the prompt to different encoders. The default configuration is like ours, and the same prompt is handed to both encoders.

However, the additional flexibility of treating the entire embed space as a single concat over the whole prompt's context drastically alters the creative results.
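For context, here is a minimal dependency-free sketch of the concat described above: SDXL joins the two encoders' per-token hidden states along the feature axis, so each token position carries both encodings. The function name and the plain-list representation are illustrative, not the diffusers API; only the dimensions (77 tokens, 768 for CLIP-L, 1280 for OpenCLIP bigG) are SDXL's actual sizes.

```python
SEQ_LEN = 77         # CLIP context length
DIM_CLIP_L = 768     # CLIP ViT-L/14 hidden size
DIM_OPENCLIP = 1280  # OpenCLIP ViT-bigG/14 hidden size

def concat_hidden_states(h_clip_l, h_openclip):
    """Join the two encoders' outputs token-by-token along the feature axis.

    Illustrative helper (not part of diffusers): each token position ends up
    with a 768 + 1280 = 2048-dim feature vector.
    """
    assert len(h_clip_l) == len(h_openclip) == SEQ_LEN
    return [a + b for a, b in zip(h_clip_l, h_openclip)]  # per-token list concat

# dummy hidden states, one zero-vector per token position
h_l = [[0.0] * DIM_CLIP_L for _ in range(SEQ_LEN)]
h_g = [[0.0] * DIM_OPENCLIP for _ in range(SEQ_LEN)]
joint = concat_hidden_states(h_l, h_g)
print(len(joint), len(joint[0]))  # 77 2048
```

Because the concat happens per token, feeding each encoder a different sub-prompt gives every token position a mixed context from both prompts, rather than two redundant encodings of the same text.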

Describe the solution you'd like
We are interested in adding optional parameters to the SDXL Base and Img2Img pipelines to allow this flexibility.

  • prompt_2 and negative_prompt_2 would be great names, as they match the naming convention of text_encoder_2/tokenizer_2
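A rough sketch of how the proposed interface might be used (these parameters were not in the pipeline at the time of this issue; the `split_prompt` helper and the `BREAK` separator are hypothetical conventions for this example, and the pipeline call is left commented out since it needs model weights and a GPU):

```python
def split_prompt(prompt, sep=" BREAK "):
    """Hypothetical helper: split 'subject BREAK style' into two sub-prompts.

    Falls back to handing the same text to both encoders when no separator
    is present, matching the pipeline's default behaviour.
    """
    subject, _, style = prompt.partition(sep)
    return subject, (style or subject)

prompt, prompt_2 = split_prompt(
    "a portrait of an astronaut BREAK oil painting, heavy impasto")

# from diffusers import StableDiffusionXLPipeline
# pipe = StableDiffusionXLPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-xl-base-0.9")
# image = pipe(
#     prompt=prompt,          # fed to text_encoder (CLIP-L)
#     prompt_2=prompt_2,      # fed to text_encoder_2 (OpenCLIP ViT-bigG/14)
# ).images[0]
```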

Describe alternatives you've considered

  • Creating a custom pipeline, which does not force-multiply our efforts.
@patrickvonplaten
Contributor

patrickvonplaten commented Jul 9, 2023

Hey @bghira,

Can you show me an example of where providing different text prompts for each text encoder gives much better results? Also, we allow the user to directly provide prompt embeds, so I wonder if this is not enough to cover this use case? #3995
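For readers following along, the `prompt_embeds` route mentioned above means precomputing the text embeddings yourself and passing them to the pipeline. A minimal shape sketch, using plain lists as stand-ins for tensors (the shapes shown — 77 tokens, 768 + 1280 = 2048 features, with the pooled embedding taken from the bigG encoder at 1280 — are what the SDXL pipeline expects; the pipeline call itself is illustrative and commented out):

```python
BATCH, SEQ_LEN = 1, 77

# Per-token embeddings: concat of both encoders' hidden states (768 + 1280).
prompt_embeds = [
    [[0.0] * (768 + 1280) for _ in range(SEQ_LEN)] for _ in range(BATCH)
]
# Pooled embedding: taken from the OpenCLIP ViT-bigG/14 encoder only.
pooled_prompt_embeds = [[0.0] * 1280 for _ in range(BATCH)]

print(len(prompt_embeds[0]), len(prompt_embeds[0][0]),
      len(pooled_prompt_embeds[0]))  # 77 2048 1280

# image = pipe(
#     prompt_embeds=torch.tensor(prompt_embeds),
#     pooled_prompt_embeds=torch.tensor(pooled_prompt_embeds),
# ).images[0]
```

Getting these shapes wrong is the usual source of the dimensionality errors discussed later in this thread.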

@bghira
Contributor Author

bghira commented Jul 9, 2023

Neither I nor anyone I've asked has been able to get the prompt embeds working, and IMO having a built-in way of doing this would be really beneficial, without requiring users to pull in Compel, which they may not be comfortable with.

The subject portion of the prompt in OpenCLIP, and the style in CLIP-L:
[image]

The subject portion of the prompt in CLIP-L, and the style in OpenCLIP:
[image]

The subject and style prompt in CLIP-L, with OpenCLIP as unconditional guidance:
[image]

The subject and style prompt in OpenCLIP, with CLIP-L as unconditional guidance:
[image]

Both encoders have both portions of the prompt:
[image]

@sayakpaul
Member

Spectacular results!

Neither I nor anyone I've asked has been able to get the prompt embeds working

Could you expand on this further? Do you mean that passing prompt_embeds doesn't work with our SDXL pipeline?

@patrickvonplaten
Contributor

Ok, this seems to make a lot of sense then, thanks for the results! I think it shouldn't be too difficult to support with, as you say, prompt_2 and negative_prompt_2 inputs; fine by me to add this! @bghira would you like to give the PR a try? :-)

@bghira
Contributor Author

bghira commented Jul 11, 2023

@patrickvonplaten I know my limits, and text embeds seem to be one :D I simply propose the idea for others who are willing to take it up and understand these components better.

@bghira
Contributor Author

bghira commented Jul 11, 2023

Could you expand on this further? Do you mean that passing prompt_embeds doesn't work with our SDXL pipeline?

The Compel pull request wasn't yet available when I was messing around with them. I tried extracting the relevant bits from the XL pipeline and just wasn't able to figure it out. There's not a lot of documentation at this level that makes sense to less-informed individuals like myself, so I'm never sure why I'm getting this or that dimensionality error. It's just guessing, digging in with print(f'') statements, and spending an inordinate amount of time looking at things I don't understand.

I haven't gone on to try the Compel PR yet because yesterday I was stuck on 4003 issues before I realised that the whole pipeline architecture of Diffusers has off-by-one errors. I feel like this kind of subtle behaviour is really going to bite me again when I go back into text embeds.

Ergo, it is not something I feel I can accomplish.

@patrickvonplaten
Contributor

Actually, I think we can have prompt_embeds with Compel as well as a very easy user interface with prompt_2 and negative_prompt_2 :-) So if you'd like to add this, I'm more than happy to review a PR!

@MercuryOoO

Off-topic, since I didn't find a way to send you a private message, I took the liberty of asking you here. Can you tell me what model and prompt you used to generate these images?

@bghira
Contributor Author

bghira commented Jan 29, 2024

That is my model, Terminus, though it is a much earlier version. I don't have the prompt anymore.

@MercuryOoO

Terminus

Thanks!
