Kandinsky2.2 #3903
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
@cene555 are you still working on the PR, or is it ready for review? Let us know!
great! I left some feedback, mostly nits; we can wait to make changes until @patrickvonplaten gives his review.
I think the main to-dos left are:
- add docstring examples: I've tested our doc examples for text2img, img2img and inpaint, and all work great. The only ones missing are the controlnets (see the sketch after this list)
- add `# Copied from` statements everywhere (YiYi can help with this)
- add tests and docs (YiYi can help with this)
- make `PriorEmb2Emb` work as intended: either add an `add_noise` function to the unclip scheduler or replace the unclip scheduler with ddpm (YiYi can help with this)
- I'm not too sure why we need this `try` statement - can you explain? If any refactor is needed, let me know #3903 (comment)
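For reference, here is roughly what the tested text2img doc example looks like. A sketch only: the class names and Hub repo ids below assume the post-rename naming and final checkpoint locations, both still being settled in this PR.

```python
import torch
from diffusers import KandinskyV22Pipeline, KandinskyV22PriorPipeline

# Assumed final class names and Hub repo ids (still under discussion in this PR).
pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
pipe = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
).to("cuda")

prompt = "red cat, 4k photo"

# The prior maps the text prompt to CLIP image embeddings ...
image_embeds, negative_image_embeds = pipe_prior(prompt, guidance_scale=1.0).to_tuple()

# ... and the decoder pipeline turns those embeddings into an image.
image = pipe(
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
    num_inference_steps=50,
).images[0]
image.save("cat.png")
```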
```python
@@ -136,6 +136,13 @@
    KandinskyInpaintPipeline,
    KandinskyPipeline,
    KandinskyPriorPipeline,
    Kandinsky2_2_DecoderControlnetImg2ImgPipeline,
```
@patrickvonplaten these `Decoder` pipelines are the main pipelines that contain the image generation process. This is consistent with the naming in the original repo and makes sense, I think; however, it might be a little confusing for our users. Should we rename them?
Yes, I think it would be nicer to remove `_Decoder` here.
```python
@@ -428,6 +449,46 @@ def forward(self, text_embeds: torch.FloatTensor, image_embeds: torch.FloatTensor
        return time_image_embeds + time_text_embeds

class ImageTimeEmbedding(nn.Module):
```
cool!
```python
        self,
        unet: UNet2DConditionModel,
        scheduler: DDPMScheduler,
        vae: VQModel,
```
I think we should be consistent with the 2.1 pipelines in diffusers and name it `movq`?
@cene555 is the model a movq here or just a vae?
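(For context, the Kandinsky 2.1 pipelines register the `VQModel` under the attribute name `movq`. A minimal sketch of the consistent signature, with a placeholder class name:)

```python
from diffusers import DDPMScheduler, DiffusionPipeline, UNet2DConditionModel, VQModel


class Kandinsky22DecoderPipeline(DiffusionPipeline):  # placeholder name for illustration
    def __init__(
        self,
        unet: UNet2DConditionModel,
        scheduler: DDPMScheduler,
        movq: VQModel,  # registered as `movq` to match the 2.1 pipelines, not `vae`
    ):
        super().__init__()
        self.register_modules(unet=unet, scheduler=scheduler, movq=movq)
```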
```python
            # YiYi notes: only reason this pipeline can't work with unclip scheduler is that can't pass down this argument
            # need to use DDPM scheduler instead
            # prev_timestep=prev_timestep,
```
```diff
-            # YiYi notes: only reason this pipeline can't work with unclip scheduler is that can't pass down this argument
-            # need to use DDPM scheduler instead
-            # prev_timestep=prev_timestep,
```
we can remove this note now :)
```python
class Kandinsky2_2_DecoderControlnetPipeline(DiffusionPipeline):
    """
    Pipeline for text-to-image generation using Kandinsky
```
```diff
-    Pipeline for text-to-image generation using Kandinsky
+    Pipeline for text-to-image generation using Kandinsky with ControlNet guidance.
```
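Since the to-do list above notes the ControlNet doc examples are still missing, here is a sketch of what one could look like. Class names, repo ids, and the `hint` kwarg are assumptions based on the original repo and the renames proposed in this thread:

```python
import torch
from diffusers import KandinskyV22ControlnetPipeline, KandinskyV22PriorPipeline

# Assumed class names and Hub repo ids; both were still being renamed in this PR.
pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
pipe = KandinskyV22ControlnetPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
).to("cuda")

image_embeds, negative_image_embeds = pipe_prior("a robot, 4k photo").to_tuple()

# `hint` is the conditioning image (e.g. a depth map) as a (1, 3, H, W) tensor in [0, 1];
# a random tensor stands in for a real depth map here.
hint = torch.rand(1, 3, 768, 768, dtype=torch.float16, device="cuda")

image = pipe(
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
    hint=hint,
    height=768,
    width=768,
    num_inference_steps=50,
).images[0]
```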
```python
            noise_pred, _ = noise_pred.split(latents.shape[1], dim=1)

            # compute the previous noisy sample x_t -> x_t-1
            try:
```
I don't understand this here: do we have a `scheduler.step` that does not accept a `generator` argument?
In any case we should be very clear about which schedulers this pipeline can work with, and be able to work with them.
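For reference, `DDPMScheduler.step` does accept `generator` (it uses it for the stochastic noise draw), so the loop body should be able to call it unconditionally. A self-contained sketch of the pattern, with a zero tensor standing in for the UNet output:

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(50)

generator = torch.Generator().manual_seed(0)
latents = torch.randn(1, 4, 64, 64, generator=generator)

for t in scheduler.timesteps:
    noise_pred = torch.zeros_like(latents)  # stand-in for the UNet prediction
    # `generator` is passed straight through; no try/except needed around step()
    latents = scheduler.step(noise_pred, t, latents, generator=generator).prev_sample
```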
```python
class Kandinsky2_2_DecoderImg2ImgPipeline(DiffusionPipeline):
    """
    Pipeline for text-to-image generation using Kandinsky
```
```diff
-    Pipeline for text-to-image generation using Kandinsky
+    Pipeline for image-to-image generation using Kandinsky
```
```python
        text_encoder ([`MultilingualCLIP`]):
            Frozen text-encoder.
        tokenizer ([`XLMRobertaTokenizer`]):
            Tokenizer of class
```
```diff
-        text_encoder ([`MultilingualCLIP`]):
-            Frozen text-encoder.
-        tokenizer ([`XLMRobertaTokenizer`]):
-            Tokenizer of class
```
```python
        noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)

        # get latents
        try:
```
need to make sure it either works with ddpm, or add an `add_noise` method to the unclip scheduler
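For context, here is the standard img2img latent preparation the `try` block should reduce to once a scheduler with `add_noise` is in place (`DDPMScheduler` implements it; `UnCLIPScheduler` does not at the time of this PR). Names are illustrative:

```python
from typing import Optional

import torch
from diffusers import DDPMScheduler


def prepare_img2img_latents(
    scheduler: DDPMScheduler,
    image_latents: torch.Tensor,
    timestep: torch.Tensor,
    generator: Optional[torch.Generator] = None,
) -> torch.Tensor:
    # Noise the encoded image latents to the chosen start timestep,
    # the usual img2img warm start.
    noise = torch.randn(
        image_latents.shape,
        generator=generator,
        device=image_latents.device,
        dtype=image_latents.dtype,
    )
    return scheduler.add_noise(image_latents, noise, timestep)
```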
```python
        text_encoder ([`MultilingualCLIP`]):
            Frozen text-encoder.
        tokenizer ([`XLMRobertaTokenizer`]):
            Tokenizer of class
```
```diff
-        text_encoder ([`MultilingualCLIP`]):
-            Frozen text-encoder.
-        tokenizer ([`XLMRobertaTokenizer`]):
-            Tokenizer of class
```
```python
        self.num_image_text_embeds = num_image_text_embeds
        self.image_embeds = nn.Linear(image_embed_dim, self.num_image_text_embeds * cross_attention_dim)
        self.norm = nn.LayerNorm(cross_attention_dim)
```
```suggestion
        self.norm = nn.LayerNorm(cross_attention_dim)
```
```python
@@ -26,7 +26,10 @@
from .embeddings import (
```
Changes here look all good to me! would just be nice / important to add some tests and docstrings
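On the tests point, a minimal shape test for one of the new embedding classes might look like the sketch below; the import path and the constructor signature `(image_embed_dim, time_embed_dim)` are assumptions based on the diff and may need adjusting:

```python
import torch

from diffusers.models.embeddings import ImageTimeEmbedding  # assumed import path


def test_image_time_embedding_shape():
    # Assumed constructor signature; adjust if the final implementation differs.
    embedding = ImageTimeEmbedding(image_embed_dim=32, time_embed_dim=64)
    image_embeds = torch.randn(2, 32)
    out = embedding(image_embeds)
    # The module should project CLIP image embeds into the UNet time-embedding space.
    assert out.shape == (2, 64)
```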
```python
        self.input_hint_block = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(16, 16, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(16, 32, 3, padding=1, stride=2),
            nn.SiLU(),
            nn.Conv2d(32, 32, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(32, 96, 3, padding=1, stride=2),
            nn.SiLU(),
            nn.Conv2d(96, 96, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(96, 256, 3, padding=1, stride=2),
            nn.SiLU(),
            nn.Conv2d(256, 4, 3, padding=1)
        )
```
```suggestion
        self.input_hint_block = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(16, 16, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(16, 32, 3, padding=1, stride=2),
            nn.SiLU(),
            nn.Conv2d(32, 32, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(32, 96, 3, padding=1, stride=2),
            nn.SiLU(),
            nn.Conv2d(96, 96, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(96, 256, 3, padding=1, stride=2),
            nn.SiLU(),
            nn.Conv2d(256, 4, 3, padding=1)
        )
```
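For intuition, this hint encoder maps an RGB conditioning image into the 4-channel latent space with an 8x spatial downsampling (three stride-2 convs). A quick shape check, rebuilding the block standalone:

```python
import torch
import torch.nn as nn

# Same stack as above, rebuilt standalone for a quick sanity check.
input_hint_block = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.SiLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.SiLU(),
    nn.Conv2d(16, 32, 3, padding=1, stride=2), nn.SiLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 96, 3, padding=1, stride=2), nn.SiLU(),
    nn.Conv2d(96, 96, 3, padding=1), nn.SiLU(),
    nn.Conv2d(96, 256, 3, padding=1, stride=2), nn.SiLU(),
    nn.Conv2d(256, 4, 3, padding=1),
)

hint = torch.randn(1, 3, 512, 512)  # e.g. a depth map in pixel space
out = input_hint_block(hint)
print(out.shape)  # torch.Size([1, 4, 64, 64]): 8x downsampled, 4 latent channels
```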
```python
@@ -375,6 +375,27 @@ def forward(self, text_embeds: torch.FloatTensor, image_embeds: torch.FloatTensor
```
Changes look good to me!
```python
@@ -136,6 +136,13 @@
    KandinskyInpaintPipeline,
    KandinskyPipeline,
    KandinskyPriorPipeline,
    Kandinsky2_2_DecoderControlnetImg2ImgPipeline,
```
```diff
-    Kandinsky2_2_DecoderControlnetImg2ImgPipeline,
+    KandinskyControlnetV22Img2ImgPipeline,
```
```python
    Kandinsky2_2_DecoderControlnetPipeline,
    Kandinsky2_2_DecoderImg2ImgPipeline,
    Kandinsky2_2_DecoderPipeline,
    Kandinsky2_2PriorEmb2EmbPipeline,
    Kandinsky2_2PriorPipeline,
    Kandinsky2_2_DecoderInpaintPipeline,
```
```diff
-    Kandinsky2_2_DecoderControlnetPipeline,
-    Kandinsky2_2_DecoderImg2ImgPipeline,
-    Kandinsky2_2_DecoderPipeline,
-    Kandinsky2_2PriorEmb2EmbPipeline,
-    Kandinsky2_2PriorPipeline,
-    Kandinsky2_2_DecoderInpaintPipeline,
+    KandinskyV22ControlnetPipeline,
+    KandinskyV22Img2ImgPipeline,
+    KandinskyV22DecoderPipeline,
+    KandinskyV22PriorEmb2EmbPipeline,
+    KandinskyV22PriorPipeline,
+    KandinskyV22InpaintPipeline,
```
What does this PR do?
Fixes # (issue)
Before submitting
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.