Asymmetric vqgan #3956
Conversation
Hi @cross-attention! Thanks for your PR! Could you maybe share some results with this new autoencoder? That will help us to better evaluate it. Maybe you could do:

```python
from diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline

vae = AsymmetricAutoencoderKL.from_pretrained("the-ckpt-id")
pipeline = StableDiffusionInpaintPipeline.from_pretrained(ckpt_id, vae=vae).to("cuda")
...
```

That would be great!
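For readers landing here later, a fuller version of that snippet might look like the sketch below; the inpainting base checkpoint, image/mask URLs, and prompt are placeholders rather than values from this PR:

```python
from diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline
from diffusers.utils import load_image

# Placeholder checkpoint id and URLs, for illustration only.
vae = AsymmetricAutoencoderKL.from_pretrained("the-ckpt-id")
pipeline = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", vae=vae
).to("cuda")

image = load_image("https://example.com/image.png").resize((512, 512))
mask = load_image("https://example.com/mask.png").resize((512, 512))
result = pipeline(prompt="a photo of a cat", image=image, mask_image=mask).images[0]
result.save("inpainted.png")
```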
@sayakpaul
That looks great! Thank you! Which checkpoint did you use for the final two cases? Could you maybe provide us some code snippets?
I used the original checkpoints from https://github.com/buxiangzhiren/Asymmetric_VQGAN/

```python
import torch

from diffusers import AsymmetricAutoencoderKL

# Instantiate the config matching the checkpoint you want to convert
# (run only one of the two blocks below; as written, the second overwrites the first).

# x1.5
ckpt = torch.load("./checkpoints/larger1.5.ckpt", map_location="cpu")
vae = AsymmetricAutoencoderKL(
    in_channels=3,
    out_channels=3,
    down_block_types=("DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D"),
    down_block_out_channels=(128, 256, 512, 512),
    layers_per_down_block=2,
    up_block_types=("UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D"),
    up_block_out_channels=(192, 384, 768, 768),
    layers_per_up_block=3,
    act_fn="silu",
    latent_channels=4,
    norm_num_groups=32,
    sample_size=256,
    scaling_factor=0.18215,
)

# x2
ckpt = torch.load("./checkpoints/larger2.ckpt", map_location="cpu")
vae = AsymmetricAutoencoderKL(
    in_channels=3,
    out_channels=3,
    down_block_types=("DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D"),
    down_block_out_channels=(128, 256, 512, 512),
    layers_per_down_block=2,
    up_block_types=("UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D", "UpDecoderBlock2D"),
    up_block_out_channels=(256, 512, 1024, 1024),
    layers_per_up_block=5,
    act_fn="silu",
    latent_channels=4,
    norm_num_groups=32,
    sample_size=256,
    scaling_factor=0.18215,
)

# Match keys: rename the original state-dict entries to the diffusers naming scheme.
enc_dict = {
    k.replace("encoder.down.", "encoder.down_blocks.")
    .replace("encoder.mid.", "encoder.mid_block.")
    .replace("encoder.norm_out.", "encoder.conv_norm_out.")
    .replace(".downsample.", ".downsamplers.0.")
    .replace(".nin_shortcut.", ".conv_shortcut.")
    .replace(".block.", ".resnets.")
    .replace(".block_1.", ".resnets.0.")
    .replace(".block_2.", ".resnets.1.")
    .replace(".attn_1.k.", ".attentions.0.to_k.")
    .replace(".attn_1.q.", ".attentions.0.to_q.")
    .replace(".attn_1.v.", ".attentions.0.to_v.")
    .replace(".attn_1.proj_out.", ".attentions.0.to_out.0.")
    .replace(".attn_1.norm.", ".attentions.0.group_norm."): v
    for k, v in ckpt["state_dict"].items()
    if k.startswith("encoder.")
}
# The original attention projections are 1x1 convs; squeeze them to linear weights.
for k in enc_dict.keys():
    if (
        k.startswith("encoder.mid_block.attentions.0")
        and k.endswith("weight")
        and ("to_q" in k or "to_k" in k or "to_v" in k or "to_out" in k)
    ):
        enc_dict[k] = enc_dict[k][:, :, 0, 0]
dec_dict = {
    k.replace(".norm_out.", ".conv_norm_out.")
    .replace(".up.0.", ".up_blocks.3.")
    .replace(".up.1.", ".up_blocks.2.")
    .replace(".up.2.", ".up_blocks.1.")
    .replace(".up.3.", ".up_blocks.0.")
    .replace(".block.", ".resnets.")
    .replace("mid", "mid_block")
    .replace(".0.upsample.", ".0.upsamplers.0.")
    .replace(".1.upsample.", ".1.upsamplers.0.")
    .replace(".2.upsample.", ".2.upsamplers.0.")
    .replace(".nin_shortcut.", ".conv_shortcut.")
    .replace(".block_1.", ".resnets.0.")
    .replace(".block_2.", ".resnets.1.")
    .replace(".attn_1.k.", ".attentions.0.to_k.")
    .replace(".attn_1.q.", ".attentions.0.to_q.")
    .replace(".attn_1.v.", ".attentions.0.to_v.")
    .replace(".attn_1.proj_out.", ".attentions.0.to_out.0.")
    .replace(".attn_1.norm.", ".attentions.0.group_norm."): v
    for k, v in ckpt["state_dict"].items()
    if (
        k.startswith("decoder.")
        and not k.startswith("decoder.up_layers.")
        and not k.startswith("decoder.encoder.")
    )
}
for k in dec_dict.keys():
    if (
        k.startswith("decoder.mid_block.attentions.0")
        and k.endswith("weight")
        and ("to_q" in k or "to_k" in k or "to_v" in k or "to_out" in k)
    ):
        dec_dict[k] = dec_dict[k][:, :, 0, 0]
# The condition encoder lives under decoder.up_layers / decoder.encoder in the original.
cond_enc_dict = {
    k.replace("decoder.up_layers.", "decoder.condition_encoder.up_layers.")
    .replace("decoder.encoder.", "decoder.condition_encoder."): v
    for k, v in ckpt["state_dict"].items()
    if (k.startswith("decoder.up_layers.") or k.startswith("decoder.encoder."))
}
quant_conv_dict = {k: v for k, v in ckpt["state_dict"].items() if k.startswith("quant_conv.")}
post_quant_conv_dict = {k: v for k, v in ckpt["state_dict"].items() if k.startswith("post_quant_conv.")}
vae.load_state_dict({**quant_conv_dict, **post_quant_conv_dict, **enc_dict, **dec_dict, **cond_enc_dict})
```
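As a quick sanity check after load_state_dict succeeds, a round trip like the sketch below could be run. It assumes the decode signature with the optional image/mask keywords discussed later in this thread, and the save path is a placeholder:

```python
import torch

vae.eval()
with torch.no_grad():
    x = torch.randn(1, 3, 256, 256)    # stand-in image batch
    mask = torch.ones(1, 1, 256, 256)  # dummy mask, just to exercise the code path
    z = vae.encode(x).latent_dist.sample()
    rec = vae.decode(z, image=x, mask=mask).sample
print(rec.shape)  # expected: torch.Size([1, 3, 256, 256])

vae.save_pretrained("./asymmetric-autoencoder-kl")  # placeholder directory
```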
Great, this is superb stuff! From my end, I think the PR is already in good shape. I think we need the following:
Let me know if anything is unclear here :-) More than happy to help.
@sayakpaul
Looks fantastic to me!
Final TODOs:
- https://github.com/huggingface/diffusers/pull/3956/files#r1259162306
- Add nice model cards to https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2 and https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5 so that the community is aware of these.
* [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5)
* [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2)
Since now we have https://huggingface.co/buxiangzhiren and we have made contact with the author, I think we can transfer the repositories. @patrickvonplaten could you please help?
Happy to transfer them once merged
Done!
@sayakpaul @patrickvonplaten
```python
image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
condition_kwargs = {}
if isinstance(self.vae, AsymmetricAutoencoderKL):
    mask_condition = mask_condition.to(device=device, dtype=masked_image_latents.dtype)
```
Can we maybe compute the init_image_condition only here? Since it's not needed for the "normal" VAE?
```python
init_image = init_image.to(device=device, dtype=masked_image_latents.dtype)
init_image_condition = init_image.clone()
init_image = self._encode_vae_image(init_image, generator=generator)
```
I think this is only needed when the decoder is of type AsymmetricAutoencoderKL - should we maybe add it further down, e.g. here: https://github.com/huggingface/diffusers/pull/3956/files#r1266533587
This way we can save an image encoding step.
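Putting both suggestions together, the pipeline code might end up looking roughly like this sketch (variable names are taken from the diffs above; this is an illustration, not the merged code):

```python
# Prepare the conditioning inputs only when the VAE can actually use them;
# the plain AutoencoderKL path skips the extra clone and transfers.
condition_kwargs = {}
if isinstance(self.vae, AsymmetricAutoencoderKL):
    init_image = init_image.to(device=device, dtype=masked_image_latents.dtype)
    init_image_condition = init_image.clone()
    mask_condition = mask_condition.to(device=device, dtype=masked_image_latents.dtype)
    condition_kwargs = {"image": init_image_condition, "mask": mask_condition}
image = self.vae.decode(
    latents / self.vae.config.scaling_factor, return_dict=False, **condition_kwargs
)[0]
```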
```diff
@@ -173,6 +173,56 @@ def test_output_pretrained(self):
         self.assertTrue(torch_all_close(output_slice, expected_output_slice, rtol=1e-2))


+class AsymmetricAutoencoderKLTests(ModelTesterMixin, UNetTesterMixin, unittest.TestCase):
```
Nice tests!
Almost there I think! Can we also add some tests to https://github.com/huggingface/diffusers/blob/main/tests/pipelines/stable_diffusion/test_stable_diffusion_inpaint.py to make sure the inpainting pipeline works as expected? :-)
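A rough sketch of such a fast test, assuming the helper conventions of that file (get_dummy_components / get_dummy_inputs / torch_device) and a made-up tiny VAE config mirroring the dummy AutoencoderKL used there:

```python
# Hypothetical test method inside the existing inpaint fast-test class;
# all names and config values here are illustrative assumptions.
def test_stable_diffusion_inpaint_with_asymmetric_vae(self):
    components = self.get_dummy_components()
    components["vae"] = AsymmetricAutoencoderKL(
        in_channels=3,
        out_channels=3,
        down_block_types=("DownEncoderBlock2D", "DownEncoderBlock2D"),
        down_block_out_channels=(32, 64),
        layers_per_down_block=1,
        up_block_types=("UpDecoderBlock2D", "UpDecoderBlock2D"),
        up_block_out_channels=(32, 64),
        layers_per_up_block=1,
        latent_channels=4,
        norm_num_groups=32,
        sample_size=128,
    )
    pipe = StableDiffusionInpaintPipeline(**components).to(torch_device)
    pipe.set_progress_bar_config(disable=None)
    inputs = self.get_dummy_inputs(torch_device)
    image = pipe(**inputs).images
    # Smoke check only: the pipeline should run end to end with the asymmetric VAE.
    assert image.shape[0] == 1 and image.shape[-1] == 3
```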
@patrickvonplaten
cool PR!
```python
for l in range(len(self.layers)):
    layer = self.layers[l]
    x = layer(x)
    out[str(tuple(x.shape))] = x
```
What's happening here? We use the string of shapes as keys to store the encoded condition and use them to match with decoder blocks?
Is it possible that two layer outputs have the same shape?
Yup, they are different.
I still don't like this - if it's possible to configure the model in a way that outputs remain the same shape between two layers, we will have a problem here.
cc @patrickvonplaten @sayakpaul let me know what you think
> if it's possible to configure the model in a way that outputs remain the same shape between two layers, we will have a problem here

Valid concern. If there's a possibility of the underlying model doing this, then, yes, let's try to rejig this part.
Agree here as well - @cross-attention could we maybe do one more round of refactoring here:
- instead of creating a dict here, we create a list of tuples that will be returned
- we also do the interpolation already here in this function instead of here: https://github.com/huggingface/diffusers/pull/3956/files#r1268048130
- then in the decoder, we make image and mask required forward args: https://github.com/huggingface/diffusers/pull/3956/files#r1268045503
- we keep the decoder code much cleaner by just popping an element from the tuple
Would something like this work? See the sketch below.
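A rough sketch of that refactor (hypothetical class and argument names; the real MaskConditionEncoder may differ):

```python
import torch
import torch.nn as nn

class MaskConditionEncoderSketch(nn.Module):
    """Illustrative only: yields the conditioning features as an ordered list of
    (feature, resized_mask) tuples instead of a shape-keyed dict."""

    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, masked_image, mask):
        out = []
        x = masked_image
        for layer in self.layers:
            x = layer(x)
            # do the mask interpolation here, once, so the decoder needs no bookkeeping
            mask_ = nn.functional.interpolate(mask, size=x.shape[-2:], mode="nearest")
            out.append((x, mask_))
        return out

# Tiny usage demo with made-up conv layers:
enc = MaskConditionEncoderSketch(
    [nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.Conv2d(8, 16, 3, stride=2, padding=1)]
)
conditions = enc(torch.randn(1, 3, 64, 64), torch.ones(1, 1, 64, 64))
# The decoder would then consume the matching pair per up block, e.g.:
#   feature, mask_ = conditions.pop()  # pop order depends on layer arrangement
#   sample = sample * mask_ + feature * (1 - mask_)
```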
With optional mask and image we can also use AsymmetricAutoencoderKL for text2image.
The mask in that case would be None, right? And it seems like AsymmetricAutoencoderKL already handles this case?
If so, it might be good to add a test to clarify that (potentially in a future PR).
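Such a test could be as small as checking that decode runs with the conditioning inputs omitted; the tiny config below is illustrative only:

```python
import torch
from diffusers import AsymmetricAutoencoderKL

# Made-up minimal config (see the conversion script above for the real x1.5/x2 configs).
vae = AsymmetricAutoencoderKL(
    in_channels=3,
    out_channels=3,
    down_block_types=("DownEncoderBlock2D",),
    down_block_out_channels=(32,),
    layers_per_down_block=1,
    up_block_types=("UpDecoderBlock2D",),
    up_block_out_channels=(32,),
    layers_per_up_block=1,
    latent_channels=4,
    norm_num_groups=32,
    sample_size=32,
)
z = torch.randn(1, 4, 32, 32)
out = vae.decode(z).sample  # no image/mask passed: the text2image path
assert out.shape == (1, 3, 32, 32)
```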
@patrickvonplaten @yiyixuxu what's pending in this PR? From what I see, this is the only thing that's pending: #3956 (comment). Let's try to ship this soon.
Sweet! Thanks for iterating.
```python
self.gradient_checkpointing = False

def forward(self, z, image=None, mask=None, latent_embeds=None):
```
If I understand correctly, image and mask can't really be None, no? Can we maybe force the user to pass both image and mask here -> this would make the code much easier to follow.
They can be None - in this case the decoder will work without the mask/image condition.
In this case does it also yield improvements?
The quality is almost equal in the text2image setup, according to the paper.
```python
if image is not None and mask is not None:
    sample_ = im_x[str(tuple(sample.shape))]
    mask_ = nn.functional.interpolate(mask, size=sample.shape[-2:], mode="nearest")
    sample = sample * mask_ + sample_ * (1 - mask_)
```
Can we move this code to the condition encoder forward method instead? I think it would be better placed there
@patrickvonplaten
Let's merge this PR for now as is, but if usage of Asymmetric goes up, I'd like to do a refactor here where we don't return a dict in the form <image_shape: image_out> but just a tuple of type <image_out> instead. By default we prefer the design of returning immutable tuples in diffusers over dicts. But ok for now.
Let's merge the Hub checkpoints too, @patrickvonplaten.
* added AsymmetricAutoencoderKL
* fixed copies+dummy
* added script to convert original asymmetric vqgan
* added docs
* updated docs
* fixed style
* fixes, added tests
* update doc
* fixed doc
* fixed tests
* naming
* updated code example
* updated doc
* comments fixes
* added docstring
* comments fixes
* added inpaint pipeline tests
* comment suggestion: delete method
* yet another fixes

Co-authored-by: Ruslan Vorovchenko <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Added AsymmetricAutoencoderKL for Stable Diffusion Inpainting

Added the AsymmetricAutoencoderKL model from "Designing a Better Asymmetric VQGAN for StableDiffusion" (https://arxiv.org/abs/2306.04632). Added its support in StableDiffusionInpaintPipeline.
Who can review?
@patrickvonplaten @sayakpaul