askulkarni2
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
Lines changed: 2 additions & 0 deletions
diff --git a/docs/source/en/api/models/asymmetricautoencoderkl.mdx b/docs/source/en/api/models/asymmetricautoencoderkl.mdx
Lines changed: 55 additions & 0 deletions
diff --git a/scripts/convert_asymmetric_vqgan_to_diffusers.py b/scripts/convert_asymmetric_vqgan_to_diffusers.py
Lines changed: 184 additions & 0 deletions
diff --git a/src/diffusers/__init__.py b/src/diffusers/__init__.py
Lines changed: 1 addition & 0 deletions
diff --git a/src/diffusers/models/__init__.py b/src/diffusers/models/__init__.py
Lines changed: 1 addition & 0 deletions
@@ -166,6 +166,8 @@
       title: VQModel
     - local: api/models/autoencoderkl
       title: AutoencoderKL
+    - local: api/models/asymmetricautoencoderkl
+      title: AsymmetricAutoencoderKL
     - local: api/models/transformer2d
       title: Transformer2D
     - local: api/models/transformer_temporal
 
@@ -0,0 +1,55 @@
+# AsymmetricAutoencoderKL
+
+An improved, larger variational autoencoder (VAE) model with KL loss for the inpainting task, introduced in [Designing a Better Asymmetric VQGAN for StableDiffusion](https://arxiv.org/abs/2306.04632) by Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, and Gang Hua.
+
+The abstract from the paper is:
+
+*StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns a diffusion model in the latent space via a VQGAN, ensuring both efficiency and quality. It not only supports image generation tasks, but also enables image editing for real images, such as image inpainting and local editing. However, we have observed that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To this end, we propose a new asymmetric VQGAN with two simple designs. Firstly, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Secondly, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost. The training cost of our asymmetric VQGAN is cheap, and we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it can significantly improve the inpainting and editing performance, while maintaining the original text-to-image capability. The code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN*
+
+Evaluation results can be found in Section 4.1 of the original paper.
+
+## Available checkpoints
+
+* [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-1-5)
+* [https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2](https://huggingface.co/cross-attention/asymmetric-autoencoder-kl-x-2)
+
+## Example usage
+
+```python
+from io import BytesIO
+from PIL import Image
+import requests
+from diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline
+
+
+def download_image(url: str) -> Image.Image:
+    response = requests.get(url, timeout=30)
+    return Image.open(BytesIO(response.content)).convert("RGB")
+
+
+prompt = "a photo of a person"
+img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
+mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"
+
+image = download_image(img_url).resize((256, 256))
+mask_image = download_image(mask_url).resize((256, 256))
+
+pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
+pipe.vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")
+pipe.to("cuda")
+
+image = pipe(prompt=prompt, image=image, mask_image=mask_image).images[0]
+image.save("image.jpeg")
+```
+
+## AsymmetricAutoencoderKL
+
+[[autodoc]] models.autoencoder_asym_kl.AsymmetricAutoencoderKL
+
+## AutoencoderKLOutput
+
+[[autodoc]] models.autoencoder_kl.AutoencoderKLOutput
+
+## DecoderOutput
+
+[[autodoc]] models.vae.DecoderOutput
@@ -0,0 +1,184 @@
+import argparse
+import time
+from pathlib import Path
+from typing import Any, Dict, Literal
+
+import torch
+
+from diffusers import AsymmetricAutoencoderKL
+
+
+ASYMMETRIC_AUTOENCODER_KL_x_1_5_CONFIG = {
+    "in_channels": 3,
+    "out_channels": 3,
+    "down_block_types": [
+        "DownEncoderBlock2D",
+        "DownEncoderBlock2D",
+        "DownEncoderBlock2D",
+        "DownEncoderBlock2D",
+    ],
+    "down_block_out_channels": [128, 256, 512, 512],
+    "layers_per_down_block": 2,
+    "up_block_types": [
+        "UpDecoderBlock2D",
+        "UpDecoderBlock2D",
+        "UpDecoderBlock2D",
+        "UpDecoderBlock2D",
+    ],
+    "up_block_out_channels": [192, 384, 768, 768],
+    "layers_per_up_block": 3,
+    "act_fn": "silu",
+    "latent_channels": 4,
+    "norm_num_groups": 32,
+    "sample_size": 256,
+    "scaling_factor": 0.18215,
+}
+
+ASYMMETRIC_AUTOENCODER_KL_x_2_CONFIG = {
+    "in_channels": 3,
+    "out_channels": 3,
+    "down_block_types": [
+        "DownEncoderBlock2D",
+        "DownEncoderBlock2D",
+        "DownEncoderBlock2D",
+        "DownEncoderBlock2D",
+    ],
+    "down_block_out_channels": [128, 256, 512, 512],
+    "layers_per_down_block": 2,
+    "up_block_types": [
+        "UpDecoderBlock2D",
+        "UpDecoderBlock2D",
+        "UpDecoderBlock2D",
+        "UpDecoderBlock2D",
+    ],
+    "up_block_out_channels": [256, 512, 1024, 1024],
+    "layers_per_up_block": 5,
+    "act_fn": "silu",
+    "latent_channels": 4,
+    "norm_num_groups": 32,
+    "sample_size": 256,
+    "scaling_factor": 0.18215,
+}
+
+
+def convert_asymmetric_autoencoder_kl_state_dict(original_state_dict: Dict[str, Any]) -> Dict[str, Any]:
+    converted_state_dict = {}
+    for k, v in original_state_dict.items():
+        if k.startswith("encoder."):
+            converted_state_dict[
+                k.replace("encoder.down.", "encoder.down_blocks.")
+                .replace("encoder.mid.", "encoder.mid_block.")
+                .replace("encoder.norm_out.", "encoder.conv_norm_out.")
+                .replace(".downsample.", ".downsamplers.0.")
+                .replace(".nin_shortcut.", ".conv_shortcut.")
+                .replace(".block.", ".resnets.")
+                .replace(".block_1.", ".resnets.0.")
+                .replace(".block_2.", ".resnets.1.")
+                .replace(".attn_1.k.", ".attentions.0.to_k.")
+                .replace(".attn_1.q.", ".attentions.0.to_q.")
+                .replace(".attn_1.v.", ".attentions.0.to_v.")
+                .replace(".attn_1.proj_out.", ".attentions.0.to_out.0.")
+                .replace(".attn_1.norm.", ".attentions.0.group_norm.")
+            ] = v
+        elif k.startswith("decoder.") and "up_layers" not in k:
+            converted_state_dict[
+                k.replace("decoder.encoder.", "decoder.condition_encoder.")
+                .replace(".norm_out.", ".conv_norm_out.")
+                .replace(".up.0.", ".up_blocks.3.")
+                .replace(".up.1.", ".up_blocks.2.")
+                .replace(".up.2.", ".up_blocks.1.")
+                .replace(".up.3.", ".up_blocks.0.")
+                .replace(".block.", ".resnets.")
+                .replace("mid", "mid_block")
+                .replace(".0.upsample.", ".0.upsamplers.0.")
+                .replace(".1.upsample.", ".1.upsamplers.0.")
+                .replace(".2.upsample.", ".2.upsamplers.0.")
+                .replace(".nin_shortcut.", ".conv_shortcut.")
+                .replace(".block_1.", ".resnets.0.")
+                .replace(".block_2.", ".resnets.1.")
+                .replace(".attn_1.k.", ".attentions.0.to_k.")
+                .replace(".attn_1.q.", ".attentions.0.to_q.")
+                .replace(".attn_1.v.", ".attentions.0.to_v.")
+                .replace(".attn_1.proj_out.", ".attentions.0.to_out.0.")
+                .replace(".attn_1.norm.", ".attentions.0.group_norm.")
+            ] = v
+        elif k.startswith("quant_conv."):
+            converted_state_dict[k] = v
+        elif k.startswith("post_quant_conv."):
+            converted_state_dict[k] = v
+        else:
+            print(f"  skipping key `{k}`")
+    # fix weight shapes: the original 1x1-conv attention weights (out, in, 1, 1) become linear weights (out, in)
+    for k, v in converted_state_dict.items():
+        if (
+            (k.startswith("encoder.mid_block.attentions.0") or k.startswith("decoder.mid_block.attentions.0"))
+            and k.endswith("weight")
+            and ("to_q" in k or "to_k" in k or "to_v" in k or "to_out" in k)
+        ):
+            converted_state_dict[k] = converted_state_dict[k][:, :, 0, 0]
+
+    return converted_state_dict
+
+
+def get_asymmetric_autoencoder_kl_from_original_checkpoint(
+    scale: Literal["1.5", "2"], original_checkpoint_path: str, map_location: torch.device
+) -> AsymmetricAutoencoderKL:
+    print("Loading original state_dict")
+    original_state_dict = torch.load(original_checkpoint_path, map_location=map_location)
+    original_state_dict = original_state_dict["state_dict"]
+    print("Converting state_dict")
+    converted_state_dict = convert_asymmetric_autoencoder_kl_state_dict(original_state_dict)
+    kwargs = ASYMMETRIC_AUTOENCODER_KL_x_1_5_CONFIG if scale == "1.5" else ASYMMETRIC_AUTOENCODER_KL_x_2_CONFIG
+    print("Initializing AsymmetricAutoencoderKL model")
+    asymmetric_autoencoder_kl = AsymmetricAutoencoderKL(**kwargs)
+    print("Loading weight from converted state_dict")
+    asymmetric_autoencoder_kl.load_state_dict(converted_state_dict)
+    asymmetric_autoencoder_kl.eval()
+    print("AsymmetricAutoencoderKL successfully initialized")
+    return asymmetric_autoencoder_kl
+
+
+if __name__ == "__main__":
+    start = time.time()
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--scale",
+        default=None,
+        type=str,
+        required=True,
+        help="Asymmetric VQGAN scale: `1.5` or `2`",
+    )
+    parser.add_argument(
+        "--original_checkpoint_path",
+        default=None,
+        type=str,
+        required=True,
+        help="Path to the original Asymmetric VQGAN checkpoint",
+    )
+    parser.add_argument(
+        "--output_path",
+        default=None,
+        type=str,
+        required=True,
+        help="Path to save pretrained AsymmetricAutoencoderKL model",
+    )
+    parser.add_argument(
+        "--map_location",
+        default="cpu",
+        type=str,
+        required=False,
+        help="The device passed to `map_location` when loading the checkpoint",
+    )
+    args = parser.parse_args()
+
+    assert args.scale in ["1.5", "2"], f"{args.scale} should be `1.5` or `2`"
+    assert Path(args.original_checkpoint_path).is_file()
+
+    asymmetric_autoencoder_kl = get_asymmetric_autoencoder_kl_from_original_checkpoint(
+        scale=args.scale,
+        original_checkpoint_path=args.original_checkpoint_path,
+        map_location=torch.device(args.map_location),
+    )
+    print("Saving pretrained AsymmetricAutoencoderKL")
+    asymmetric_autoencoder_kl.save_pretrained(args.output_path)
+    print(f"Done in {time.time() - start:.2f} seconds")
@@ -36,6 +36,7 @@
     from .utils.dummy_pt_objects import *  # noqa F403
 else:
     from .models import (
+        AsymmetricAutoencoderKL,
         AutoencoderKL,
         ControlNetModel,
         ModelMixin,
 
@@ -17,6 +17,7 @@
 
 if is_torch_available():
     from .adapter import MultiAdapter, T2IAdapter
+    from .autoencoder_asym_kl import AsymmetricAutoencoderKL
     from .autoencoder_kl import AutoencoderKL
     from .controlnet import ControlNetModel
     from .dual_transformer_2d import DualTransformer2DModel
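
The encoder branch of `convert_asymmetric_autoencoder_kl_state_dict` in the conversion script above is a chain of `str.replace` calls applied to each checkpoint key. Below is a minimal sketch of that renaming using plain strings instead of tensors. The rule table copies the encoder rules from the script in the same order. The helper name `rename_encoder_key` and the sample keys are hypothetical, chosen only to exercise a few rules; they are not taken from a real checkpoint.

```python
# Encoder renaming rules copied from the conversion script, in the same order.
ENCODER_RULES = [
    ("encoder.down.", "encoder.down_blocks."),
    ("encoder.mid.", "encoder.mid_block."),
    ("encoder.norm_out.", "encoder.conv_norm_out."),
    (".downsample.", ".downsamplers.0."),
    (".nin_shortcut.", ".conv_shortcut."),
    (".block.", ".resnets."),
    (".block_1.", ".resnets.0."),
    (".block_2.", ".resnets.1."),
    (".attn_1.k.", ".attentions.0.to_k."),
    (".attn_1.q.", ".attentions.0.to_q."),
    (".attn_1.v.", ".attentions.0.to_v."),
    (".attn_1.proj_out.", ".attentions.0.to_out.0."),
    (".attn_1.norm.", ".attentions.0.group_norm."),
]


def rename_encoder_key(key: str) -> str:
    """Apply every rule in order; rules that do not match leave the key unchanged."""
    for old, new in ENCODER_RULES:
        key = key.replace(old, new)
    return key


# Hypothetical sample keys written in the original checkpoint's naming style.
sample_keys = [
    "encoder.down.0.block.1.conv1.weight",
    "encoder.mid.attn_1.q.weight",
    "encoder.norm_out.bias",
]
renamed = {k: rename_encoder_key(k) for k in sample_keys}
for old, new in renamed.items():
    print(f"{old} -> {new}")
```

Because the rules are applied in sequence, their order matters: `encoder.mid.` is rewritten to `encoder.mid_block.` before the `.block.` rule runs, and `mid_block.` does not match `.block.` since the preceding character is an underscore rather than a dot.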