@@ -19,81 +19,78 @@ The Kandinsky model is created by [Arseniy Shakhmatov](https://github.com/cene55

## Available Pipelines:

-| Pipeline | Tasks | Colab
-|---|---|:---:|
-| [pipeline_kandinsky.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky.py) | *Text-to-Image Generation* | - |
-| [pipeline_kandinsky_inpaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_inpaint.py) | *Image-Guided Image Generation* | - |
-| [pipeline_kandinsky_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_img2img.py) | *Image-Guided Image Generation* | - |
+| Pipeline | Tasks |
+|---|---|
+| [pipeline_kandinsky.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky.py) | *Text-to-Image Generation* |
+| [pipeline_kandinsky_inpaint.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_inpaint.py) | *Image-Guided Image Generation* |
+| [pipeline_kandinsky_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/kandinsky/pipeline_kandinsky_img2img.py) | *Image-Guided Image Generation* |

## Usage example

-In the following, we will walk you through some cool examples of using the Kandinsky pipelines to create some visually aesthetic artwork.
+In the following, we will walk you through some examples of how to use the Kandinsky pipelines to create visually appealing artwork.

### Text-to-Image Generation

-For text-to-image generation, we need to use both [`KandinskyPriorPipeline`] and [`KandinskyPipeline`]. The first step is to encode text prompts with CLIP and then diffuse the CLIP text embeddings to CLIP image embeddings, as first proposed in [DALL-E 2](https://cdn.openai.com/papers/dall-e-2.pdf). Let's throw a fun prompt at Kandinsky to see what it comes up with :)
+For text-to-image generation, we need to use both [`KandinskyPriorPipeline`] and [`KandinskyPipeline`].
+The first step is to encode text prompts with CLIP and then diffuse the CLIP text embeddings to CLIP image embeddings,
+as first proposed in [DALL-E 2](https://cdn.openai.com/papers/dall-e-2.pdf).
+Let's throw a fun prompt at Kandinsky to see what it comes up with.

-```python
+```py
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
-negative_prompt = "low quality, bad quality"
```

-We will pass both the `prompt` and `negative_prompt` to our prior diffusion pipeline. In contrast to other diffusion pipelines, such as Stable Diffusion, the `prompt` and `negative_prompt` shall be passed separately so that we can retrieve a CLIP image embedding for each prompt input. You can use `guidance_scale`, and `num_inference_steps` arguments to guide this process, just like how you would normally do with all other pipelines in diffusers.
+First, let's instantiate the prior pipeline and the text-to-image pipeline. Both
+pipelines are diffusion models.

-```python
-from diffusers import KandinskyPriorPipeline
+
+```py
+from diffusers import DiffusionPipeline
import torch

-# create prior
-pipe_prior = KandinskyPriorPipeline.from_pretrained(
-    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
-)
+pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")

-generator = torch.Generator(device="cuda").manual_seed(12)
-image_emb = pipe_prior(
-    prompt, guidance_scale=1.0, num_inference_steps=25, generator=generator, negative_prompt=negative_prompt
-).images
+t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+t2i_pipe.to("cuda")
+```

-zero_image_emb = pipe_prior(
-    negative_prompt, guidance_scale=1.0, num_inference_steps=25, generator=generator, negative_prompt=negative_prompt
-).images
+Now we pass the prompt through the prior to generate image embeddings. The prior
+returns both the image embeddings corresponding to the prompt and negative/unconditional image
+embeddings corresponding to an empty string.
+
+```py
+generator = torch.Generator(device="cuda").manual_seed(12)
+image_embeds, negative_image_embeds = pipe_prior(prompt, generator=generator).to_tuple()
```

-Once we create the image embedding, we can use [`KandinskyPipeline`] to generate images.
+<Tip warning={true}>

-```python
-from PIL import Image
-from diffusers import KandinskyPipeline
+The text-to-image pipeline expects the `image_embeds`, the `negative_image_embeds`, and the original
+`prompt`, as the text-to-image pipeline uses another text encoder to better guide the second diffusion
+process of `t2i_pipe`.

+By default, the prior returns unconditioned negative image embeddings corresponding to the negative prompt of `""`.
+For better results, you can also pass a `negative_prompt` to the prior. This will increase the effective batch size
+of the prior by a factor of 2.

-def image_grid(imgs, rows, cols):
-    assert len(imgs) == rows * cols
+```py
+prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
+negative_prompt = "low quality, bad quality"

-    w, h = imgs[0].size
-    grid = Image.new("RGB", size=(cols * w, rows * h))
-    grid_w, grid_h = grid.size
+image_embeds, negative_image_embeds = pipe_prior(prompt, negative_prompt, generator=generator).to_tuple()
+```

-    for i, img in enumerate(imgs):
-        grid.paste(img, box=(i % cols * w, i // cols * h))
-    return grid
+</Tip>


-# create diffuser pipeline
-pipe = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
-pipe.to("cuda")
+Next, we can pass the embeddings as well as the prompt to the text-to-image pipeline. Remember that
+if you are using a customized negative prompt, you should also pass it to the text-to-image pipeline
+with `negative_prompt=negative_prompt` (see the short sketch below):

-images = pipe(
-    prompt,
-    image_embeds=image_emb,
-    negative_image_embeds=zero_image_emb,
-    num_images_per_prompt=2,
-    height=768,
-    width=768,
-    num_inference_steps=100,
-    guidance_scale=4.0,
-    generator=generator,
-).images
+```py
+image = t2i_pipe(prompt, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds).images[0]
+image.save("cheeseburger_monster.png")
```

One cheeseburger monster coming up! Enjoy!
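
If you passed a customized `negative_prompt` to the prior as described above, remember to forward the same string to the text-to-image call as well. A minimal sketch, reusing the `negative_prompt` variable from the tip above:

```py
image = t2i_pipe(
    prompt,
    negative_prompt=negative_prompt,
    image_embeds=image_embeds,
    negative_image_embeds=negative_image_embeds,
).images[0]
```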
@@ -164,22 +161,15 @@ prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

generator = torch.Generator(device="cuda").manual_seed(30)
-image_emb = pipe_prior(
-    prompt, guidance_scale=4.0, num_inference_steps=25, generator=generator, negative_prompt=negative_prompt
-).images
-
-zero_image_emb = pipe_prior(
-    negative_prompt, guidance_scale=4.0, num_inference_steps=25, generator=generator, negative_prompt=negative_prompt
-).images
+image_embeds, negative_image_embeds = pipe_prior(prompt, negative_prompt, generator=generator).to_tuple()

out = pipe(
    prompt,
    image=original_image,
-    image_embeds=image_emb,
-    negative_image_embeds=zero_image_emb,
+    image_embeds=image_embeds,
+    negative_image_embeds=negative_image_embeds,
    height=768,
    width=768,
-    num_inference_steps=500,
    strength=0.3,
)

@@ -193,7 +183,7 @@ out.images[0].save("fantasy_land.png")

You can use [`KandinskyInpaintPipeline`] to edit images. In this example, we will add a hat to the portrait of a cat.

-```python
+```py
from diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline
from diffusers.utils import load_image
import torch
@@ -205,7 +195,7 @@ pipe_prior = KandinskyPriorPipeline.from_pretrained(
pipe_prior.to("cuda")

prompt = "a hat"
-image_emb, zero_image_emb = pipe_prior(prompt, return_dict=False)
+prior_output = pipe_prior(prompt)

pipe = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16)
pipe.to("cuda")
@@ -222,8 +212,7 @@ out = pipe(
    prompt,
    image=init_image,
    mask_image=mask,
-    image_embeds=image_emb,
-    negative_image_embeds=zero_image_emb,
+    **prior_output,
    height=768,
    width=768,
    num_inference_steps=150,
@@ -246,7 +235,6 @@ from diffusers.utils import load_image
import PIL

import torch
-from torchvision import transforms

pipe_prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16
@@ -263,22 +251,80 @@ img2 = load_image(

# add all the conditions we want to interpolate, can be either text or image
images_texts = ["a cat", img1, img2]
+
# specify the weights for each condition in images_texts
weights = [0.3, 0.3, 0.4]
-image_emb, zero_image_emb = pipe_prior.interpolate(images_texts, weights)
+
+# We can leave the prompt empty
+prompt = ""
+prior_out = pipe_prior.interpolate(images_texts, weights)

pipe = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
pipe.to("cuda")

-image = pipe(
-    "", image_embeds=image_emb, negative_image_embeds=zero_image_emb, height=768, width=768, num_inference_steps=150
-).images[0]
+image = pipe(prompt, **prior_out, height=768, width=768).images[0]

image.save("starry_cat.png")
```
![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/starry_cat.png)


+## Optimization
+
+Running Kandinsky in inference requires running both a first prior pipeline, [`KandinskyPriorPipeline`],
+and a second image decoding pipeline, which is one of [`KandinskyPipeline`], [`KandinskyImg2ImgPipeline`], or [`KandinskyInpaintPipeline`].
+
+The bulk of the computation time will always be spent in the second image decoding pipeline, so when
+optimizing the model, one should focus on the second image decoding pipeline.
+
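
If you want to confirm this split on your own hardware, a rough timing sketch could look like the following (this assumes the `pipe_prior`, `t2i_pipe`, and `prompt` objects from the usage example above are already set up on the GPU):

```py
import time

import torch

# Time the prior (text -> image embedding) and the image decoding pipeline separately.
torch.cuda.synchronize()
start = time.perf_counter()
image_embeds, negative_image_embeds = pipe_prior(prompt).to_tuple()
torch.cuda.synchronize()
print(f"prior pipeline: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
image = t2i_pipe(prompt, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds).images[0]
torch.cuda.synchronize()
print(f"image decoding pipeline: {time.perf_counter() - start:.2f}s")
```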
+When running with PyTorch < 2.0, we strongly recommend making use of [`xformers`](https://github.com/facebookresearch/xformers)
+to speed up inference. This can be done by simply running:
+
+```py
+from diffusers import DiffusionPipeline
+import torch
+
+t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+t2i_pipe.enable_xformers_memory_efficient_attention()
+```
+
+When running on PyTorch >= 2.0, PyTorch's SDPA attention will automatically be used. For more information on
+PyTorch's SDPA, feel free to have a look at [this blog post](https://pytorch.org/blog/accelerated-diffusers-pt-20/).
+
+To have explicit control, you can also manually set the pipeline to use PyTorch's 2.0 efficient attention:
+
+```py
+from diffusers.models.attention_processor import AttnAddedKVProcessor2_0
+
+t2i_pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())
+```
+
+The slowest and most memory-intensive attention processor is the default `AttnAddedKVProcessor` processor.
+We do **not** recommend using it except for testing purposes or cases where highly deterministic behaviour is desired.
+You can set it with:
+
+```py
+from diffusers.models.attention_processor import AttnAddedKVProcessor
+
+t2i_pipe.unet.set_attn_processor(AttnAddedKVProcessor())
+```
+
+With PyTorch >= 2.0, you can also use Kandinsky with `torch.compile`, which, depending
+on your hardware, can significantly speed up your inference time once the model is compiled.
+To use Kandinsky with `torch.compile`, you can do:
+
+```py
+t2i_pipe.unet.to(memory_format=torch.channels_last)
+t2i_pipe.unet = torch.compile(t2i_pipe.unet, mode="reduce-overhead", fullgraph=True)
+```
+
+After compilation you should see a very fast inference time. For more information,
+feel free to have a look at [our PyTorch 2.0 benchmark](https://huggingface.co/docs/diffusers/main/en/optimization/torch2.0).
+
## KandinskyPriorPipeline

[[autodoc]] KandinskyPriorPipeline
@@ -292,15 +338,14 @@ image.save("starry_cat.png")
  - all
  - __call__

-## KandinskyInpaintPipeline
-
-[[autodoc]] KandinskyInpaintPipeline
-  - all
-  - __call__
-
## KandinskyImg2ImgPipeline

[[autodoc]] KandinskyImg2ImgPipeline
  - all
  - __call__

+## KandinskyInpaintPipeline
+
+[[autodoc]] KandinskyInpaintPipeline
+  - all
+  - __call__