Commit b64f835

[docs] Add Kandinsky 3 (huggingface#5988)
* add
* fix api docs
* edits
1 parent 880c0fd commit b64f835

2 files changed: +47 -2 lines changed

docs/source/en/using-diffusers/kandinsky.md

Lines changed: 45 additions & 0 deletions
@@ -20,6 +20,8 @@ The Kandinsky models are a series of multilingual text-to-image generation model

 [Kandinsky 2.2](../api/pipelines/kandinsky_v22) improves on the previous model by replacing the image encoder of the image prior model with a larger CLIP-ViT-G model to improve quality. The image prior model was also retrained on images with different resolutions and aspect ratios to generate higher-resolution images and different image sizes.

+[Kandinsky 3](../api/pipelines/kandinsky3) simplifies the architecture and shifts away from the two-stage generation process involving the prior model and diffusion model. Instead, Kandinsky 3 uses [Flan-UL2](https://huggingface.co/google/flan-ul2) to encode text, a UNet with [BigGAN-deep](https://hf.co/papers/1809.11096) blocks, and [Sber-MoVQGAN](https://github.com/ai-forever/MoVQGAN) to decode the latents into images. Its gains in text understanding and generated image quality come primarily from the larger text encoder and UNet.
+
 This guide will show you how to use the Kandinsky models for text-to-image, image-to-image, inpainting, interpolation, and more.

 Before you begin, make sure you have the following libraries installed:
@@ -33,6 +35,10 @@ Before you begin, make sure you have the following libraries installed:

 Kandinsky 2.1 and 2.2 usage is very similar! The only difference is Kandinsky 2.2 doesn't accept `prompt` as an input when decoding the latents. Instead, Kandinsky 2.2 only accepts `image_embeds` during decoding.

+<br>
+
+Kandinsky 3 has a more concise architecture and it doesn't require a prior model. This means its usage is identical to other diffusion models like [Stable Diffusion XL](sdxl).
+
 </Tip>

 ## Text-to-image
@@ -91,6 +97,23 @@ image
 <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-text-to-image.png"/>
 </div>

+</hfoption>
+<hfoption id="Kandinsky 3">
+
+Kandinsky 3 doesn't require a prior model so you can directly load the [`Kandinsky3Pipeline`] and pass a prompt to generate an image:
+
+```py
+from diffusers import Kandinsky3Pipeline
+import torch
+
+pipeline = Kandinsky3Pipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
+pipeline.enable_model_cpu_offload()
+
+prompt = "An alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
+image = pipeline(prompt).images[0]
+image
+```
+
 </hfoption>
 </hfoptions>

@@ -161,6 +184,20 @@ prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kan
 pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
 ```

+</hfoption>
+<hfoption id="Kandinsky 3">
+
+Kandinsky 3 doesn't require a prior model so you can directly load the image-to-image pipeline:
+
+```py
+from diffusers import Kandinsky3Img2ImgPipeline
+from diffusers.utils import load_image
+import torch
+
+pipeline = Kandinsky3Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
+pipeline.enable_model_cpu_offload()
+```
+
 </hfoption>
 </hfoptions>

@@ -218,6 +255,14 @@ make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], r
 <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-image-to-image.png"/>
 </div>

+</hfoption>
+<hfoption id="Kandinsky 3">
+
+```py
+image = pipeline(prompt, negative_prompt=negative_prompt, image=image, strength=0.75, num_inference_steps=25).images[0]
+image
+```
+
 </hfoption>
 </hfoptions>

src/diffusers/pipelines/kandinsky3/pipeline_kandinsky3.py

Lines changed: 2 additions & 2 deletions
@@ -110,7 +110,7 @@ def encode_prompt(
 Encodes the prompt into text encoder hidden states.

 Args:
-prompt (`str` or `List[str]`, *optional*):
+prompt (`str` or `List[str]`, *optional*):
 prompt to be encoded
 device: (`torch.device`, *optional*):
 torch device to place the resulting embeddings on
@@ -365,7 +365,7 @@ def __call__(
 prompt (`str` or `List[str]`, *optional*):
 The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`.
 instead.
-num_inference_steps (`int`, *optional*, defaults to 50):
+num_inference_steps (`int`, *optional*, defaults to 25):
 The number of denoising steps. More denoising steps usually lead to a higher quality image at the
 expense of slower inference.
 timesteps (`List[int]`, *optional*):
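
The image-to-image call added in the docs diff passes `strength=0.75` together with `num_inference_steps=25`, the same value the corrected docstring reports as the default. Below is a minimal sketch of how those two arguments are commonly combined. It assumes Kandinsky 3 image-to-image uses the same step-skipping arithmetic as other diffusers image-to-image pipelines; the helper name `effective_img2img_steps` is hypothetical and not part of the library.

```python
def effective_img2img_steps(num_inference_steps: int, strength: float) -> int:
    """Estimate how many denoising steps actually run in an image-to-image call.

    Assumption: mirrors the arithmetic commonly used by diffusers img2img
    pipelines, where `strength` decides how far into the noise schedule the
    input image is pushed before denoising starts.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError(f"strength must be in [0, 1], got {strength}")
    # Number of scheduled steps covered by the requested strength.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    # Index of the first scheduled step that is actually executed.
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start


# Values taken from the image-to-image example in the diff.
steps_run = effective_img2img_steps(num_inference_steps=25, strength=0.75)
print(steps_run)
```

Under that assumption, the example call would run 18 of the 25 scheduled steps. A lower `strength` keeps more of the input image and runs fewer steps, while `strength=1.0` runs the full schedule.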
