
Commit 4f99619

committed
first draft
1 parent 946bb53 commit 4f99619

File tree

1 file changed

+264
-27
lines changed

docs/source/en/using-diffusers/conditional_image_generation.md

Lines changed: 264 additions & 27 deletions
@@ -10,51 +10,288 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
1010
specific language governing permissions and limitations under the License.
1111
-->
1212

13-
# Conditional image generation
13+
# Text-to-image
1414

1515
[[open-in-colab]]
1616

17-
Conditional image generation allows you to generate images from a text prompt. The text is converted into embeddings which are used to condition the model to generate an image from noise.
17+
Text-to-image generates an image from a text description, also known as a *prompt* (for example, "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"). At a very high level, a latent diffusion model takes a prompt and some random initial noise, and iteratively removes the noise to construct an image. This *denoising* process is guided by the prompt, and once it ends after a predetermined number of timesteps, the latent image representation is decoded into an image.
1818

19-
The [`DiffusionPipeline`] is the easiest way to use a pre-trained diffusion system for inference.
19+
<Tip>
2020

21-
Start by creating an instance of [`DiffusionPipeline`] and specify which pipeline [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) you would like to download.
21+
Read the [How does Stable Diffusion work?](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) blog post to learn more details about how the model works.
2222

23-
In this guide, you'll use [`DiffusionPipeline`] for text-to-image generation with [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5):
23+
</Tip>
2424

25-
```python
26-
>>> from diffusers import DiffusionPipeline
25+
You can do this in 🤗 Diffusers in just two steps:
2726

28-
>>> generator = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True)
27+
1. Load a checkpoint into the [`AutoPipelineForText2Image`] class, which automatically detects the appropriate pipeline class to use based on the checkpoint:
28+
29+
```py
30+
from diffusers import AutoPipelineForText2Image
import torch
31+
32+
pipeline = AutoPipelineForText2Image.from_pretrained(
33+
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
34+
).to("cuda")
35+
```
36+
37+
2. Pass a prompt to the pipeline to generate an image:
38+
39+
```py
40+
image = pipeline(
41+
"stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k"
42+
).images[0]
43+
```
44+
45+
<div class="flex justify-center">
46+
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-vader.png"/>
47+
</div>
48+
49+
## Popular models
50+
51+
The most common text-to-image models are [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [Stable Diffusion XL (SDXL)](sdxl), Kandinsky 2.2, and [ControlNet](controlnet). The results from each model are slightly different because of their architecture and training process, but no matter which model you choose, their usage is more or less the same.
52+
53+
### Stable Diffusion v1.5
54+
55+
Stable Diffusion v1.5 is a latent diffusion model initialized from an earlier checkpoint, and finetuned for 595K steps on 512x512 images from the LAION-Aesthetics V2 dataset. You can use this model like:
56+
57+
```py
58+
from diffusers import AutoPipelineForText2Image
59+
import torch
60+
61+
pipeline = AutoPipelineForText2Image.from_pretrained(
62+
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
63+
).to("cuda")
64+
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").images[0]
65+
```
66+
67+
### Stable Diffusion XL
68+
69+
SDXL is a much larger version of the previous Stable Diffusion models, and involves a two-stage model process that adds even more details to an image. It also includes some additional *micro-conditionings* to generate high-quality centered images. Take a look at the more comprehensive [SDXL](sdxl) guide to learn more about how to use it. But in general, you can use SDXL like:
70+
71+
```py
72+
from diffusers import AutoPipelineForText2Image
73+
import torch
74+
75+
pipeline = AutoPipelineForText2Image.from_pretrained(
76+
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
77+
).to("cuda")
78+
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").images[0]
2979
```
3080

31-
The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components.
32-
Because the model consists of roughly 1.4 billion parameters, we strongly recommend running it on a GPU.
33-
You can move the generator object to a GPU, just like you would in PyTorch:
81+
### Kandinsky 2.2
3482

35-
```python
36-
>>> generator.to("cuda")
83+
The Kandinsky model is a bit different from the Stable Diffusion models because it also uses an image prior model to create embeddings that are used to better align text and images in the diffusion model. Take a look at the more comprehensive Kandinsky guide to learn more about how to use it. The easiest way to use Kandinsky 2.2 is:
84+
85+
```py
86+
from diffusers import AutoPipelineForText2Image
87+
import torch
88+
89+
pipeline = AutoPipelineForText2Image.from_pretrained(
90+
"kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
91+
).to("cuda")
92+
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").images[0]
3793
```
3894

39-
Now you can use the `generator` on your text prompt:
95+
### ControlNet
96+
97+
ControlNet models offer a diverse set of options for more explicit control over a generated image. With ControlNet, you add an additional conditioning input image to the model. For example, if you provide an image of a human pose (usually represented as multiple keypoints that are connected into a skeleton) as a conditioning input, the model generates an image that follows the pose of the image. Check out the more in-depth [ControlNet](controlnet) guide to learn more about other conditioning inputs and how to use them.
98+
99+
In this example, let's condition the ControlNet with a human pose estimation image. Load the ControlNet model pretrained on human pose estimations:
100+
101+
```py
102+
from diffusers import ControlNetModel, AutoPipelineForText2Image
from diffusers.utils import load_image
103+
import torch
40104

41-
```python
42-
>>> image = generator("An image of a squirrel in Picasso style").images[0]
105+
controlnet = ControlNetModel.from_pretrained(
106+
"lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
107+
).to("cuda")
108+
pose_image = load_image("https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png")
43109
```
44110

45-
The output is by default wrapped into a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object.
111+
Pass the `controlnet` to the [`AutoPipelineForText2Image`], and provide the prompt and pose estimation image to control generation:
112+
113+
```py
114+
pipeline = AutoPipelineForText2Image.from_pretrained(
115+
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, variant="fp16", use_safetensors=True
116+
).to("cuda")
117+
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", image=pose_image).images[0]
118+
```
119+
120+
<div class="flex flex-row gap-4">
121+
<div class="flex-1">
122+
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-1.png"/>
123+
<figcaption class="mt-2 text-center text-sm text-gray-500">Stable Diffusion v1.5</figcaption>
124+
</div>
125+
<div class="flex-1">
126+
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"/>
127+
<figcaption class="mt-2 text-center text-sm text-gray-500">Stable Diffusion XL</figcaption>
128+
</div>
129+
<div class="flex-1">
130+
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-2.png"/>
131+
<figcaption class="mt-2 text-center text-sm text-gray-500">Kandinsky 2.2</figcaption>
132+
</div>
133+
<div class="flex-1">
134+
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-3.png"/>
135+
<figcaption class="mt-2 text-center text-sm text-gray-500">ControlNet (pose detection)</figcaption>
136+
</div>
137+
</div>
138+
139+
## Configure pipeline parameters
46140

47-
You can save the image by calling:
141+
There are a number of parameters that can be configured in the pipeline that affect how an image is generated. You can change the image's output size, specify a negative prompt to improve image quality, and more. This section dives deeper into how to use these parameters.
48142

49-
```python
50-
>>> image.save("image_of_squirrel_painting.png")
143+
### Height and width
144+
145+
The `height` and `width` parameters control the height and width (in pixels) of the generated image. By default, the Stable Diffusion v1.5 model outputs 512x512 images, but you can change this to any size that is a multiple of 8. For example, to create a portrait-oriented image:
146+
147+
```py
148+
from diffusers import AutoPipelineForText2Image
149+
import torch
150+
151+
pipeline = AutoPipelineForText2Image.from_pretrained(
152+
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
153+
).to("cuda")
154+
image = pipeline(
155+
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", height=768, width=512
156+
).images[0]
51157
```
52158

53-
Try out the Spaces below, and feel free to play around with the guidance scale parameter to see how it affects the image quality!
159+
<div class="flex justify-center">
160+
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-hw.png"/>
161+
</div>
162+
163+
<Tip warning={true}>
164+
165+
Other models may have different default image sizes depending on the image sizes in the training dataset. For example, SDXL's default image size is 1024x1024 and using lower `height` and `width` values may result in lower quality images. Make sure you check the model's API reference first!
166+
167+
</Tip>
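
As a rough sketch of why the size matters: Stable Diffusion v1.5 denoises in a latent space that is 8 times smaller than the output image along each side, so `height` and `width` should be multiples of 8. The helper names below (`snap_to_multiple`, `latent_shape`) are hypothetical and not part of 🤗 Diffusers, and the factor of 8 and the 4 latent channels are assumptions that hold for Stable Diffusion v1.5 but may differ for other models.

```py
VAE_SCALE_FACTOR = 8  # assumed downsampling factor of the SD v1.5 autoencoder
LATENT_CHANNELS = 4   # assumed number of latent channels in SD v1.5


def snap_to_multiple(size, multiple=VAE_SCALE_FACTOR):
    """Round a requested size down to the nearest multiple (never below one multiple)."""
    return max(multiple, (size // multiple) * multiple)


def latent_shape(height, width):
    """Return (channels, latent_height, latent_width) for a single image."""
    if height % VAE_SCALE_FACTOR or width % VAE_SCALE_FACTOR:
        raise ValueError("height and width must be divisible by 8")
    return (LATENT_CHANNELS, height // VAE_SCALE_FACTOR, width // VAE_SCALE_FACTOR)


height, width = snap_to_multiple(770), snap_to_multiple(512)
print(height, width)                # 768 512
print(latent_shape(height, width))  # (4, 96, 64)
```

The snapped values can then be passed to the pipeline as `height=height, width=width`.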
168+
169+
### Guidance scale
170+
171+
The `guidance_scale` parameter determines how important the prompt is in guiding image generation. A lower value gives the model more "creativity" to generate images that are more loosely related to the prompt. Higher `guidance_scale` values push the model to follow the prompt more closely, and if this value is too high, you may observe some artifacts in the generated image.
172+
173+
```py
174+
from diffusers import AutoPipelineForText2Image
175+
import torch
176+
177+
pipeline = AutoPipelineForText2Image.from_pretrained(
178+
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
179+
).to("cuda")
180+
image = pipeline(
181+
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", guidance_scale=3.5
182+
).images[0]
183+
```
184+
185+
<div class="flex flex-row gap-4">
186+
<div class="flex-1">
187+
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-guidance-scale-2.5.png"/>
188+
<figcaption class="mt-2 text-center text-sm text-gray-500">guidance_scale = 2.5</figcaption>
189+
</div>
190+
<div class="flex-1">
191+
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-guidance-scale-7.5.png"/>
192+
<figcaption class="mt-2 text-center text-sm text-gray-500">guidance_scale = 7.5</figcaption>
193+
</div>
194+
<div class="flex-1">
195+
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-guidance-scale-10.5.png"/>
196+
<figcaption class="mt-2 text-center text-sm text-gray-500">guidance_scale = 10.5</figcaption>
197+
</div>
198+
</div>
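
Under the hood, `guidance_scale` is the weight used in classifier-free guidance: at each denoising step the model predicts the noise once without the prompt and once with it, then extrapolates from the first prediction toward the second. The sketch below shows that arithmetic on plain Python lists. The function name `apply_guidance` and the toy numbers are illustrative assumptions; a real pipeline applies the same combination to tensors.

```py
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine the unconditional and prompt-conditioned noise predictions."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


noise_uncond = [0.0, 1.0, -2.0]  # toy prediction without the prompt
noise_text = [1.0, 1.0, 0.0]     # toy prediction with the prompt

print(apply_guidance(noise_uncond, noise_text, 1.0))  # [1.0, 1.0, 0.0]
print(apply_guidance(noise_uncond, noise_text, 7.5))  # [7.5, 1.0, 13.0]
```

A scale of 1 simply reproduces the prompt-conditioned prediction, while larger values push further along the prompt direction, which is why very high values can introduce artifacts.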
199+
200+
### Negative prompt
201+
202+
Just like how a prompt guides generation, a *negative prompt* steers the model away from things you don't want the model to generate. This is commonly used to improve overall image quality by removing poor or bad image features such as "low resolution" or "bad details". You can also use a negative prompt to remove or modify the content and style of an image.
203+
204+
```py
205+
from diffusers import AutoPipelineForText2Image
206+
import torch
207+
208+
pipeline = AutoPipelineForText2Image.from_pretrained(
209+
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
210+
).to("cuda")
211+
image = pipeline(
212+
prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
213+
negative_prompt="ugly, deformed, disfigured, poor details, bad anatomy",
214+
).images[0]
215+
```
216+
217+
<div class="flex flex-row gap-4">
218+
<div class="flex-1">
219+
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-neg-prompt-1.png"/>
220+
<figcaption class="mt-2 text-center text-sm text-gray-500">negative prompt = "ugly, deformed, disfigured, poor details, bad anatomy"</figcaption>
221+
</div>
222+
<div class="flex-1">
223+
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/text2img-neg-prompt-2.png"/>
224+
<figcaption class="mt-2 text-center text-sm text-gray-500">negative prompt = "astronaut"</figcaption>
225+
</div>
226+
</div>
227+
228+
### Generator
229+
230+
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html#generator) object is used to enable reproducibility in a pipeline by setting a manual seed. However, you can also use a `Generator` to generate batches of images and iteratively improve on an image generated from a seed as detailed in the [Improve image quality with deterministic generation](reusing_seeds) guide.
231+
232+
You can set a seed and `Generator` as shown below. Generating an image with a seeded `Generator` should return the same result each time, instead of a new random image.
233+
234+
```py
235+
from diffusers import AutoPipelineForText2Image
236+
import torch
237+
238+
pipeline = AutoPipelineForText2Image.from_pretrained(
239+
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
240+
).to("cuda")
241+
generator = torch.Generator(device="cuda").manual_seed(30)
242+
image = pipeline(
243+
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
244+
generator=generator,
245+
).images[0]
246+
```
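
To see what the seed actually fixes, the sketch below draws the initial noise directly with a seeded `torch.Generator`. It runs on the CPU (an assumption made here so the snippet works without a GPU), and the latent shape `(1, 4, 64, 64)` is an illustrative assumption matching a single 512x512 Stable Diffusion v1.5 image.

```py
import torch

shape = (1, 4, 64, 64)  # assumed latent shape for one 512x512 SD v1.5 image

# Same seed -> identical starting noise
noise_a = torch.randn(shape, generator=torch.Generator(device="cpu").manual_seed(30))
noise_b = torch.randn(shape, generator=torch.Generator(device="cpu").manual_seed(30))
print(torch.equal(noise_a, noise_b))  # True

# Different seed -> different starting noise
noise_c = torch.randn(shape, generator=torch.Generator(device="cpu").manual_seed(31))
print(torch.equal(noise_a, noise_c))  # False

# One generator per image gives every image in a batch its own reproducible seed
generators = [torch.Generator(device="cpu").manual_seed(seed) for seed in range(4)]
```

In a pipeline call, such a list would be passed as `generator=generators` together with a matching number of images (for example, `num_images_per_prompt=4`).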
247+
248+
## Control image generation
249+
250+
There are several ways to exert more control over how an image is generated outside of configuring a pipeline's parameters, such as prompt weighting and ControlNet models.
251+
252+
### Prompt weighting
253+
254+
Prompt weighting is a technique for increasing or decreasing the importance of concepts in a prompt to emphasize or minimize certain features in an image. We recommend using the [Compel](https://github.com/damian0815/compel) library to help you generate the weighted prompt embeddings.
255+
256+
<Tip>
257+
258+
Learn how to create the prompt embeddings in the [Prompt weighting](weighted_prompts) guide. This example focuses on how to use the prompt embeddings in the pipeline.
259+
260+
</Tip>
261+
262+
Once you've created the embeddings, you can pass them to the `prompt_embeds` (and `negative_prompt_embeds` if you're using a negative prompt) parameter in the pipeline.
263+
264+
```py
265+
from diffusers import AutoPipelineForText2Image
266+
import torch
267+
268+
pipeline = AutoPipelineForText2Image.from_pretrained(
269+
"runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
270+
).to("cuda")
271+
image = pipeline(
272+
prompt_embeds=prompt_embeds, # generated from Compel
273+
negative_prompt_embeds=negative_prompt_embeds, # generated from Compel
274+
).images[0]
275+
```
276+
277+
### ControlNet
278+
279+
As you saw in the [ControlNet](#controlnet) section, these models offer a more flexible and accurate way to generate images by incorporating an additional conditioning image input. Each ControlNet is pretrained on one type of conditioning image so that it generates new images that resemble the input. For example, if you take a ControlNet pretrained on depth maps, you can give the model a depth map as a conditioning input and it'll generate an image that preserves the spatial information in the depth map. This is quicker and easier than specifying the depth information in a prompt. You can even combine multiple conditioning inputs with a [MultiControlNet](controlnet#multicontrolnet)!
280+
281+
There are many types of conditioning inputs you can use, and 🤗 Diffusers supports ControlNet for Stable Diffusion and SDXL models. Take a look at the more comprehensive [ControlNet](controlnet) guide to learn how you can use these models.
282+
283+
## Optimize
284+
285+
Diffusion models are large, and the iterative nature of denoising an image is computationally expensive. But this doesn't mean you need access to powerful - or even many - GPUs to use them. There are many things you can do to run diffusion models on consumer and free-tier resources. For example, you can load model weights in half-precision to save GPU memory and increase speed, or offload the entire model to the CPU to save even more GPU memory.
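
As back-of-the-envelope arithmetic for why half-precision helps: each float32 weight takes 4 bytes and each float16 weight takes 2 bytes, so halving the precision halves the memory needed for the weights alone. The parameter count below is a round hypothetical number, not a measurement of any particular checkpoint, and the estimate ignores activations and other runtime overhead.

```py
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params, dtype):
    """Estimate the memory taken by the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


num_params = 1_000_000_000  # hypothetical parameter count for illustration

fp32 = weight_memory_gib(num_params, "float32")
fp16 = weight_memory_gib(num_params, "float16")
print(f"float32: {fp32:.2f} GiB")  # float32: 3.73 GiB
print(f"float16: {fp16:.2f} GiB")  # float16: 1.86 GiB
```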
286+
287+
PyTorch 2.0 also supports a more memory-efficient attention mechanism called *scaled dot product attention*, which is enabled automatically when you're using PyTorch 2.0. You can combine this with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) to speed your code up even more:
288+
289+
```py
290+
from diffusers import AutoPipelineForText2Image
291+
import torch
292+
293+
pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True).to("cuda")
294+
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
295+
```
54296

55-
<iframe
56-
src="https://stabilityai-stable-diffusion.hf.space"
57-
frameborder="0"
58-
width="850"
59-
height="500"
60-
></iframe>
297+
For more tips on how to optimize your code to save memory and speed up inference, read the [Memory and speed](./optimization/fp16) and [Torch 2.0](./optimization/torch2.0) guides.
