@@ -10,51 +10,288 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
10
10
specific language governing permissions and limitations under the License.
11
11
-->
12
12
13
-
# Conditional image generation
13
+
# Text-to-image
14
14
15
15
[[open-in-colab]]
16
16
17
-
Conditional image generation allows you to generate images from a text prompt. The text is converted into embeddings which are used to condition the model to generate an image from noise.
17
+
Text-to-image generates an image from a text description, also known as a *prompt* (for example, "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"). At a high level, a latent diffusion model takes a prompt and some random initial noise, and iteratively removes the noise to construct an image. The prompt guides this *denoising* process, and once it ends after a predetermined number of timesteps, the latent image representation is decoded into an image.
18
18
19
-
The [`DiffusionPipeline`] is the easiest way to use a pre-trained diffusion system for inference.
19
+
<Tip>
20
20
21
-
Start by creating an instance of [`DiffusionPipeline`] and specify which pipeline [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) you would like to download.
21
+
Read the [How does Stable Diffusion work?](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work) blog post to learn more details about how the model works.
22
22
23
-
In this guide, you'll use [`DiffusionPipeline`] for text-to-image generation with [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5):
1. Load a checkpoint into the [`AutoPipelineForText2Image`] class, which automatically detects the appropriate pipeline class to use based on the checkpoint:
The most common text-to-image models are [Stable Diffusion v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [Stable Diffusion XL (SDXL)](sdxl), Kandinsky 2.2, and [ControlNet](controlnet). The results from each model differ slightly because of their architectures and training processes, but whichever model you choose, the usage is more or less the same.
52
+
53
+
### Stable Diffusion v1.5
54
+
55
+
Stable Diffusion v1.5 is a latent diffusion model initialized from an earlier checkpoint, and finetuned for 595K steps on 512x512 images from the LAION-Aesthetics V2 dataset. You can use this model like:
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").images[0]
65
+
```
66
+
67
+
### Stable Diffusion XL
68
+
69
+
SDXL is a much larger version of the previous Stable Diffusion models, and involves a two-stage process that adds even more detail to an image. It also includes some additional *micro-conditionings* to generate high-quality centered images. Take a look at the more comprehensive [SDXL](sdxl) guide to learn more about how to use it. In general, you can use SDXL like:
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").images[0]
29
79
```
30
80
31
-
The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components.
32
-
Because the model consists of roughly 1.4 billion parameters, we strongly recommend running it on a GPU.
33
-
You can move the generator object to a GPU, just like you would in PyTorch:
81
+
### Kandinsky 2.2
34
82
35
-
```python
36
-
>>> generator.to("cuda")
83
+
The Kandinsky model is a bit different from the Stable Diffusion models because it also uses an image prior model to create embeddings that are used to better align text and images in the diffusion model. Take a look at the more comprehensive Kandinsky guide to learn more about how to use it. The easiest way to use Kandinsky 2.2 is:
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").images[0]
37
93
```
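Since the usage of these three models is described as more or less the same, the sketch below loads any of them through one helper. It is a minimal illustration, not the guide's own code: the Hub checkpoint IDs for SDXL and Kandinsky 2.2, the `CHECKPOINTS` mapping, and the helper names `load_pipeline` and `text_to_image` are assumptions introduced here, and a CUDA GPU is assumed.

```python
# Hub checkpoint IDs are assumptions for illustration; check each model card.
CHECKPOINTS = {
    "sd15": "runwayml/stable-diffusion-v1-5",
    "sdxl": "stabilityai/stable-diffusion-xl-base-1.0",
    "kandinsky22": "kandinsky-community/kandinsky-2-2-decoder",
}

PROMPT = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"


def load_pipeline(model: str, device: str = "cuda"):
    """Load a text-to-image pipeline; the Auto class picks the pipeline class."""
    # Imported lazily so the helpers can be defined without the heavy dependencies.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipeline = AutoPipelineForText2Image.from_pretrained(
        CHECKPOINTS[model], torch_dtype=torch.float16
    )
    return pipeline.to(device)


def text_to_image(model: str, prompt: str = PROMPT):
    """Generate a single PIL image from a prompt with the chosen model."""
    pipeline = load_pipeline(model)
    return pipeline(prompt).images[0]
```

Calling `text_to_image("sdxl")` would download the checkpoint on first use, so expect the first call to be slow.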
38
94
39
-
Now you can use the `generator` on your text prompt:
95
+
### ControlNet
96
+
97
+
ControlNet models offer a diverse set of options for more explicit control over a generated image. With a ControlNet, you add an additional conditioning input image to the model. For example, if you provide an image of a human pose (usually represented as multiple keypoints that are connected into a skeleton) as a conditioning input, the model generates an image that follows the pose in that image. Check out the more in-depth [ControlNet](controlnet) guide to learn more about other conditioning inputs and how to use them.
98
+
99
+
In this example, let's condition the ControlNet with a human pose estimation image. Load the ControlNet model pretrained on human pose estimations:
100
+
101
+
```py
102
+
from diffusers import ControlNetModel, AutoPipelineForText2Image
103
+
import torch
40
104
41
-
```python
42
-
>>> image = generator("An image of a squirrel in Picasso style").images[0]
The output is by default wrapped into a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object.
111
+
Pass the `controlnet` to the [`AutoPipelineForText2Image`], and provide the prompt and pose estimation image to control generation:
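The two steps described above might be sketched as follows. The checkpoint IDs, the helper name `generate_with_pose`, and the seed are assumptions for illustration; the pose image is supplied by the caller as a local path or URL, and a CUDA GPU is assumed.

```python
# Checkpoint IDs are assumptions for illustration; check the model cards.
CONTROLNET_ID = "lllyasviel/control_v11p_sd15_openpose"
BASE_ID = "runwayml/stable-diffusion-v1-5"


def generate_with_pose(prompt: str, pose_image_path: str, seed: int = 31):
    """Generate an image that follows the pose in `pose_image_path`."""
    # Imported lazily so the helper can be defined without the heavy dependencies.
    import torch
    from diffusers import AutoPipelineForText2Image, ControlNetModel
    from diffusers.utils import load_image

    # Load the ControlNet pretrained on human pose estimations.
    controlnet = ControlNetModel.from_pretrained(
        CONTROLNET_ID, torch_dtype=torch.float16
    )
    pose_image = load_image(pose_image_path)  # accepts a local path or a URL

    # Passing `controlnet` makes the Auto class pick a ControlNet pipeline.
    pipeline = AutoPipelineForText2Image.from_pretrained(
        BASE_ID, controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    generator = torch.Generator("cuda").manual_seed(seed)
    return pipeline(prompt, image=pose_image, generator=generator).images[0]
```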
There are a number of parameters that can be configured in the pipeline that affect how an image is generated. You can change the image's output size, specify a negative prompt to improve image quality, and more. This section dives deeper into how to use these parameters.
48
142
49
-
```python
50
-
>>> image.save("image_of_squirrel_painting.png")
143
+
### Height and width
144
+
145
+
The `height` and `width` parameters control the height and width in pixels of the generated image. By default, the Stable Diffusion v1.5 model outputs 512x512 images, but you can change this to any size that is a multiple of 8. For example, to create a rectangular image:
Other models may have different default image sizes depending on the image sizes in the training dataset. For example, SDXL's default image size is 1024x1024 and using lower `height` and `width` values may result in lower quality images. Make sure you check the model's API reference first!
166
+
167
+
</Tip>
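As a rough sketch of setting `height` and `width`, the helper below snaps the requested size down to a multiple of 8 before calling the pipeline, on the assumption that Stable Diffusion pipelines reject other sizes. The names `snap_to_multiple` and `generate_rectangular` are hypothetical, and `pipeline` is any already-loaded text-to-image pipeline.

```python
def snap_to_multiple(value: int, multiple: int = 8) -> int:
    """Round `value` down to the nearest multiple, but never below one multiple."""
    return max(multiple, (value // multiple) * multiple)


def generate_rectangular(pipeline, prompt: str, height: int = 768, width: int = 512):
    """Generate a non-square image with dimensions the pipeline accepts."""
    return pipeline(
        prompt,
        height=snap_to_multiple(height),
        width=snap_to_multiple(width),
    ).images[0]
```

Rounding down rather than up keeps memory use at or below what was requested, which matters on small GPUs.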
168
+
169
+
### Guidance scale
170
+
171
+
The `guidance_scale` parameter determines how important the prompt is in guiding image generation. A lower value gives the model more "creativity" to generate images that are more loosely related to the prompt. Higher `guidance_scale` values push the model to follow the prompt more closely, and if this value is too high, you may observe some artifacts in the generated image.
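To make the effect of `guidance_scale` concrete, here is a small sketch of the classifier-free guidance combination that Stable Diffusion style pipelines are generally understood to apply at each denoising step. It uses plain Python lists in place of tensors, and the function name `apply_guidance` is an assumption for illustration.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and prompt-conditioned noise predictions.

    A scale of 1.0 returns the prompt-conditioned prediction unchanged;
    larger values push the result further in the direction of the prompt.
    """
    return [
        uncond + guidance_scale * (text - uncond)
        for uncond, text in zip(noise_uncond, noise_text)
    ]
```

In practice you only pass the value to the pipeline, for example `pipeline(prompt, guidance_scale=3.5)`; the combination itself happens inside the denoising loop.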
Just like how a prompt guides generation, a *negative prompt* steers the model away from things you don't want it to generate. This is commonly used to improve overall image quality by removing poor image features such as "low resolution" or "bad details". You can also use a negative prompt to remove or modify the content and style of an image.
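A minimal sketch of passing a negative prompt, assuming `pipeline` is an already-loaded text-to-image pipeline. The helper name and the default negative prompt text are illustrative choices, not fixed values.

```python
NEGATIVE_PROMPT = "ugly, deformed, disfigured, poor details, bad anatomy"


def generate_with_negative_prompt(pipeline, prompt: str, negative_prompt: str = NEGATIVE_PROMPT):
    """Generate an image while steering away from the listed features."""
    return pipeline(prompt=prompt, negative_prompt=negative_prompt).images[0]
```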
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html#generator) object is used to enable reproducibility in a pipeline by setting a manual seed. However, you can also use a `Generator` to generate batches of images and iteratively improve on an image generated from a seed as detailed in the [Improve image quality with deterministic generation](reusing_seeds) guide.
231
+
232
+
You can set a seed and `Generator` as shown below. Generating an image with a seeded `Generator` should return the same result each time, instead of a new random image.
"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
244
+
generator=generator,
245
+
).images[0]
246
+
```
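For a self-contained view of the seeding step, the sketch below builds the `Generator` in a helper and passes it to the pipeline. The helper names and the default seed are assumptions; `pipeline` is any already-loaded text-to-image pipeline, and the generator's device should normally match the pipeline's.

```python
def make_generator(seed: int, device: str = "cpu"):
    """Create a torch.Generator with a fixed seed on the given device."""
    # Imported lazily so the helper can be defined without PyTorch installed.
    import torch

    return torch.Generator(device=device).manual_seed(seed)


def generate_seeded(pipeline, prompt: str, seed: int = 31, device: str = "cuda"):
    """Generate an image whose initial noise is determined by `seed`."""
    generator = make_generator(seed, device)
    return pipeline(prompt, generator=generator).images[0]
```

Create a fresh `Generator` for every call you want to reproduce, because sampling from a generator advances its internal state.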
247
+
248
+
## Control image generation
249
+
250
+
There are several ways to exert more control over how an image is generated outside of configuring a pipeline's parameters, such as prompt weighting and ControlNet models.
251
+
252
+
### Prompt weighting
253
+
254
+
Prompt weighting is a technique for increasing or decreasing the importance of concepts in a prompt to emphasize or minimize certain features in an image. We recommend using the [Compel](https://github.com/damian0815/compel) library to help you generate the weighted prompt embeddings.
255
+
256
+
<Tip>
257
+
258
+
Learn how to create the prompt embeddings in the [Prompt weighting](weighted_prompts) guide. This example focuses on how to use the prompt embeddings in the pipeline.
259
+
260
+
</Tip>
261
+
262
+
Once you've created the embeddings, you can pass them to the `prompt_embeds` (and `negative_prompt_embeds` if you're using a negative prompt) parameter in the pipeline.
prompt_embeds=prompt_embeds, # generated from Compel
273
+
negative_prompt_embeds=negative_prompt_embeds, # generated from Compel
274
+
).images[0]
275
+
```
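For context on where the embeddings come from, here is a rough sketch that builds them with Compel and passes them to the pipeline. It assumes a Stable Diffusion v1.5 style pipeline with a single `tokenizer` and `text_encoder`, and that Compel's `+` and `-` suffixes raise and lower a concept's weight. The helper names are hypothetical.

```python
def emphasize(concept: str, level: int) -> str:
    """Append Compel-style markers: '+' upweights, '-' downweights."""
    marker = "+" if level >= 0 else "-"
    # Multi-word concepts are grouped in parentheses so the marker covers all words.
    text = f"({concept})" if " " in concept else concept
    return text + marker * abs(level)


def embed_prompt(pipeline, prompt: str):
    """Turn a weighted prompt string into prompt embeddings with Compel."""
    from compel import Compel  # imported lazily; requires the compel package

    compel = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)
    return compel(prompt)


def generate_weighted(pipeline, prompt: str, negative_prompt: str = ""):
    """Generate an image from weighted prompt embeddings."""
    return pipeline(
        prompt_embeds=embed_prompt(pipeline, prompt),
        negative_prompt_embeds=embed_prompt(pipeline, negative_prompt),
    ).images[0]
```

For example, `emphasize("ball", 2)` yields `"ball++"`, which can be embedded inside a longer prompt string before calling `generate_weighted`.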
276
+
277
+
### ControlNet
278
+
279
+
As you saw in the [ControlNet](#controlnet) section, these models offer a more flexible and accurate way to generate images by incorporating an additional conditioning image input. The ControlNet is pretrained on the conditioning image input to generate new images that resemble it. For example, if you take a ControlNet pretrained on depth maps, you can give the model a depth map as a conditioning input and it'll generate an image that preserves the depth map's spatial information. This is quicker and easier than specifying the depth information in a prompt. You can even combine multiple conditioning inputs with a [MultiControlNet](controlnet#multicontrolnet)!
280
+
281
+
There are many types of conditioning inputs you can use, and 🤗 Diffusers supports ControlNet for Stable Diffusion and SDXL models. Take a look at the more comprehensive [ControlNet](controlnet) guide to learn how you can use these models.
282
+
283
+
## Optimize
284
+
285
+
Diffusion models are large, and the iterative nature of denoising an image is computationally expensive. But this doesn't mean you need access to powerful, or even many, GPUs to use them. There are many things you can do to run diffusion models on consumer and free-tier resources. For example, you can load model weights in half-precision to save GPU memory and increase speed, or offload the entire model to the CPU to save even more GPU memory.
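The two memory savers mentioned here might be sketched as follows. The helper names and the default checkpoint are assumptions for illustration, and model offloading additionally assumes the `accelerate` package and a CUDA GPU are available.

```python
DTYPE_BYTES = {"float32": 4, "float16": 2}


def weight_bytes(num_parameters: int, dtype: str) -> int:
    """Estimate the memory taken by the weights alone for a given dtype."""
    return num_parameters * DTYPE_BYTES[dtype]


def load_memory_efficient(checkpoint: str = "runwayml/stable-diffusion-v1-5"):
    """Load a pipeline in half precision with model offloading enabled."""
    # Imported lazily so the helpers can be defined without the heavy dependencies.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipeline = AutoPipelineForText2Image.from_pretrained(
        checkpoint, torch_dtype=torch.float16
    )
    # Offloading moves each model to the GPU only while it is needed,
    # so the pipeline is not moved to "cuda" manually here.
    pipeline.enable_model_cpu_offload()
    return pipeline
```

Half precision halves the weight footprint (`weight_bytes(n, "float16")` versus `weight_bytes(n, "float32")`), while offloading trades some speed for a lower peak GPU memory.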
286
+
287
+
PyTorch 2.0 also supports a more memory-efficient attention mechanism called *scaled dot product attention*, which is enabled automatically when you use PyTorch 2.0. You can combine this with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) to speed your code up even more:
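One way to apply `torch.compile` is to wrap the pipeline's UNet, which runs at every denoising step. This is a sketch under assumptions: the helper name `compile_unet` and its optional `compile_fn` argument (added only so the function can be exercised without PyTorch) are hypothetical, and the `mode` and `fullgraph` settings are one reasonable choice rather than a requirement.

```python
def compile_unet(pipeline, compile_fn=None):
    """Replace `pipeline.unet` with a compiled version and return the pipeline."""
    if compile_fn is None:
        # Imported lazily; requires PyTorch 2.0 or newer.
        import torch

        compile_fn = torch.compile
    pipeline.unet = compile_fn(pipeline.unet, mode="reduce-overhead", fullgraph=True)
    return pipeline
```

The first generation after compiling is slow because compilation happens on that call; later calls with the same input shapes reuse the compiled graph.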
For more tips on how to optimize your code to save memory and speed up inference, read the [Memory and speed](./optimization/fp16) and [Torch 2.0](./optimization/torch2.0) guides.