
Commit 174dcd6

[docs] Model API (huggingface#3562)
* add modelmixin and unets
* remove old model page
* minor fixes
* fix unet2dcondition
* add vqmodel and autoencoderkl
* add rest of models
* fix autoencoderkl path
* fix toctree
* fix toctree again
* apply feedback
* apply feedback
* fix copies
* fix controlnet copy
* fix copies
1 parent cdf2ae8 commit 174dcd6

30 files changed: +928 additions, −653 deletions

docs/source/en/_toctree.yml

Lines changed: 24 additions & 2 deletions
@@ -132,8 +132,6 @@
      title: Conceptual Guides
  - sections:
    - sections:
-     - local: api/models
-       title: Models
      - local: api/attnprocessor
        title: Attention Processor
      - local: api/diffusion_pipeline
@@ -151,6 +149,30 @@
      - local: api/image_processor
        title: VAE Image Processor
      title: Main Classes
+   - sections:
+     - local: api/models/overview
+       title: Overview
+     - local: api/models/unet
+       title: UNet1DModel
+     - local: api/models/unet2d
+       title: UNet2DModel
+     - local: api/models/unet2d-cond
+       title: UNet2DConditionModel
+     - local: api/models/unet3d-cond
+       title: UNet3DConditionModel
+     - local: api/models/vq
+       title: VQModel
+     - local: api/models/autoencoderkl
+       title: AutoencoderKL
+     - local: api/models/transformer2d
+       title: Transformer2D
+     - local: api/models/transformer_temporal
+       title: Transformer Temporal
+     - local: api/models/prior_transformer
+       title: Prior Transformer
+     - local: api/models/controlnet
+       title: ControlNet
+     title: Models
    - sections:
      - local: api/pipelines/overview
        title: Overview

docs/source/en/api/models.mdx

Lines changed: 0 additions & 107 deletions
This file was deleted.

docs/source/en/api/models/autoencoderkl.mdx

Lines changed: 31 additions & 0 deletions

# AutoencoderKL

The variational autoencoder (VAE) model with KL loss was introduced in [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114v11) by Diederik P. Kingma and Max Welling. The model is used in 🤗 Diffusers to encode images into latents and to decode latent representations into images.

The abstract from the paper is:

*How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.*
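
A minimal sketch of the encode/decode roundtrip described above; the checkpoint name is illustrative of any repository that ships a compatible `vae` subfolder, and the random tensor stands in for a preprocessed image.

```py
import torch
from diffusers import AutoencoderKL

# Illustrative checkpoint; any repository with a compatible "vae" subfolder works the same way.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # stand-in for a preprocessed RGB image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # AutoencoderKLOutput -> sampled latents, (1, 4, 64, 64)
    reconstruction = vae.decode(latents).sample       # DecoderOutput -> image tensor, (1, 3, 512, 512)
```
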
## AutoencoderKL

[[autodoc]] AutoencoderKL

## AutoencoderKLOutput

[[autodoc]] models.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.vae.DecoderOutput

## FlaxAutoencoderKL

[[autodoc]] FlaxAutoencoderKL

## FlaxAutoencoderKLOutput

[[autodoc]] models.vae_flax.FlaxAutoencoderKLOutput

## FlaxDecoderOutput

[[autodoc]] models.vae_flax.FlaxDecoderOutput

docs/source/en/api/models/controlnet.mdx

Lines changed: 23 additions & 0 deletions

# ControlNet

The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection.

The abstract from the paper is:

*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal device. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
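
A hedged sketch of how a ControlNet is typically paired with a text-to-image pipeline: the checkpoint names are illustrative, and the conditioning-image path is a hypothetical placeholder for a precomputed canny edge map.

```py
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Illustrative checkpoints: a canny-edge ControlNet paired with a Stable Diffusion base model.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Hypothetical path to a precomputed canny edge map used as the conditioning input.
conditioning = load_image("path/to/canny_edge_map.png")
image = pipe("a futuristic city at night", image=conditioning, num_inference_steps=20).images[0]
```
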
## ControlNetModel

[[autodoc]] ControlNetModel

## ControlNetOutput

[[autodoc]] models.controlnet.ControlNetOutput

## FlaxControlNetModel

[[autodoc]] FlaxControlNetModel

## FlaxControlNetOutput

[[autodoc]] models.controlnet_flax.FlaxControlNetOutput

docs/source/en/api/models/overview.mdx

Lines changed: 12 additions & 0 deletions

# Models

🤗 Diffusers provides pretrained models for popular algorithms and modules to create custom diffusion systems. The primary function of models is to denoise an input sample as modeled by the distribution \\(p_{\theta}(x_{t-1}|x_{t})\\).

All models are built from the base [`ModelMixin`] class which is a [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) providing basic functionality for saving and loading models, locally and from the Hugging Face Hub.
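
For example, the save/load behavior every [`ModelMixin`] subclass inherits looks roughly like the sketch below; the checkpoint name and local directory are illustrative.

```py
from diffusers import UNet2DModel

# Illustrative checkpoint; any ModelMixin subclass exposes the same methods.
model = UNet2DModel.from_pretrained("google/ddpm-cat-256")  # download config + weights from the Hub
model.save_pretrained("./my-unet")                          # write them to a local directory
model = UNet2DModel.from_pretrained("./my-unet")            # reload from the local copy
```
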
## ModelMixin

[[autodoc]] ModelMixin

## FlaxModelMixin

[[autodoc]] FlaxModelMixin

docs/source/en/api/models/prior_transformer.mdx

Lines changed: 16 additions & 0 deletions

# Prior Transformer

The Prior Transformer was originally introduced in [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) by Ramesh et al. It is used to predict CLIP image embeddings from CLIP text embeddings; image embeddings are predicted through a denoising diffusion process.

The abstract from the paper is:

*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.*
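
A minimal sketch of loading a standalone prior, assuming a checkpoint that ships a `prior` subfolder; the unCLIP/Karlo repository and the inspected config fields are used here only as an illustration.

```py
from diffusers import PriorTransformer

# Illustrative checkpoint with a "prior" subfolder.
prior = PriorTransformer.from_pretrained("kakaobrain/karlo-v1-alpha", subfolder="prior")

# The config records the CLIP embedding width and the number of text-token embeddings it attends over.
print(prior.config.embedding_dim, prior.config.num_embeddings)
```
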
## PriorTransformer

[[autodoc]] PriorTransformer

## PriorTransformerOutput

[[autodoc]] models.prior_transformer.PriorTransformerOutput

docs/source/en/api/models/transformer2d.mdx

Lines changed: 29 additions & 0 deletions

# Transformer2D

A Transformer model for image-like data from [CompVis](https://huggingface.co/CompVis) that is based on the [Vision Transformer](https://huggingface.co/papers/2010.11929) introduced by Dosovitskiy et al. The [`Transformer2DModel`] accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.

When the input is **continuous** (a short code sketch of this case follows the lists below):

1. Project the input and reshape it to `(batch_size, sequence_length, feature_dimension)`.
2. Apply the Transformer blocks in the standard way.
3. Reshape to image.

When the input is **discrete**:

<Tip>

It is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image don't contain a prediction for the masked pixel because the unnoised image cannot be masked.

</Tip>

1. Convert input (classes of latent pixels) to embeddings and apply positional embeddings.
2. Apply the Transformer blocks in the standard way.
3. Predict classes of unnoised image.
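
A minimal sketch of the continuous path, using a deliberately tiny configuration chosen only for illustration (none of these values come from a released checkpoint):

```py
import torch
from diffusers import Transformer2DModel

# Tiny illustrative configuration for continuous (image-like) inputs.
model = Transformer2DModel(
    num_attention_heads=2,
    attention_head_dim=8,
    in_channels=4,
    norm_num_groups=4,  # must divide in_channels for the GroupNorm projection
    num_layers=1,
)

latents = torch.randn(1, 4, 16, 16)  # continuous image-like input
out = model(latents).sample          # projected, transformed, and reshaped back to (1, 4, 16, 16)
```
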
## Transformer2DModel

[[autodoc]] Transformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.transformer_2d.Transformer2DModelOutput

docs/source/en/api/models/transformer_temporal.mdx

Lines changed: 11 additions & 0 deletions

# Transformer Temporal

A Transformer model for video-like data.

## TransformerTemporalModel

[[autodoc]] models.transformer_temporal.TransformerTemporalModel

## TransformerTemporalModelOutput

[[autodoc]] models.transformer_temporal.TransformerTemporalModelOutput

docs/source/en/api/models/unet.mdx

Lines changed: 13 additions & 0 deletions

# UNet1DModel

The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 1D UNet model.

The abstract from the paper is:

*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
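
A minimal sketch, assuming a Dance Diffusion style checkpoint that stores its 1D UNet in a `unet` subfolder; the repository name is illustrative and the random tensor stands in for a noisy audio-like sequence.

```py
import torch
from diffusers import UNet1DModel

# Illustrative checkpoint; this 1D UNet operates on raw audio-like sequences.
unet = UNet1DModel.from_pretrained("harmonai/maestro-150k", subfolder="unet")

sample = torch.randn(1, unet.config.in_channels, unet.config.sample_size)  # noisy 1D sample
with torch.no_grad():
    out = unet(sample, timestep=10).sample  # UNet1DOutput -> predicted sample of the same shape
```
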
## UNet1DModel

[[autodoc]] UNet1DModel

## UNet1DOutput

[[autodoc]] models.unet_1d.UNet1DOutput

docs/source/en/api/models/unet2d-cond.mdx

Lines changed: 19 additions & 0 deletions

# UNet2DConditionModel

The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 2D UNet conditional model.

The abstract from the paper is:

*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
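
A minimal sketch of one denoising step through the conditional UNet; the checkpoint name is illustrative, and random tensors stand in for the noisy latents and text-encoder output a pipeline would normally supply.

```py
import torch
from diffusers import UNet2DConditionModel

# Illustrative checkpoint; Stable Diffusion stores its conditional UNet in a "unet" subfolder.
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

latents = torch.randn(1, unet.config.in_channels, 64, 64)              # noisy latents
text_embeddings = torch.randn(1, 77, unet.config.cross_attention_dim)  # placeholder text encoder output
with torch.no_grad():
    noise_pred = unet(latents, timestep=10, encoder_hidden_states=text_embeddings).sample
```
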
## UNet2DConditionModel

[[autodoc]] UNet2DConditionModel

## UNet2DConditionOutput

[[autodoc]] models.unet_2d_condition.UNet2DConditionOutput

## FlaxUNet2DConditionModel

[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionModel

## FlaxUNet2DConditionOutput

[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionOutput

0 commit comments
