
Commit 174dcd6

[docs] Model API (huggingface#3562)
* add modelmixin and unets
* remove old model page
* minor fixes
* fix unet2dcondition
* add vqmodel and autoencoderkl
* add rest of models
* fix autoencoderkl path
* fix toctree
* fix toctree again
* apply feedback
* apply feedback
* fix copies
* fix controlnet copy
* fix copies
1 parent cdf2ae8 commit 174dcd6

30 files changed: +928 additions, −653 deletions

docs/source/en/_toctree.yml

Lines changed: 24 additions & 2 deletions
@@ -132,8 +132,6 @@
      title: Conceptual Guides
  - sections:
    - sections:
-     - local: api/models
-       title: Models
      - local: api/attnprocessor
        title: Attention Processor
      - local: api/diffusion_pipeline
@@ -151,6 +149,30 @@
      - local: api/image_processor
        title: VAE Image Processor
      title: Main Classes
+   - sections:
+     - local: api/models/overview
+       title: Overview
+     - local: api/models/unet
+       title: UNet1DModel
+     - local: api/models/unet2d
+       title: UNet2DModel
+     - local: api/models/unet2d-cond
+       title: UNet2DConditionModel
+     - local: api/models/unet3d-cond
+       title: UNet3DConditionModel
+     - local: api/models/vq
+       title: VQModel
+     - local: api/models/autoencoderkl
+       title: AutoencoderKL
+     - local: api/models/transformer2d
+       title: Transformer2D
+     - local: api/models/transformer_temporal
+       title: Transformer Temporal
+     - local: api/models/prior_transformer
+       title: Prior Transformer
+     - local: api/models/controlnet
+       title: ControlNet
+     title: Models
    - sections:
      - local: api/pipelines/overview
        title: Overview

docs/source/en/api/models.mdx

Lines changed: 0 additions & 107 deletions
This file was deleted.

docs/source/en/api/models/autoencoderkl.mdx

Lines changed: 31 additions & 0 deletions

# AutoencoderKL

The variational autoencoder (VAE) model with KL loss was introduced in [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114v11) by Diederik P. Kingma and Max Welling. The model is used in 🤗 Diffusers to encode images into latents and to decode latent representations into images.

The abstract from the paper is:

*How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.*
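
A minimal sketch of the encode/decode roundtrip described above; the checkpoint name is illustrative of any repository that ships a compatible `vae` subfolder, and the random tensor stands in for a preprocessed image.

```py
import torch
from diffusers import AutoencoderKL

# Illustrative checkpoint; any repository with a compatible "vae" subfolder works the same way.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # stand-in for a preprocessed RGB image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()  # AutoencoderKLOutput -> sampled latents, (1, 4, 64, 64)
    reconstruction = vae.decode(latents).sample       # DecoderOutput -> image tensor, (1, 3, 512, 512)
```
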
## AutoencoderKL

[[autodoc]] AutoencoderKL

## AutoencoderKLOutput

[[autodoc]] models.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.vae.DecoderOutput

## FlaxAutoencoderKL

[[autodoc]] FlaxAutoencoderKL

## FlaxAutoencoderKLOutput

[[autodoc]] models.vae_flax.FlaxAutoencoderKLOutput

## FlaxDecoderOutput

[[autodoc]] models.vae_flax.FlaxDecoderOutput

docs/source/en/api/models/controlnet.mdx

Lines changed: 23 additions & 0 deletions

# ControlNet

The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection.

The abstract from the paper is:

*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal device. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*
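
A hedged sketch of how a ControlNet is typically paired with a text-to-image pipeline: the checkpoint names are illustrative, and the conditioning-image path is a hypothetical placeholder for a precomputed canny edge map.

```py
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Illustrative checkpoints: a canny-edge ControlNet paired with a Stable Diffusion base model.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Hypothetical path to a precomputed canny edge map used as the conditioning input.
conditioning = load_image("path/to/canny_edge_map.png")
image = pipe("a futuristic city at night", image=conditioning, num_inference_steps=20).images[0]
```
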
## ControlNetModel

[[autodoc]] ControlNetModel

## ControlNetOutput

[[autodoc]] models.controlnet.ControlNetOutput

## FlaxControlNetModel

[[autodoc]] FlaxControlNetModel

## FlaxControlNetOutput

[[autodoc]] models.controlnet_flax.FlaxControlNetOutput

docs/source/en/api/models/overview.mdx

Lines changed: 12 additions & 0 deletions

# Models

🤗 Diffusers provides pretrained models for popular algorithms and modules to create custom diffusion systems. The primary function of models is to denoise an input sample as modeled by the distribution \\(p_{\theta}(x_{t-1}|x_{t})\\).

All models are built from the base [`ModelMixin`] class which is a [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) providing basic functionality for saving and loading models, locally and from the Hugging Face Hub.
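
For example, the save/load behavior every [`ModelMixin`] subclass inherits looks roughly like the sketch below; the checkpoint name and local directory are illustrative.

```py
from diffusers import UNet2DModel

# Illustrative checkpoint; any ModelMixin subclass exposes the same methods.
model = UNet2DModel.from_pretrained("google/ddpm-cat-256")  # download config + weights from the Hub
model.save_pretrained("./my-unet")                          # write them to a local directory
model = UNet2DModel.from_pretrained("./my-unet")            # reload from the local copy
```
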
## ModelMixin

[[autodoc]] ModelMixin

## FlaxModelMixin

[[autodoc]] FlaxModelMixin

docs/source/en/api/models/prior_transformer.mdx

Lines changed: 16 additions & 0 deletions

# Prior Transformer

The Prior Transformer was originally introduced in [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) by Ramesh et al. It is used to predict CLIP image embeddings from CLIP text embeddings; image embeddings are predicted through a denoising diffusion process.

The abstract from the paper is:

*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.*
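
A minimal sketch of loading a standalone prior, assuming a checkpoint that ships a `prior` subfolder; the unCLIP/Karlo repository and the inspected config fields are used here only as an illustration.

```py
from diffusers import PriorTransformer

# Illustrative checkpoint with a "prior" subfolder.
prior = PriorTransformer.from_pretrained("kakaobrain/karlo-v1-alpha", subfolder="prior")

# The config records the CLIP embedding width and the number of text-token embeddings it attends over.
print(prior.config.embedding_dim, prior.config.num_embeddings)
```
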
## PriorTransformer

[[autodoc]] PriorTransformer

## PriorTransformerOutput

[[autodoc]] models.prior_transformer.PriorTransformerOutput

docs/source/en/api/models/transformer2d.mdx

Lines changed: 29 additions & 0 deletions

# Transformer2D

A Transformer model for image-like data from [CompVis](https://huggingface.co/CompVis) that is based on the [Vision Transformer](https://huggingface.co/papers/2010.11929) introduced by Dosovitskiy et al. The [`Transformer2DModel`] accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.

When the input is **continuous** (a short code sketch of this case follows the lists below):

1. Project the input and reshape it to `(batch_size, sequence_length, feature_dimension)`.
2. Apply the Transformer blocks in the standard way.
3. Reshape to image.

When the input is **discrete**:

<Tip>

It is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image don't contain a prediction for the masked pixel because the unnoised image cannot be masked.

</Tip>

1. Convert input (classes of latent pixels) to embeddings and apply positional embeddings.
2. Apply the Transformer blocks in the standard way.
3. Predict classes of unnoised image.
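
A minimal sketch of the continuous path, using a deliberately tiny configuration chosen only for illustration (none of these values come from a released checkpoint):

```py
import torch
from diffusers import Transformer2DModel

# Tiny illustrative configuration for continuous (image-like) inputs.
model = Transformer2DModel(
    num_attention_heads=2,
    attention_head_dim=8,
    in_channels=4,
    norm_num_groups=4,  # must divide in_channels for the GroupNorm projection
    num_layers=1,
)

latents = torch.randn(1, 4, 16, 16)  # continuous image-like input
out = model(latents).sample          # projected, transformed, and reshaped back to (1, 4, 16, 16)
```
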
## Transformer2DModel

[[autodoc]] Transformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.transformer_2d.Transformer2DModelOutput

docs/source/en/api/models/transformer_temporal.mdx

Lines changed: 11 additions & 0 deletions

# Transformer Temporal

A Transformer model for video-like data.

## TransformerTemporalModel

[[autodoc]] models.transformer_temporal.TransformerTemporalModel

## TransformerTemporalModelOutput

[[autodoc]] models.transformer_temporal.TransformerTemporalModelOutput

docs/source/en/api/models/unet.mdx

Lines changed: 13 additions & 0 deletions

# UNet1DModel

The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 1D UNet model.

The abstract from the paper is:

*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
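
A minimal sketch, assuming a Dance Diffusion style checkpoint that stores its 1D UNet in a `unet` subfolder; the repository name is illustrative and the random tensor stands in for a noisy audio-like sequence.

```py
import torch
from diffusers import UNet1DModel

# Illustrative checkpoint; this 1D UNet operates on raw audio-like sequences.
unet = UNet1DModel.from_pretrained("harmonai/maestro-150k", subfolder="unet")

sample = torch.randn(1, unet.config.in_channels, unet.config.sample_size)  # noisy 1D sample
with torch.no_grad():
    out = unet(sample, timestep=10).sample  # UNet1DOutput -> predicted sample of the same shape
```
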
## UNet1DModel

[[autodoc]] UNet1DModel

## UNet1DOutput

[[autodoc]] models.unet_1d.UNet1DOutput

docs/source/en/api/models/unet2d-cond.mdx

Lines changed: 19 additions & 0 deletions

# UNet2DConditionModel

The [UNet](https://huggingface.co/papers/1505.04597) model was originally introduced by Ronneberger et al. for biomedical image segmentation, but it is also commonly used in 🤗 Diffusers because it outputs images that are the same size as the input. It is one of the most important components of a diffusion system because it facilitates the actual diffusion process. There are several variants of the UNet model in 🤗 Diffusers, depending on its number of dimensions and whether it is a conditional model or not. This is a 2D UNet conditional model.

The abstract from the paper is:

*There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.*
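
A minimal sketch of one denoising step through the conditional UNet; the checkpoint name is illustrative, and random tensors stand in for the noisy latents and text-encoder output a pipeline would normally supply.

```py
import torch
from diffusers import UNet2DConditionModel

# Illustrative checkpoint; Stable Diffusion stores its conditional UNet in a "unet" subfolder.
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

latents = torch.randn(1, unet.config.in_channels, 64, 64)              # noisy latents
text_embeddings = torch.randn(1, 77, unet.config.cross_attention_dim)  # placeholder text encoder output
with torch.no_grad():
    noise_pred = unet(latents, timestep=10, encoder_hidden_states=text_embeddings).sample
```
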
## UNet2DConditionModel

[[autodoc]] UNet2DConditionModel

## UNet2DConditionOutput

[[autodoc]] models.unet_2d_condition.UNet2DConditionOutput

## FlaxUNet2DConditionModel

[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionModel

## FlaxUNet2DConditionOutput

[[autodoc]] models.unet_2d_condition_flax.FlaxUNet2DConditionOutput

0 commit comments
