
Commit 784107b

Merge branch 'main' of github.com:movelikeriver/diffusers
2 parents: 46743ce + 723c79c


49 files changed (+1398 / -704 lines)
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
name: Run dependency tests

on:
  pull_request:
    branches:
      - main
  push:
    branches:
      - main

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  check_dependencies:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: "3.7"
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -e .
        pip install pytest
    - name: Check for soft dependencies
      run: |
        pytest tests/others/test_dependencies.py
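For context, a soft-dependency check of this kind usually asserts that the base package imports cleanly without its optional backends. Below is a minimal, hypothetical sketch; the real tests/others/test_dependencies.py invoked above is not part of this diff.

```python
# Hypothetical sketch of a soft-dependency test (not the file this workflow runs).
import importlib


def test_import_without_optional_backends():
    # Importing diffusers itself should never require optional dependencies
    # such as transformers, flax, or onnxruntime.
    diffusers = importlib.import_module("diffusers")
    assert hasattr(diffusers, "__version__")
```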

.github/workflows/pr_tests.yml

Lines changed: 1 addition & 1 deletion
@@ -81,7 +81,7 @@ jobs:
        if: ${{ matrix.config.framework == 'pytorch_models' }}
        run: |
          python -m pytest -n 2 --max-worker-restart=0 --dist=loadfile \
-           -s -v -k "not Flax and not Onnx" \
+           -s -v -k "not Flax and not Onnx and not Dependency" \
            --make-reports=tests_${{ matrix.config.report }} \
            tests/models tests/schedulers tests/others

docs/source/en/_toctree.yml

Lines changed: 24 additions & 2 deletions
@@ -132,8 +132,6 @@
      title: Conceptual Guides
  - sections:
    - sections:
-     - local: api/models
-       title: Models
      - local: api/attnprocessor
        title: Attention Processor
      - local: api/diffusion_pipeline
@@ -151,6 +149,30 @@
      - local: api/image_processor
        title: VAE Image Processor
      title: Main Classes
+   - sections:
+     - local: api/models/overview
+       title: Overview
+     - local: api/models/unet
+       title: UNet1DModel
+     - local: api/models/unet2d
+       title: UNet2DModel
+     - local: api/models/unet2d-cond
+       title: UNet2DConditionModel
+     - local: api/models/unet3d-cond
+       title: UNet3DConditionModel
+     - local: api/models/vq
+       title: VQModel
+     - local: api/models/autoencoderkl
+       title: AutoencoderKL
+     - local: api/models/transformer2d
+       title: Transformer2D
+     - local: api/models/transformer_temporal
+       title: Transformer Temporal
+     - local: api/models/prior_transformer
+       title: Prior Transformer
+     - local: api/models/controlnet
+       title: ControlNet
+     title: Models
    - sections:
      - local: api/pipelines/overview
        title: Overview

docs/source/en/api/models.mdx

Lines changed: 0 additions & 107 deletions
This file was deleted.
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
# AutoencoderKL

The variational autoencoder (VAE) model with KL loss was introduced in [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114v11) by Diederik P. Kingma and Max Welling. The model is used in 🤗 Diffusers to encode images into latents and to decode latent representations into images.

The abstract from the paper is:

*How can we perform efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions, and large datasets? We introduce a stochastic variational inference and learning algorithm that scales to large datasets and, under some mild differentiability conditions, even works in the intractable case. Our contributions are two-fold. First, we show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods. Second, we show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model (also called a recognition model) to the intractable posterior using the proposed lower bound estimator. Theoretical advantages are reflected in experimental results.*

## AutoencoderKL

[[autodoc]] AutoencoderKL

## AutoencoderKLOutput

[[autodoc]] models.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.vae.DecoderOutput

## FlaxAutoencoderKL

[[autodoc]] FlaxAutoencoderKL

## FlaxAutoencoderKLOutput

[[autodoc]] models.vae_flax.FlaxAutoencoderKLOutput

## FlaxDecoderOutput

[[autodoc]] models.vae_flax.FlaxDecoderOutput
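For orientation, an encode/decode round trip with this class might look like the following sketch; the checkpoint id is illustrative and not taken from this commit.

```python
# Minimal AutoencoderKL round-trip sketch; the checkpoint id is an example.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

# Dummy batch of images scaled to [-1, 1], shape (batch, channels, height, width).
images = torch.randn(1, 3, 512, 512)

with torch.no_grad():
    # Encode to a latent distribution and sample latents from it.
    latents = vae.encode(images).latent_dist.sample()
    # Decode the latents back into image space.
    reconstruction = vae.decode(latents).sample

print(latents.shape, reconstruction.shape)
```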
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
# ControlNet

The ControlNet model was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala. It provides a greater degree of control over text-to-image generation by conditioning the model on additional inputs such as edge maps, depth maps, segmentation maps, and keypoints for pose detection.

The abstract from the paper is:

*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal device. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.*

## ControlNetModel

[[autodoc]] ControlNetModel

## ControlNetOutput

[[autodoc]] models.controlnet.ControlNetOutput

## FlaxControlNetModel

[[autodoc]] FlaxControlNetModel

## FlaxControlNetOutput

[[autodoc]] models.controlnet_flax.FlaxControlNetOutput
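As a rough usage sketch, a ControlNet is typically paired with a Stable Diffusion pipeline and a conditioning image such as a Canny edge map; the model ids and the blank placeholder image below are illustrative, not part of this commit.

```python
# Illustrative ControlNet sketch; model ids and the conditioning image are placeholders.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# In practice this would be a real Canny edge map of the desired layout.
edge_map = Image.new("RGB", (512, 512))

image = pipe("a futuristic city at night", image=edge_map, num_inference_steps=20).images[0]
image.save("controlnet_out.png")
```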
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
# Models

🤗 Diffusers provides pretrained models for popular algorithms and modules to create custom diffusion systems. The primary function of models is to denoise an input sample as modeled by the distribution \\(p_{\theta}(x_{t-1}|x_{t})\\).

All models are built from the base [`ModelMixin`] class, which is a [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) providing basic functionality for saving and loading models, locally and from the Hugging Face Hub.

## ModelMixin

[[autodoc]] ModelMixin

## FlaxModelMixin

[[autodoc]] FlaxModelMixin
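To make the save/load behavior concrete, here is a short sketch using one of the model classes; the Hub repository id is illustrative.

```python
# Sketch of the saving/loading that ModelMixin subclasses provide; the repo id is an example.
from diffusers import UNet2DModel

# Download a pretrained model from the Hugging Face Hub.
model = UNet2DModel.from_pretrained("google/ddpm-cat-256")

# Save it locally, then reload it from disk.
model.save_pretrained("./ddpm-cat-256-local")
model = UNet2DModel.from_pretrained("./ddpm-cat-256-local")
```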
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# Prior Transformer

The Prior Transformer was originally introduced in [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) by Ramesh et al. It is used to predict CLIP image embeddings from CLIP text embeddings; image embeddings are predicted through a denoising diffusion process.

The abstract from the paper is:

*Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.*

## PriorTransformer

[[autodoc]] PriorTransformer

## PriorTransformerOutput

[[autodoc]] models.prior_transformer.PriorTransformerOutput
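A small loading sketch, assuming a Hub checkpoint that ships the prior in a dedicated subfolder (for example the Kandinsky prior); the repository id is an assumption and not part of this commit.

```python
# Hypothetical loading sketch; the repository id and subfolder are assumptions.
from diffusers import PriorTransformer

prior = PriorTransformer.from_pretrained(
    "kandinsky-community/kandinsky-2-1-prior", subfolder="prior"
)

# The model denoises CLIP image embeddings conditioned on CLIP text embeddings,
# so its working width matches the CLIP embedding dimension.
print(prior.config.embedding_dim)
```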
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# Transformer2D

A Transformer model for image-like data from [CompVis](https://huggingface.co/CompVis) that is based on the [Vision Transformer](https://huggingface.co/papers/2010.11929) introduced by Dosovitskiy et al. The [`Transformer2DModel`] accepts discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.

When the input is **continuous**:

1. Project the input and reshape it to `(batch_size, sequence_length, feature_dimension)`.
2. Apply the Transformer blocks in the standard way.
3. Reshape it back into an image.

When the input is **discrete**:

<Tip>

It is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image don't contain a prediction for the masked pixel because the unnoised image cannot be masked.

</Tip>

1. Convert the input (classes of latent pixels) to embeddings and apply positional embeddings.
2. Apply the Transformer blocks in the standard way.
3. Predict classes of the unnoised image.

## Transformer2DModel

[[autodoc]] Transformer2DModel

## Transformer2DModelOutput

[[autodoc]] models.transformer_2d.Transformer2DModelOutput
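To illustrate the continuous path described above, a tiny randomly initialized model can be run on an image-like feature map; the layer sizes below are arbitrary and chosen only for illustration.

```python
# Sketch of the continuous-input path; sizes are arbitrary and not from this commit.
import torch
from diffusers import Transformer2DModel

model = Transformer2DModel(
    num_attention_heads=2,
    attention_head_dim=16,
    in_channels=32,
    num_layers=1,
)

# Continuous input: an image-like feature map of shape (batch, channels, height, width).
hidden_states = torch.randn(1, 32, 8, 8)

with torch.no_grad():
    out = model(hidden_states).sample

print(out.shape)  # expected to match the input shape, torch.Size([1, 32, 8, 8])
```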
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
# Transformer Temporal

A Transformer model for video-like data.

## TransformerTemporalModel

[[autodoc]] models.transformer_temporal.TransformerTemporalModel

## TransformerTemporalModelOutput

[[autodoc]] models.transformer_temporal.TransformerTemporalModelOutput
