Commit 8873524

[Docs] fix: minor formatting in the Würstchen docs (huggingface#4965)
fix: minor formatting in the docs
1 parent 4191dde commit 8873524

File tree

1 file changed: +4 −2 lines changed


docs/source/en/api/pipelines/wuerstchen.md

Lines changed: 4 additions & 2 deletions
@@ -18,6 +18,7 @@ After the initial paper release, we have improved numerous things in the archite
 - Better quality
 
 We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are:
+
 - v2-base
 - v2-aesthetic
 - v2-interpolated (50% interpolation between v2-base and v2-aesthetic)
@@ -58,7 +59,7 @@ output = pipeline(
 ).images
 ```
 
-For explanation purposes, we can also initialize the two main pipelines of Würstchen individually. Würstchen consists of 3 stages: Stage C, Stage B, Stage A. They all have different jobs and work only together. When generating text-conditional images, Stage C will first generate the latents in a very compressed latent space. This is what happens in the `prior_pipeline`. Afterwards, the generated latents will be passed to Stage B, which decompresses the latents into a bigger latent space of a VQGAN. These latents can then be decoded by Stage A, which is a VQGAN, into the pixel-space. Stage B & Stage A are both encapsulated in the `decoder_pipeline`. For more details, take a look the [paper](https://huggingface.co/papers/2306.00637).
+For explanation purposes, we can also initialize the two main pipelines of Würstchen individually. Würstchen consists of 3 stages: Stage C, Stage B, Stage A. They all have different jobs and work only together. When generating text-conditional images, Stage C will first generate the latents in a very compressed latent space. This is what happens in the `prior_pipeline`. Afterwards, the generated latents will be passed to Stage B, which decompresses the latents into a bigger latent space of a VQGAN. These latents can then be decoded by Stage A, which is a VQGAN, into the pixel-space. Stage B & Stage A are both encapsulated in the `decoder_pipeline`. For more details, take a look at the [paper](https://huggingface.co/papers/2306.00637).
 
 ```python
 import torch
@@ -97,14 +98,15 @@ decoder_output = decoder_pipeline(
 ```
 
 ## Speed-Up Inference
-You can make use of ``torch.compile`` function and gain a speed-up of about 2-3x:
+You can make use of `torch.compile` function and gain a speed-up of about 2-3x:
 
 ```python
 pipeline.prior = torch.compile(pipeline.prior, mode="reduce-overhead", fullgraph=True)
 pipeline.decoder = torch.compile(pipeline.decoder, mode="reduce-overhead", fullgraph=True)
 ```
 
 ## Limitations
+
 - Due to the high compression employed by Würstchen, generations can lack a good amount
 of detail. To our human eye, this is especially noticeable in faces, hands etc.
 - **Images can only be generated in 128-pixel steps**, e.g. the next higher resolution
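The paragraph corrected in the second hunk describes Würstchen's two-pipeline split: Stage C generates compressed latents in the `prior_pipeline`, and Stages B & A decode them in the `decoder_pipeline`. As a minimal sketch of that flow, assuming the `WuerstchenPriorPipeline` and `WuerstchenDecoderPipeline` classes and the `warp-ai/wuerstchen-prior` / `warp-ai/wuerstchen` checkpoints from the diffusers library (the prompt and resolution here are illustrative, not from the commit):

```python
# Minimal sketch of the two-stage flow described in the changed paragraph.
# Assumes the WuerstchenPriorPipeline / WuerstchenDecoderPipeline classes and
# the warp-ai checkpoints from diffusers; prompt and resolution are illustrative.
import torch
from diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline

device = "cuda"
dtype = torch.float16

# Stage C: text-conditional prior that generates latents in a highly
# compressed latent space.
prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
    "warp-ai/wuerstchen-prior", torch_dtype=dtype
).to(device)

# Stage B + Stage A: decompress the latents and decode them to pixel space
# with the VQGAN.
decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained(
    "warp-ai/wuerstchen", torch_dtype=dtype
).to(device)

caption = "Anthropomorphic cat dressed as a firefighter"

# Stage C turns the prompt into compressed image embeddings ...
prior_output = prior_pipeline(prompt=caption, height=1024, width=1024)

# ... which are the only thing handed over to Stages B & A for decoding.
images = decoder_pipeline(
    image_embeddings=prior_output.image_embeddings,
    prompt=caption,
    output_type="pil",
).images
images[0].save("cat.png")
```

Running the two pipelines separately mirrors the architecture: the prior's `image_embeddings` are the sole handoff from Stage C to the decoder, which is what keeps Stage C cheap to sample in.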
