
Commit 7a24977

Add AudioLDM 2 (huggingface#4549)
* from audioldm
* unet down + mid
* vae, clap, flan-t5
* start sequence audio mae
* iterate on audioldm encoder
* finish encoder
* finish weight conversion
* text pre-processing
* gpt2 pre-processing
* fix projection model
* working
* unet equivalence
* finish in base
* add unet cond
* finish unet
* finish custom unet
* start clean-up
* revert base unet changes
* refactor pre-processing
* tests: from audioldm
* fix some tests
* more fixes
* iterate on tests
* make fix copies
* harden fast tests
* slow integration tests
* finish tests
* update checkpoint
* update copyright
* docs
* remove outdated method
* add docstring
* make style
* remove decode latents
* enable cpu offload
* (text_encoder_1, tokenizer_1) -> (text_encoder, tokenizer)
* more clean up
* more refactor
* build pr docs
* Update docs/source/en/api/pipelines/audioldm2.md

  Co-authored-by: Sayak Paul <[email protected]>

* small clean
* tidy conversion
* update for large checkpoint
* generate -> generate_language_model
* full clap model
* shrink clap-audio in tests
* fix large integration test
* fix fast tests
* use generation config
* make style
* update docs
* finish docs
* finish doc
* update tests
* fix last test
* syntax
* finalise tests
* refactor projection model in prep for TTS
* fix fast tests
* style

---------

Co-authored-by: Sayak Paul <[email protected]>
1 parent 74d902e commit 7a24977


12 files changed: +4350 −0 lines


docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -190,6 +190,8 @@
       title: Audio Diffusion
     - local: api/pipelines/audioldm
       title: AudioLDM
+    - local: api/pipelines/audioldm2
+      title: AudioLDM 2
     - local: api/pipelines/auto_pipeline
       title: AutoPipeline
     - local: api/pipelines/consistency_models
docs/source/en/api/pipelines/audioldm2.md

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# AudioLDM 2

AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734)
by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate
text-conditional sound effects, human speech and music.

Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM 2
is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings. Two
text encoder models are used to compute the text embeddings from a prompt input: the text branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap)
and the encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5). These text embeddings
are then projected to a shared embedding space by an [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2/AudioLDM2ProjectionModel).
A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) _language model (LM)_ is used to auto-regressively
predict eight new embedding vectors, conditioned on the projected CLAP and Flan-T5 embeddings. The generated embedding
vectors and the Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The [UNet](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2/AudioLDM2UNet2DConditionModel)
of AudioLDM 2 is unique in that it takes **two** sets of cross-attention embeddings, as opposed to the single cross-attention
conditioning used in most other LDMs.
30+
The abstract of the paper is the following:
31+
32+
*Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called language of audio (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches.*
33+
34+
This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be
35+
found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2).
36+
37+
## Tips
38+
39+
### Choosing a checkpoint
40+
41+
AudioLDM2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio generation. The third checkpoint is trained exclusively on text-to-music generation. See table below for details on the three official checkpoints:
42+
43+
| Checkpoint | Task | Model Size | Training Data / h |
44+
|-----------------------------------------------------------------|---------------|------------|-------------------|
45+
| [audioldm2](https://huggingface.co/cvssp/audioldm2) | Text-to-audio | 1.1B | 1150k |
46+
| [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music) | Text-to-music | 1.1B | 665k |
47+
| [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large) | Text-to-audio | 1.5B | 1150k |
48+
49+
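
Whichever variant you pick, it loads the same way: pass the repo id from the table to `from_pretrained` (half precision
is an optional memory saving, not a requirement). A minimal sketch using the large text-to-audio checkpoint:

```python
import torch
from diffusers import AudioLDM2Pipeline

# swap the repo id for "cvssp/audioldm2" or "cvssp/audioldm2-music" as needed
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2-large", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
```
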
### Constructing a prompt

* Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context-specific (e.g. "water stream in a forest" instead of "stream").
* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with.
* Using a **negative prompt** can significantly improve the quality of the generated waveform, by guiding the generation away from terms that correspond to poor-quality audio. Try using a negative prompt of "Low quality."

### Controlling inference

* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.

### Evaluating generated waveforms

* The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation.
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring is performed between the generated waveforms and the prompt text, and the audios are ranked from best to worst accordingly.

The following example demonstrates how to generate good music by applying the aforementioned tips:

```python
import scipy
import torch
from diffusers import AudioLDM2Pipeline

# load the best weights for music generation
repo_id = "cvssp/audioldm2-music"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# define the prompts
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
negative_prompt = "Low quality."

# set the seed for reproducibility
generator = torch.Generator("cuda").manual_seed(0)

# run the generation
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,
    generator=generator,  # pass the seeded generator so the result is reproducible
).audios

# save the best audio sample (index 0) as a .wav file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio[0])
```
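
If GPU memory is tight, the full pipeline does not have to sit on the device at once. This commit enables CPU offload
for the pipeline, so the generic `enable_model_cpu_offload` helper (which requires `accelerate`) can be used in place of
`pipe.to("cuda")`; a minimal sketch with an illustrative prompt:

```python
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)

# keep sub-models on the CPU and move each one to the GPU only while it is needed,
# trading some speed for a much lower peak VRAM footprint
pipe.enable_model_cpu_offload()

audio = pipe("Water dripping into a metal bucket", audio_length_in_s=5.0).audios[0]
```
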
<Tip>

Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between
scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines)
section to learn how to efficiently load the same components into multiple pipelines.

</Tip>
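
As a concrete illustration of the scheduler tradeoff mentioned in the tip, the usual diffusers pattern of rebuilding a
scheduler from the pipeline's existing config applies here too. This sketch assumes `DPMSolverMultistepScheduler` is a
reasonable fit for AudioLDM 2, which the docs above do not claim; treat it as an experiment rather than a recommendation:

```python
import torch
from diffusers import AudioLDM2Pipeline, DPMSolverMultistepScheduler

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# swap the default scheduler for a multistep solver, reusing its configuration;
# faster schedulers typically need fewer inference steps for comparable quality
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

audio = pipe("Birds singing in a quiet forest", num_inference_steps=25).audios[0]
```
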
## AudioLDM2Pipeline
[[autodoc]] AudioLDM2Pipeline
	- all
	- __call__

## AudioLDM2ProjectionModel
[[autodoc]] AudioLDM2ProjectionModel
	- forward

## AudioLDM2UNet2DConditionModel
[[autodoc]] AudioLDM2UNet2DConditionModel
	- forward
