
Add AudioLDM 2 #4549


Merged: 60 commits into huggingface:main, Aug 21, 2023

Conversation

sanchit-gandhi (Contributor) commented Aug 9, 2023

What does this PR do?

Adds AudioLDM 2 from the paper AudioLDM 2: A General Framework for Audio, Music, and Speech Generation

Architecture

  1. CLAP: prompt embedding from text input
  2. Flan-T5 Encoder: second prompt embedding from text input
  3. Projection model: project the CLAP and T5 prompt embeddings to a shared space and insert special SOS/EOS tokens
  4. Language model: generate 8 new hidden-states, conditional on the projected hidden-states. Use these as the final prompt embeddings for the diffusion model
  5. UNet, Scheduler: custom UNet architecture (see below)
  6. VAE: convert the final latents to a mel-spectrogram
  7. Vocoder: convert the mel-spectrogram to an audio waveform

Steps 5-7 are the same as in AudioLDM; the remaining components are new.
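Steps 3-4 are the new conditioning mechanism. Below is a toy numpy sketch of that flow, not the real implementation: all dimensions, the random projection weights, and the one-matmul "language model" step are made-up stand-ins purely to illustrate the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two text encoders (steps 1-2). Shapes are
# illustrative, not the real model dimensions.
clap_embed = rng.standard_normal((1, 1, 512))   # CLAP: one pooled prompt vector
t5_embed = rng.standard_normal((1, 12, 1024))   # Flan-T5: per-token hidden-states

def project(hidden, w, sos, eos):
    """Step 3: project to the shared space and wrap with SOS/EOS tokens."""
    proj = hidden @ w                                   # (batch, seq, shared_dim)
    batch = proj.shape[0]
    sos = np.broadcast_to(sos, (batch, 1, sos.shape[-1]))
    eos = np.broadcast_to(eos, (batch, 1, eos.shape[-1]))
    return np.concatenate([sos, proj, eos], axis=1)

shared_dim = 768
w_clap = rng.standard_normal((512, shared_dim)) * 0.02
w_t5 = rng.standard_normal((1024, shared_dim)) * 0.02
sos_token = rng.standard_normal((shared_dim,))
eos_token = rng.standard_normal((shared_dim,))

# Concatenate the two projected-and-wrapped prompt sequences:
# (1, 1+2, 768) from CLAP plus (1, 12+2, 768) from T5 -> (1, 17, 768)
projected = np.concatenate(
    [project(clap_embed, w_clap, sos_token, eos_token),
     project(t5_embed, w_t5, sos_token, eos_token)],
    axis=1,
)

# Step 4: the language model generates 8 new hidden-states conditioned on
# the projected sequence (mocked here as one mean+tanh per generation step).
def generate_language_model(inputs, steps=8):
    states = inputs
    for _ in range(steps):
        nxt = np.tanh(states.mean(axis=1, keepdims=True))  # stand-in for one LM step
        states = np.concatenate([states, nxt], axis=1)
    return states[:, -steps:, :]  # keep only the newly generated vectors

prompt_embeds = generate_language_model(projected)
print(prompt_embeds.shape)  # (1, 8, 768)
```

The 8 generated vectors are what the diffusion model is ultimately conditioned on, alongside the T5 hidden-states (see the UNet section below).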

Diagram: (architecture diagram not reproduced here)

UNet

The vanilla UNet 2D cross-attention layer looks as follows:

  • Resnet Block (hidden_states)
  • Cross-Attention Transformer Block (hidden_states, encoder_hidden_states)

=> this is the architecture that is used in diffusers when we run the UNet forward with arguments of (hidden_states, encoder_hidden_states)

AudioLDM 2 extends the vanilla UNet architecture to use an additional self-attention layer and two cross-attention layers:

  • Resnet Block (hidden_states)
  • Self-Attention Transformer Block with double self-attention (hidden_states)
  • Cross-Attention Transformer Block 1 (hidden_states, encoder_hidden_states_1)
  • Cross-Attention Transformer Block 2 (hidden_states, encoder_hidden_states_2)

=> here we use a different set of encoder hidden-states for cross-attention blocks 1 and 2. The first hidden-states are those obtained from the T5 model. The second hidden-states are those generated from the language model. Also, we don’t want to pass either of these encoder hidden-states to the self-attention layer, since it uses double self-attention.
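The call order described above can be sketched with stand-in functions. The function and trace names here are illustrative, not the actual diffusers module names; each "block" simply records its name and which conditioning it received, so the extended layer order is easy to inspect.

```python
# Each stand-in block appends (name, conditioning) to a trace so the
# call order of the extended AudioLDM 2 layer can be inspected.
trace = []

def resnet(h):
    trace.append(("resnet", None))
    return h

def self_attn(h):
    # Double self-attention: no encoder hidden-states are passed in.
    trace.append(("self_attn", None))
    return h

def cross_attn(name, h, enc):
    trace.append((name, enc))
    return h

def audioldm2_block(h, enc1, enc2):
    """Extended layer: resnet -> self-attn -> two cross-attns."""
    h = resnet(h)
    h = self_attn(h)
    h = cross_attn("cross_attn_1", h, enc1)  # T5 hidden-states
    h = cross_attn("cross_attn_2", h, enc2)  # LM-generated hidden-states
    return h

hidden = [0.0] * 8  # dummy hidden-states
audioldm2_block(hidden, enc1="t5_states", enc2="lm_states")
print([name for name, _ in trace])
# ['resnet', 'self_attn', 'cross_attn_1', 'cross_attn_2']
```

Compare with the vanilla layer, which would call only `resnet` followed by a single cross-attention over one set of encoder hidden-states.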

Checklist

Weight conversion:

  • UNet
  • VAE
  • CLAP
  • Flan-T5 (HF Transformers)
  • GPT2 (HF Transformers)
  • HiFiGAN
  • Audio MAE embeddings / projection layers

Forward pass:

  • UNet
  • VAE
  • CLAP
  • Flan-T5
  • GPT2
  • HiFiGAN
  • Audio MAE embeddings / projection layers

Pipeline:

  • Add pipeline
  • Add tests
  • Write docs

HuggingFaceDocBuilderDev commented Aug 9, 2023

The documentation is not available anymore as the PR was closed or merged.

tin2tin commented Aug 10, 2023

If possible, maybe consider releasing a pruned version of the model, so it'll be able to run on 6 GB of VRAM?

sanchit-gandhi (Contributor, Author) commented:

It's just the slow integration tests and docs to go here! In the interest of time, would you be able to do a first pass of this @sayakpaul @williamberman to confirm that you're happy with the diffusers design? I can then go full send on finishing the last tests and tidying it up for merge tomorrow 🚀 Thanks both for your help so far!

class AudioLDM2UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
Review comment on the line above:
I think this is good if we need to make a separate class because this is a one off pipeline. I think it's possible @patrickvonplaten might have a different opinion but I don't think it's worth blocking because we can merge into the existing unet after the fact if we need to.

So let's just move forward here and I'll file an issue asking patrick to double check he's ok with it being a separate class when he's back from vacation :)

Follow-up issue filed: #4658

williamberman (Contributor) left a comment:
A few small requests, but looks good!

sanchit-gandhi (Contributor, Author) commented Aug 18, 2023

Alright, this is working nicely for the base, large, and music variants of AudioLDM 2! The TTS checkpoints require a VITS text encoder, which will be merged as part of huggingface/transformers#24085; I'm waiting on that before converting the TTS checkpoints. I've written the AudioLDM 2 pipeline in such a way that the TTS models should be directly compatible (possibly with some minor updates)!

@sanchit-gandhi sanchit-gandhi merged commit 7a24977 into huggingface:main Aug 21, 2023
patrickvonplaten (Contributor) commented:

Great PR and great reviews here - nice!

tuanh123789 mentioned this pull request Oct 12, 2023
yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
* from audioldm

* unet down + mid

* vae, clap, flan-t5

* start sequence audio mae

* iterate on audioldm encoder

* finish encoder

* finish weight conversion

* text pre-processing

* gpt2 pre-processing

* fix projection model

* working

* unet equivalence

* finish in base

* add unet cond

* finish unet

* finish custom unet

* start clean-up

* revert base unet changes

* refactor pre-processing

* tests: from audioldm

* fix some tests

* more fixes

* iterate on tests

* make fix copies

* harden fast tests

* slow integration tests

* finish tests

* update checkpoint

* update copyright

* docs

* remove outdated method

* add docstring

* make style

* remove decode latents

* enable cpu offload

* (text_encoder_1, tokenizer_1) -> (text_encoder, tokenizer)

* more clean up

* more refactor

* build pr docs

* Update docs/source/en/api/pipelines/audioldm2.md

Co-authored-by: Sayak Paul <[email protected]>

* small clean

* tidy conversion

* update for large checkpoint

* generate -> generate_language_model

* full clap model

* shrink clap-audio in tests

* fix large integration test

* fix fast tests

* use generation config

* make style

* update docs

* finish docs

* finish doc

* update tests

* fix last test

* syntax

* finalise tests

* refactor projection model in prep for TTS

* fix fast tests

* style

---------

Co-authored-by: Sayak Paul <[email protected]>
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
(squashed commit message identical to the one above)
6 participants