
Add AudioLDM 2 #4549


Merged: 60 commits into huggingface:main, Aug 21, 2023

Conversation

sanchit-gandhi (Contributor) commented Aug 9, 2023

What does this PR do?

Adds AudioLDM 2 from the paper AudioLDM 2: A General Framework for Audio, Music, and Speech Generation

Architecture

  1. CLAP: prompt embedding from text input
  2. Flan-T5 Encoder: second prompt embedding from text input
  3. Projection model: project the CLAP and T5 prompt embeddings to a shared space and insert special SOS/EOS tokens
  4. Language model: generate 8 new hidden-states, conditional on the projected hidden-states. Use these as the final prompt embeddings for the diffusion model
  5. UNet, Scheduler: custom UNet architecture (see below)
  6. VAE: convert the final latents to a mel-spectrogram
  7. Vocoder: convert the mel-spectrogram to an audio waveform

Steps 5-7 are the same as in AudioLDM; the remaining components are new.
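Steps 3-4 are the new conditioning mechanism. Below is a toy numpy sketch of that flow, not the real implementation: all dimensions, the random projection weights, and the one-matmul "language model" step are made-up stand-ins purely to illustrate the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two text encoders (steps 1-2). Shapes are
# illustrative, not the real model dimensions.
clap_embed = rng.standard_normal((1, 1, 512))   # CLAP: one pooled prompt vector
t5_embed = rng.standard_normal((1, 12, 1024))   # Flan-T5: per-token hidden-states

def project(hidden, w, sos, eos):
    """Step 3: project to the shared space and wrap with SOS/EOS tokens."""
    proj = hidden @ w                                   # (batch, seq, shared_dim)
    batch = proj.shape[0]
    sos = np.broadcast_to(sos, (batch, 1, sos.shape[-1]))
    eos = np.broadcast_to(eos, (batch, 1, eos.shape[-1]))
    return np.concatenate([sos, proj, eos], axis=1)

shared_dim = 768
w_clap = rng.standard_normal((512, shared_dim)) * 0.02
w_t5 = rng.standard_normal((1024, shared_dim)) * 0.02
sos_token = rng.standard_normal((shared_dim,))
eos_token = rng.standard_normal((shared_dim,))

# Concatenate the two projected-and-wrapped prompt sequences:
# (1, 1+2, 768) from CLAP plus (1, 12+2, 768) from T5 -> (1, 17, 768)
projected = np.concatenate(
    [project(clap_embed, w_clap, sos_token, eos_token),
     project(t5_embed, w_t5, sos_token, eos_token)],
    axis=1,
)

# Step 4: the language model generates 8 new hidden-states conditioned on
# the projected sequence (mocked here as one mean+tanh per generation step).
def generate_language_model(inputs, steps=8):
    states = inputs
    for _ in range(steps):
        nxt = np.tanh(states.mean(axis=1, keepdims=True))  # stand-in for one LM step
        states = np.concatenate([states, nxt], axis=1)
    return states[:, -steps:, :]  # keep only the newly generated vectors

prompt_embeds = generate_language_model(projected)
print(prompt_embeds.shape)  # (1, 8, 768)
```

The 8 generated vectors are what the diffusion model is ultimately conditioned on, alongside the T5 hidden-states (see the UNet section below).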

Diagram: (architecture diagram not reproduced here)

UNet

The vanilla UNet 2D cross-attention layer looks as follows:

  • Resnet Block (hidden_states)
  • Cross-Attention Transformer Block (hidden_states, encoder_hidden_states)

=> this is the architecture that is used in diffusers when we run the UNet forward with arguments of (hidden_states, encoder_hidden_states)

AudioLDM 2 extends the vanilla UNet architecture to use an additional self-attention layer and two cross-attention layers:

  • Resnet Block (hidden_states)
  • Self-Attention Transformer Block with double self-attention (hidden_states)
  • Cross-Attention Transformer Block 1 (hidden_states, encoder_hidden_states_1)
  • Cross-Attention Transformer Block 2 (hidden_states, encoder_hidden_states_2)

=> here we use a different set of encoder hidden-states for cross-attention blocks 1 and 2. The first hidden-states are those obtained from the T5 model. The second hidden-states are those generated from the language model. Also, we don’t want to pass either of these encoder hidden-states to the self-attention layer, since it uses double self-attention.
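The call order described above can be sketched with stand-in functions. The function and trace names here are illustrative, not the actual diffusers module names; each "block" simply records its name and which conditioning it received, so the extended layer order is easy to inspect.

```python
# Each stand-in block appends (name, conditioning) to a trace so the
# call order of the extended AudioLDM 2 layer can be inspected.
trace = []

def resnet(h):
    trace.append(("resnet", None))
    return h

def self_attn(h):
    # Double self-attention: no encoder hidden-states are passed in.
    trace.append(("self_attn", None))
    return h

def cross_attn(name, h, enc):
    trace.append((name, enc))
    return h

def audioldm2_block(h, enc1, enc2):
    """Extended layer: resnet -> self-attn -> two cross-attns."""
    h = resnet(h)
    h = self_attn(h)
    h = cross_attn("cross_attn_1", h, enc1)  # T5 hidden-states
    h = cross_attn("cross_attn_2", h, enc2)  # LM-generated hidden-states
    return h

hidden = [0.0] * 8  # dummy hidden-states
audioldm2_block(hidden, enc1="t5_states", enc2="lm_states")
print([name for name, _ in trace])
# ['resnet', 'self_attn', 'cross_attn_1', 'cross_attn_2']
```

Compare with the vanilla layer, which would call only `resnet` followed by a single cross-attention over one set of encoder hidden-states.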

Checklist

Weight conversion:

  • UNet
  • VAE
  • CLAP
  • Flan-T5 (HF Transformers)
  • GPT2 (HF Transformers)
  • HiFiGAN
  • Audio MAE embeddings / projection layers

Forward pass:

  • UNet
  • VAE
  • CLAP
  • Flan-T5
  • GPT2
  • HiFiGAN
  • Audio MAE embeddings / projection layers

Pipeline:

  • Add pipeline
  • Add tests
  • Write docs

HuggingFaceDocBuilderDev commented Aug 9, 2023

The documentation is not available anymore as the PR was closed or merged.

tin2tin commented Aug 10, 2023

If possible, maybe consider releasing a pruned version of the model, so it'll be able to run on 6 GB of VRAM?

sanchit-gandhi (Contributor, Author) commented:

It's just the slow integration tests and docs to go here! In the interest of time, would you be able to do a first pass of this @sayakpaul @williamberman to confirm that you're happy with the diffusers design? I can then go full send on finishing the last tests and tidying it up for merge tomorrow 🚀 Thanks both for your help so far!

class AudioLDM2UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
Review comment on the line above:
I think this is good if we need to make a separate class because this is a one off pipeline. I think it's possible @patrickvonplaten might have a different opinion but I don't think it's worth blocking because we can merge into the existing unet after the fact if we need to.

So let's just move forward here and I'll file an issue asking patrick to double check he's ok with it being a separate class when he's back from vacation :)

Follow-up issue filed: #4658

williamberman (Contributor) left a comment:
A few small requests, but looks good!

sanchit-gandhi (Contributor, Author) commented Aug 18, 2023

Alright, this is working nicely for the base, large, and music variants of AudioLDM 2! The TTS checkpoints require a VITS text encoder, which will be merged as part of huggingface/transformers#24085; I'm waiting on that before converting the TTS checkpoints. I've written the AudioLDM 2 pipeline in such a way that the TTS models should be directly compatible (possibly with some minor updates)!

@sanchit-gandhi sanchit-gandhi merged commit 7a24977 into huggingface:main Aug 21, 2023
patrickvonplaten (Contributor) commented:

Great PR and great reviews here - nice!

tuanh123789 mentioned this pull request Oct 12, 2023
yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
* from audioldm

* unet down + mid

* vae, clap, flan-t5

* start sequence audio mae

* iterate on audioldm encoder

* finish encoder

* finish weight conversion

* text pre-processing

* gpt2 pre-processing

* fix projection model

* working

* unet equivalence

* finish in base

* add unet cond

* finish unet

* finish custom unet

* start clean-up

* revert base unet changes

* refactor pre-processing

* tests: from audioldm

* fix some tests

* more fixes

* iterate on tests

* make fix copies

* harden fast tests

* slow integration tests

* finish tests

* update checkpoint

* update copyright

* docs

* remove outdated method

* add docstring

* make style

* remove decode latents

* enable cpu offload

* (text_encoder_1, tokenizer_1) -> (text_encoder, tokenizer)

* more clean up

* more refactor

* build pr docs

* Update docs/source/en/api/pipelines/audioldm2.md

Co-authored-by: Sayak Paul <[email protected]>

* small clean

* tidy conversion

* update for large checkpoint

* generate -> generate_language_model

* full clap model

* shrink clap-audio in tests

* fix large integration test

* fix fast tests

* use generation config

* make style

* update docs

* finish docs

* finish doc

* update tests

* fix last test

* syntax

* finalise tests

* refactor projection model in prep for TTS

* fix fast tests

* style

---------

Co-authored-by: Sayak Paul <[email protected]>
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
(squashed commit message identical to the one above)
6 participants