🟣 EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio combines high-quality audio synthesis with lower computational demands.
🎛 Play with EzAudio for text-to-audio generation, editing, and inpainting: EzAudio Space
🎮 EzAudio-ControlNet Demo is available: EzAudio-ControlNet Space
- 2025.05: EzAudio has been accepted for an oral presentation at Interspeech 2025.
Clone the repository:
```bash
git clone git@github.com:haidog-yaqub/EzAudio.git
```
Install the dependencies:
```bash
cd EzAudio
pip install -r requirements.txt
```
Download checkpoints (optional): https://huggingface.co/OpenSound/EzAudio
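The checkpoints can also be fetched programmatically. A minimal sketch using huggingface_hub's snapshot_download, assuming the huggingface_hub package is installed (note the EzAudio API also supports automatic checkpoint downloading, see the roadmap below):

```python
# Sketch: download the released checkpoints from the Hugging Face Hub.
# Assumes `huggingface_hub` is installed: pip install huggingface_hub
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(repo_id='OpenSound/EzAudio')
print(ckpt_dir)  # local directory containing the downloaded checkpoint files
```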
You can use the model with the following code:
```python
from api.ezaudio import EzAudio
import torch
import soundfile as sf
# load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)
# text-to-audio generation
prompt = "a dog barking in the distance"
sr, audio = ezaudio.generate_audio(prompt)
sf.write(f'{prompt}.wav', audio, sr)
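
# The loaded model can be reused across prompts without reloading.
# (Sketch: only the generate_audio(prompt) call shown above is assumed.)
for extra_prompt in ["rain on a tin roof", "distant thunder"]:
    sr, audio = ezaudio.generate_audio(extra_prompt)
    sf.write(f'{extra_prompt}.wav', audio, sr)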
# audio inpainting
prompt = "A train passes by, blowing its horns"
original_audio = 'egs/edit_example.wav'
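# Note (assumption inferred from this example, not a documented spec):
# mask_start/mask_length appear to select the region to regenerate and
# boundary the surrounding context, all apparently in seconds.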
sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio,
                                  mask_start=1, mask_length=5)
sf.write(f'{prompt}_edit.wav', audio, sr)
```

ControlNet Usage:
```python
from api.ezaudio import EzAudio_ControlNet
import torch
import soundfile as sf
# load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
controlnet = EzAudio_ControlNet(model_name='energy', device=device)
prompt = 'dog barking'
# path to the reference audio
audio_path = 'egs/reference.mp3'
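# Assumption: with model_name='energy', the energy envelope of this
# reference audio guides the dynamics of the generated audio.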
sr, audio = controlnet.generate_audio(prompt, audio_path=audio_path)
sf.write(f"{prompt}_control.wav", audio, samplerate=sr)Refer to the VAE training section in our work SoloAudio
Prepare your data (see example in src/dataset/meta_example.csv), then run:
```bash
cd src
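# Optional (standard Accelerate CLI, not an EzAudio-specific step):
# run `accelerate config` first to set up hardware options such as
# multi-GPU training.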
accelerate launch train.py
```

- Release Gradio Demo along with checkpoints: EzAudio Space
- Release ControlNet Demo along with checkpoints: EzAudio ControlNet Space
- Release inference code
- Release training pipeline and dataset
- Improve API and support automatic checkpoint downloading
If you find the code useful for your research, please consider citing:
```bibtex
@article{hai2024ezaudio,
title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
journal={arXiv preprint arXiv:2409.10819},
year={2024}
}
```

Some code is borrowed from or inspired by: U-ViT, PixArt-α, Hunyuan-DiT, and Stable Audio.
