This is a port of https://github.com/resemble-ai/chatterbox to vLLM. Why?
- Improved performance and more efficient use of GPU memory.
- Early benchmarks show a ~4x speedup in generation toks/s without batching, and over 10x with batching. This is a significant improvement over the original Chatterbox implementation, which was bottlenecked by unnecessary CPU-GPU syncs and transfers within HF Transformers.
- More rigorous benchmarking is WIP, but will likely come after batching is fully fleshed out.
- Easier integration with state-of-the-art inference infrastructure.
DISCLAIMER: THIS IS A PERSONAL PROJECT and is not affiliated with my employer or any other corporate entity in any way. The project is based solely on publicly-available information. All opinions are my own and do not necessarily represent the views of my employer.
- ✅ Basic speech cloning with audio and text conditioning.
- ✅ Outputs match the quality of the original Chatterbox implementation.
- ✅ Classifier-Free Guidance (CFG) is implemented.
  - Due to a vLLM limitation, CFG cannot be tuned on a per-request basis and can only be configured via the `CHATTERBOX_CFG_SCALE` environment variable (see the example below this list).
- ✅ Exaggeration control is implemented.
- ✅ vLLM batching is implemented and produces a significant speedup.
- ℹ️ Project uses vLLM internal APIs and extremely hacky workarounds to get things done.
  - Refactoring to the idiomatic vLLM way of doing things is WIP, but will require some changes to vLLM.
  - Until then, this is a Rube Goldberg machine that will likely only work with vLLM 0.9.2.
  - Follow vllm-project/vllm#21989 for updates.
- ℹ️ Substantial refactoring is needed to further clean up unnecessary workarounds and code paths.
- ℹ️ Server API is not implemented and will likely be out-of-scope for this project.
- ❌ Learned speech positional embeddings are not applied, pending support in vLLM. However, this does not appear to cause a noticeable degradation in quality.
- ❌ APIs are not yet stable and may change.
- ❌ Benchmarks and performance optimizations are not yet implemented.
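For example, a minimal sketch of setting the CFG scale before the engine starts (the variable name comes from the note above; the value 0.5 is just an example):

```python
import os

# Assumption: this must be set before the vLLM engine is constructed,
# since CFG cannot be changed on a per-request basis.
os.environ["CHATTERBOX_CFG_SCALE"] = "0.5"
```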
```bash
uv venv
source .venv/bin/activate
uv sync
```
The package should automatically download the correct model weights from the Hugging Face Hub. You should then be able to run `python example-tts.py` to generate audio samples.
If you encounter CUDA issues, try resetting the venv and using `uv pip install -e .` instead of `uv sync`.
To run a benchmark, tweak and run `benchmark.py`.
The following results were obtained with batching on a 6.6k-word input (`docs/benchmark-text-1.txt`), generating ~40min of audio.
Notes:
- I'm not entirely sure what the toks/s figures reported by vLLM actually measure - they probably aren't directly comparable to other setups, but the results speak for themselves.
- With vLLM, the T3 model is no longer the bottleneck.
  - The vast majority of time is now spent in the S3Gen model, which has not been ported to vLLM. It currently uses the original reference implementation from the Chatterbox repo, so there's potential for integrating some of the other community optimizations here.
  - This also means the vLLM portion of the model never fully ramps to its peak throughput in these benchmarks.
- Benchmarks are done without CUDA graphs, as that is currently causing correctness issues.
- There are some issues with my very rudimentary chunking logic, which causes occasional artifacts in output quality (a sketch of what I mean by chunking follows below).
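For context, this is roughly the kind of sentence-based chunking involved - a purely illustrative sketch, not the project's actual logic:

```python
import re

def chunk_text(text: str, max_words: int = 60) -> list[str]:
    """Illustrative only: split on sentence boundaries, then greedily pack
    sentences into chunks under a word budget. Naive splitting like this can
    cut prosody at awkward points, which is one source of audible artifacts."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```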
System Specs:
- RTX 3090: 24GB VRAM
- AMD Ryzen 9 7900X @ 5.70GHz
- 128GB DDR5 4800 MT/s
Settings & Results:
- Input text: `docs/benchmark-text-1.txt` (6.6k words)
- Input audio: `docs/audio-sample-03.mp3`
- Exaggeration: 0.5, CFG: 0.5, Temperature: 0.8
- CUDA graphs disabled, vLLM max memory utilization=0.6
- Generated output length: 39m50s
- Wall time: 2m30s
- Generation time (without model startup time): 133s
- Time spent in T3 Llama token generation: 20.6s
- Time spent in S3Gen waveform generation: 111s
Logs:
[BENCHMARK] Text chunked into 154 chunks
[config.py:1472] Using max model len 1200
[default_loader.py:272] Loading weights took 0.16 seconds
[gpu_model_runner.py:1801] Model loading took 1.0107 GiB and 0.215037 seconds
[gpu_model_runner.py:2238] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 241 conditionals items of the maximum feature size.
[gpu_worker.py:232] Available KV cache memory: 12.59 GiB
[kv_cache_utils.py:716] GPU KV cache size: 110,000 tokens
[kv_cache_utils.py:720] Maximum concurrency for 1,200 tokens per request: 91.67x
[BENCHMARK] Model loaded in 7.156545400619507 seconds
Adding requests: 100%|████| 40/40 [00:00<00:00, 1686.08it/s]
Processed prompts: 100%|████| 40/40 [00:05<00:00, 7.73it/s, est. speed input: 1487.13 toks/s, output: 3060.03 toks/s]
[T3] Speech Token Generation time: 5.20s
[S3Gen] Wavform Generation time: 29.09s
Adding requests: 100%|████| 40/40 [00:00<00:00, 1832.95it/s]
Processed prompts: 100%|████| 40/40 [00:05<00:00, 7.61it/s, est. speed input: 1522.47 toks/s, output: 3130.34 toks/s]
[T3] Speech Token Generation time: 5.28s
[S3Gen] Wavform Generation time: 30.40s
Adding requests: 100%|████| 40/40 [00:00<00:00, 1801.83it/s]
Processed prompts: 100%|████| 40/40 [00:05<00:00, 7.65it/s, est. speed input: 1326.87 toks/s, output: 2912.80 toks/s]
[T3] Speech Token Generation time: 5.25s
[S3Gen] Wavform Generation time: 28.37s
Adding requests: 100%|████| 34/34 [00:00<00:00, 1780.35it/s]
Processed prompts: 100%|████| 34/34 [00:04<00:00, 7.09it/s, est. speed input: 1274.34 toks/s, output: 2582.66 toks/s]
[T3] Speech Token Generation time: 4.82s
[S3Gen] Wavform Generation time: 23.74s
[BENCHMARK] Generation completed in 132.7742235660553 seconds
[BENCHMARK] Audio saved to benchmark.mp3
[BENCHMARK] Total time: 144.99638843536377 seconds
real 2m30.700s
user 2m54.372s
sys 0m2.205s
System Specs:
- RTX 3060ti: 8GB VRAM
- Intel i7-7700K @ 4.20GHz
- 32GB DDR4 2133 MT/s
Settings & Results:
- Input text: `docs/benchmark-text-1.txt` (6.6k words)
- Input audio: `docs/audio-sample-03.mp3`
- Exaggeration: 0.5, CFG: 0.5, Temperature: 0.8
- CUDA graphs disabled, vLLM max memory utilization=0.6
- Generated output length: 40m15s
- Wall time: 4m26s
- Generation time (without model startup time): 238s
- Time spent in T3 Llama token generation: 36.4s
- Time spent in S3Gen waveform generation: 201s
Logs:
[BENCHMARK] Text chunked into 154 chunks.
INFO [config.py:1472] Using max model len 1200
INFO [default_loader.py:272] Loading weights took 0.39 seconds
INFO [gpu_model_runner.py:1801] Model loading took 1.0107 GiB and 0.497231 seconds
INFO [gpu_model_runner.py:2238] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 241 conditionals items of the maximum feature size.
INFO [gpu_worker.py:232] Available KV cache memory: 3.07 GiB
INFO [kv_cache_utils.py:716] GPU KV cache size: 26,816 tokens
INFO [kv_cache_utils.py:720] Maximum concurrency for 1,200 tokens per request: 22.35x
Adding requests: 100%|████| 40/40 [00:00<00:00, 947.42it/s]
Processed prompts: 100%|████| 40/40 [00:09<00:00, 4.15it/s, est. speed input: 799.18 toks/s, output: 1654.94 toks/s]
[T3] Speech Token Generation time: 9.68s
[S3Gen] Wavform Generation time: 53.66s
Adding requests: 100%|████| 40/40 [00:00<00:00, 858.75it/s]
Processed prompts: 100%|████| 40/40 [00:08<00:00, 4.69it/s, est. speed input: 938.19 toks/s, output: 1874.97 toks/s]
[T3] Speech Token Generation time: 8.58s
[S3Gen] Wavform Generation time: 53.86s
Adding requests: 100%|████| 40/40 [00:00<00:00, 815.60it/s]
Processed prompts: 100%|████| 40/40 [00:09<00:00, 4.19it/s, est. speed input: 726.62 toks/s, output: 1531.24 toks/s]
[T3] Speech Token Generation time: 9.60s
[S3Gen] Wavform Generation time: 49.89s
Adding requests: 100%|████| 34/34 [00:00<00:00, 938.61it/s]
Processed prompts: 100%|████| 34/34 [00:08<00:00, 3.98it/s, est. speed input: 714.68 toks/s, output: 1439.42 toks/s]
[T3] Speech Token Generation time: 8.59s
[S3Gen] Wavform Generation time: 43.58s
[BENCHMARK] Generation completed in 238.42230987548828 seconds
[BENCHMARK] Audio saved to benchmark.mp3
[BENCHMARK] Total time: 259.1808190345764 seconds
real 4m26.803s
user 4m42.393s
sys 0m4.285s
I could not find an official explanation of the Chatterbox architecture, so the following is my best understanding based on the codebase. Chatterbox broadly follows the CosyVoice architecture, applying intermediate-fusion multimodal conditioning to a 0.5B parameter Llama model.
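As a rough mental model (the names, shapes, and projection layer below are illustrative assumptions, not the project's actual code), "intermediate fusion" here means the conditioning is spliced directly into the embedding sequence consumed by the Llama backbone, rather than being fused at the raw-text input or at the output:

```python
import torch
import torch.nn as nn

def build_t3_input_embeds(
    speaker_embed: torch.Tensor,   # (1, d) conditioning vector derived from the reference audio
    exaggeration: torch.Tensor,    # (1, 1) scalar exaggeration control
    text_embeds: torch.Tensor,     # (T_text, d) embedded text tokens
    speech_embeds: torch.Tensor,   # (T_speech, d) embedded speech tokens generated so far
    emotion_proj: nn.Linear,       # hypothetical 1 -> d projection for the exaggeration scalar
) -> torch.Tensor:
    # The backbone attends over one fused sequence: [conditioning | text | speech].
    cond = torch.cat([speaker_embed, emotion_proj(exaggeration)], dim=0)  # (2, d)
    return torch.cat([cond, text_embeds, speech_embeds], dim=0)
```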
vLLM does not support CFG natively, so substantial hacks were needed to make it work. At a high level, we trick vLLM into thinking the model's hidden dimension is double its actual size, then split and restack the hidden states to invoke Llama with double the original batch size. This does pose a risk that vLLM will underestimate the memory requirements of the model - more research is needed into whether vLLM's initial profiling pass will capture this nuance.
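A simplified sketch of the idea (shapes and names are illustrative; the real code has to thread this through vLLM's model-runner plumbing, and the guidance formula shown is one common classifier-free guidance formulation rather than a claim about the exact one used):

```python
import torch

def doubled_hidden_forward(llama_forward, hidden_2d: torch.Tensor) -> torch.Tensor:
    """vLLM sees a model with hidden size 2*d, so each position carries a
    conditional and an unconditional state side by side. Split them, run the
    real d-wide Llama at 2x the batch size, then re-pack into the doubled layout."""
    n, two_d = hidden_2d.shape
    d = two_d // 2
    cond, uncond = hidden_2d[:, :d], hidden_2d[:, d:]       # (N, d) each
    out = llama_forward(torch.cat([cond, uncond], dim=0))   # (2N, d)
    return torch.cat([out[:n], out[n:]], dim=-1)             # back to (N, 2*d)

def apply_cfg(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, scale: float) -> torch.Tensor:
    # Guidance applied at sampling time: push the conditional distribution
    # away from the unconditional one by the configured scale.
    return uncond_logits + scale * (cond_logits - uncond_logits)
```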