This is a port of https://github.com/resemble-ai/chatterbox to vLLM. Why?
- Improved performance and more efficient use of GPU memory.
- Early benchmarks show a ~4x speedup in generation toks/s without batching, and over 10x with batching. This is a significant improvement over the original Chatterbox implementation, which was bottlenecked by unnecessary CPU-GPU syncs and transfers within HF Transformers.
- More rigorous benchmarking is WIP, but will likely come after batching is fully fleshed out.
- Easier integration with state-of-the-art inference infrastructure.
DISCLAIMER: THIS IS A PERSONAL PROJECT and is not affiliated with my employer or any other corporate entity in any way. The project is based solely on publicly-available information. All opinions are my own and do not necessarily represent the views of my employer.
- ✅ Basic speech cloning with audio and text conditioning.
- ✅ Outputs match the quality of the original Chatterbox implementation.
- ✅ Classifier-Free Guidance (CFG) is implemented.
  - Due to a vLLM limitation, CFG cannot be tuned on a per-request basis and can only be configured via the `CHATTERBOX_CFG_SCALE` environment variable (see the example below this list).
- ✅ Exaggeration control is implemented.
- ✅ vLLM batching is implemented and produces a significant speedup.
- ℹ️ Project uses vLLM internal APIs and extremely hacky workarounds to get things done.
  - Refactoring to the idiomatic vLLM way of doing things is WIP, but will require some changes to vLLM.
  - Until then, this is a Rube Goldberg machine that will likely only work with vLLM 0.9.2.
  - Follow vllm-project/vllm#21989 for updates.
- ℹ️ Substantial refactoring is needed to further clean up unnecessary workarounds and code paths.
- ℹ️ Server API is not implemented and will likely be out-of-scope for this project.
- ❌ Learned speech positional embeddings are not applied, pending support in vLLM. However, this does not appear to cause a noticeable degradation in quality.
- ❌ APIs are not yet stable and may change.
- ❌ Benchmarks and performance optimizations are not yet implemented.
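For example, a minimal sketch of setting the CFG scale before the engine starts (the variable name comes from the note above; the value 0.5 is just an example):

```python
import os

# Assumption: this must be set before the vLLM engine is constructed,
# since CFG cannot be changed on a per-request basis.
os.environ["CHATTERBOX_CFG_SCALE"] = "0.5"
```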
```bash
uv venv
source .venv/bin/activate
uv sync
```
The package should automatically download the correct model weights from the Hugging Face Hub. You should then be able to run `python example-tts.py` to generate audio samples.
If you encounter CUDA issues, try resetting the venv and using `uv pip install -e .` instead of `uv sync`.
To run a benchmark, tweak and run `benchmark.py`.
The following results were obtained with batching on a 6.6k-word input (`docs/benchmark-text-1.txt`), generating ~40min of audio.
Notes:
- I'm not entirely sure what the toks/s figures reported by vLLM actually measure - they probably aren't directly comparable to other setups, but the results speak for themselves.
- With vLLM, the T3 model is no longer the bottleneck.
  - The vast majority of time is now spent in the S3Gen model, which has not been ported to vLLM. It currently uses the original reference implementation from the Chatterbox repo, so there's potential for integrating some of the other community optimizations here.
  - This also means the vLLM portion of the model never fully ramps to its peak throughput in these benchmarks.
- Benchmarks are done without CUDA graphs, as that is currently causing correctness issues.
- There are some issues with my very rudimentary chunking logic, which causes occasional artifacts in output quality (a sketch of what I mean by chunking follows below).
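For context, this is roughly the kind of sentence-based chunking involved - a purely illustrative sketch, not the project's actual logic:

```python
import re

def chunk_text(text: str, max_words: int = 60) -> list[str]:
    """Illustrative only: split on sentence boundaries, then greedily pack
    sentences into chunks under a word budget. Naive splitting like this can
    cut prosody at awkward points, which is one source of audible artifacts."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```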
System Specs:
- RTX 3090: 24GB VRAM
- AMD Ryzen 9 7900X @ 5.70GHz
- 128GB DDR5 4800 MT/s
Settings & Results:
- Input text: `docs/benchmark-text-1.txt` (6.6k words)
- Input audio: `docs/audio-sample-03.mp3`
- Exaggeration: 0.5, CFG: 0.5, Temperature: 0.8
- CUDA graphs disabled, vLLM max memory utilization=0.6
- Generated output length: 39m50s
- Wall time: 2m30s
- Generation time (without model startup time): 133s
- Time spent in T3 Llama token generation: 20.6s
- Time spent in S3Gen waveform generation: 111s
Logs:
[BENCHMARK] Text chunked into 154 chunks
[config.py:1472] Using max model len 1200
[default_loader.py:272] Loading weights took 0.16 seconds
[gpu_model_runner.py:1801] Model loading took 1.0107 GiB and 0.215037 seconds
[gpu_model_runner.py:2238] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 241 conditionals items of the maximum feature size.
[gpu_worker.py:232] Available KV cache memory: 12.59 GiB
[kv_cache_utils.py:716] GPU KV cache size: 110,000 tokens
[kv_cache_utils.py:720] Maximum concurrency for 1,200 tokens per request: 91.67x
[BENCHMARK] Model loaded in 7.156545400619507 seconds
Adding requests: 100%|████| 40/40 [00:00<00:00, 1686.08it/s]
Processed prompts: 100%|████| 40/40 [00:05<00:00, 7.73it/s, est. speed input: 1487.13 toks/s, output: 3060.03 toks/s]
[T3] Speech Token Generation time: 5.20s
[S3Gen] Wavform Generation time: 29.09s
Adding requests: 100%|████| 40/40 [00:00<00:00, 1832.95it/s]
Processed prompts: 100%|████| 40/40 [00:05<00:00, 7.61it/s, est. speed input: 1522.47 toks/s, output: 3130.34 toks/s]
[T3] Speech Token Generation time: 5.28s
[S3Gen] Wavform Generation time: 30.40s
Adding requests: 100%|████| 40/40 [00:00<00:00, 1801.83it/s]
Processed prompts: 100%|████| 40/40 [00:05<00:00, 7.65it/s, est. speed input: 1326.87 toks/s, output: 2912.80 toks/s]
[T3] Speech Token Generation time: 5.25s
[S3Gen] Wavform Generation time: 28.37s
Adding requests: 100%|████| 34/34 [00:00<00:00, 1780.35it/s]
Processed prompts: 100%|████| 34/34 [00:04<00:00, 7.09it/s, est. speed input: 1274.34 toks/s, output: 2582.66 toks/s]
[T3] Speech Token Generation time: 4.82s
[S3Gen] Wavform Generation time: 23.74s
[BENCHMARK] Generation completed in 132.7742235660553 seconds
[BENCHMARK] Audio saved to benchmark.mp3
[BENCHMARK] Total time: 144.99638843536377 seconds
real 2m30.700s
user 2m54.372s
sys 0m2.205s
System Specs:
- RTX 3060ti: 8GB VRAM
- Intel i7-7700K @ 4.20GHz
- 32GB DDR4 2133 MT/s
Settings & Results:
- Input text: `docs/benchmark-text-1.txt` (6.6k words)
- Input audio: `docs/audio-sample-03.mp3`
- Exaggeration: 0.5, CFG: 0.5, Temperature: 0.8
- CUDA graphs disabled, vLLM max memory utilization=0.6
- Generated output length: 40m15s
- Wall time: 4m26s
- Generation time (without model startup time): 238s
- Time spent in T3 Llama token generation: 36.4s
- Time spent in S3Gen waveform generation: 201s
Logs:
[BENCHMARK] Text chunked into 154 chunks.
INFO [config.py:1472] Using max model len 1200
INFO [default_loader.py:272] Loading weights took 0.39 seconds
INFO [gpu_model_runner.py:1801] Model loading took 1.0107 GiB and 0.497231 seconds
INFO [gpu_model_runner.py:2238] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 241 conditionals items of the maximum feature size.
INFO [gpu_worker.py:232] Available KV cache memory: 3.07 GiB
INFO [kv_cache_utils.py:716] GPU KV cache size: 26,816 tokens
INFO [kv_cache_utils.py:720] Maximum concurrency for 1,200 tokens per request: 22.35x
Adding requests: 100%|████| 40/40 [00:00<00:00, 947.42it/s]
Processed prompts: 100%|████| 40/40 [00:09<00:00, 4.15it/s, est. speed input: 799.18 toks/s, output: 1654.94 toks/s]
[T3] Speech Token Generation time: 9.68s
[S3Gen] Wavform Generation time: 53.66s
Adding requests: 100%|████| 40/40 [00:00<00:00, 858.75it/s]
Processed prompts: 100%|████| 40/40 [00:08<00:00, 4.69it/s, est. speed input: 938.19 toks/s, output: 1874.97 toks/s]
[T3] Speech Token Generation time: 8.58s
[S3Gen] Wavform Generation time: 53.86s
Adding requests: 100%|████| 40/40 [00:00<00:00, 815.60it/s]
Processed prompts: 100%|████| 40/40 [00:09<00:00, 4.19it/s, est. speed input: 726.62 toks/s, output: 1531.24 toks/s]
[T3] Speech Token Generation time: 9.60s
[S3Gen] Wavform Generation time: 49.89s
Adding requests: 100%|████| 34/34 [00:00<00:00, 938.61it/s]
Processed prompts: 100%|████| 34/34 [00:08<00:00, 3.98it/s, est. speed input: 714.68 toks/s, output: 1439.42 toks/s]
[T3] Speech Token Generation time: 8.59s
[S3Gen] Wavform Generation time: 43.58s
[BENCHMARK] Generation completed in 238.42230987548828 seconds
[BENCHMARK] Audio saved to benchmark.mp3
[BENCHMARK] Total time: 259.1808190345764 seconds
real 4m26.803s
user 4m42.393s
sys 0m4.285s
I could not find an official explanation of the Chatterbox architecture, so the following is my best understanding based on the codebase. Chatterbox broadly follows the CosyVoice architecture, applying intermediate-fusion multimodal conditioning to a 0.5B parameter Llama model.
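As a rough mental model (the names, shapes, and projection layer below are illustrative assumptions, not the project's actual code), "intermediate fusion" here means the conditioning is spliced directly into the embedding sequence consumed by the Llama backbone, rather than being fused at the raw-text input or at the output:

```python
import torch
import torch.nn as nn

def build_t3_input_embeds(
    speaker_embed: torch.Tensor,   # (1, d) conditioning vector derived from the reference audio
    exaggeration: torch.Tensor,    # (1, 1) scalar exaggeration control
    text_embeds: torch.Tensor,     # (T_text, d) embedded text tokens
    speech_embeds: torch.Tensor,   # (T_speech, d) embedded speech tokens generated so far
    emotion_proj: nn.Linear,       # hypothetical 1 -> d projection for the exaggeration scalar
) -> torch.Tensor:
    # The backbone attends over one fused sequence: [conditioning | text | speech].
    cond = torch.cat([speaker_embed, emotion_proj(exaggeration)], dim=0)  # (2, d)
    return torch.cat([cond, text_embeds, speech_embeds], dim=0)
```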
vLLM does not support CFG natively, so substantial hacks were needed to make it work. At a high level, we trick vLLM into thinking the model's hidden dimension is double its actual size, then split and restack the hidden states to invoke Llama with double the original batch size. This does pose a risk that vLLM will underestimate the memory requirements of the model - more research is needed into whether vLLM's initial profiling pass will capture this nuance.
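A simplified sketch of the idea (shapes and names are illustrative; the real code has to thread this through vLLM's model-runner plumbing, and the guidance formula shown is one common classifier-free guidance formulation rather than a claim about the exact one used):

```python
import torch

def doubled_hidden_forward(llama_forward, hidden_2d: torch.Tensor) -> torch.Tensor:
    """vLLM sees a model with hidden size 2*d, so each position carries a
    conditional and an unconditional state side by side. Split them, run the
    real d-wide Llama at 2x the batch size, then re-pack into the doubled layout."""
    n, two_d = hidden_2d.shape
    d = two_d // 2
    cond, uncond = hidden_2d[:, :d], hidden_2d[:, d:]       # (N, d) each
    out = llama_forward(torch.cat([cond, uncond], dim=0))   # (2N, d)
    return torch.cat([out[:n], out[n:]], dim=-1)             # back to (N, 2*d)

def apply_cfg(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, scale: float) -> torch.Tensor:
    # Guidance applied at sampling time: push the conditional distribution
    # away from the unconditional one by the configured scale.
    return uncond_logits + scale * (cond_logits - uncond_logits)
```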