The real-time voice conversion component of Seed-VC, separated out for easier deployment and use.
This project provides a standalone implementation of the real-time voice conversion functionality from Seed-VC, supporting zero-shot voice conversion with low latency.
- Real-time Voice Conversion: Capable of cloning voices with 1~30 seconds of reference speech.
- Low Latency: Algorithm delay of ~300ms and device-side delay of ~100ms.
- Zero-shot: No training required for new voices.
- OS: Windows or Linux.
- GPU: NVIDIA GPU with CUDA support is strongly recommended for real-time performance.
- Python: 3.10
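The requirements above can be checked before installing anything. The following is a minimal sketch, not part of the project: the function name `check_environment` is hypothetical, and it assumes PyTorch is the framework pulled in by `requirements.txt`, so the CUDA check is skipped when `torch` is not installed yet.

```python
import platform
import sys


def check_environment():
    """Compare this machine against the listed requirements (hypothetical helper)."""
    report = {
        "os": platform.system(),
        "os_supported": platform.system() in ("Windows", "Linux"),
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "python_matches": sys.version_info[:2] == (3, 10),
        # None means PyTorch is not installed, so CUDA could not be probed.
        "cuda_available": None,
    }
    try:
        import torch  # assumption: installed via requirements.txt

        report["cuda_available"] = torch.cuda.is_available()
    except ImportError:
        pass
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Running the script once before and once after `pip install -r requirements.txt` shows whether the CUDA build was picked up.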
- Create and activate a virtual environment:

      # Windows
      python -m venv venv
      venv\Scripts\activate

      # Linux
      python3 -m venv venv
      source venv/bin/activate
- Install dependencies:

  Note: The default `requirements.txt` includes CUDA 12.1 support. If you have a different CUDA version, please edit `requirements.txt` accordingly.

      pip install -r requirements.txt
- Run the GUI:

      python real-time-gui.py --checkpoint-path <path-to-checkpoint> --config-path <path-to-config>

  - `checkpoint` is the path to the model checkpoint if you have trained or fine-tuned your own model; leave it blank to auto-download the default model from Hugging Face (seed-uvit-tat-xlsr-tiny).
  - `config` is the path to the model config if you have trained or fine-tuned your own model; leave it blank to auto-download the default config from Hugging Face.
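For scripted launches, the command line can be assembled so that the optional paths are passed only when they are set. This is a sketch under one assumption: that "leaving blank" means omitting the flag entirely. The helper name `build_gui_command` is hypothetical; only the script name and the two flags come from the command shown above.

```python
import sys


def build_gui_command(checkpoint_path=None, config_path=None, script="real-time-gui.py"):
    """Build the argv list for the GUI (hypothetical helper).

    Assumption: an unset path is expressed by omitting its flag, which
    should let the script fall back to the auto-downloaded defaults.
    """
    cmd = [sys.executable, script]
    if checkpoint_path:
        cmd += ["--checkpoint-path", str(checkpoint_path)]
    if config_path:
        cmd += ["--config-path", str(config_path)]
    return cmd


if __name__ == "__main__":
    # Print instead of executing, so the sketch has no side effects.
    print(" ".join(build_gui_command()))
```

To actually start the GUI from Python, the returned list can be passed to `subprocess.run(cmd, check=True)` inside the activated virtual environment.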
It is strongly recommended to use a GPU for real-time voice conversion. Below are some benchmark results on an NVIDIA RTX 3060 Laptop GPU:
| Model Configuration | Diffusion Steps | Inference CFG Rate | Max Prompt Length | Block Time (s) | Latency (ms) | Inference Time (ms) |
|---|---|---|---|---|---|---|
| seed-uvit-xlsr-tiny | 10 | 0.7 | 3.0 | 0.18 | 430 | 150 |
- Diffusion Steps: 4~10 recommended for fastest real-time inference.
- Block Time: The length of each audio chunk. Must be greater than inference time per block.
- Extra context: Increasing this improves stability but adds latency.
- Virtual Cable: Use VB-CABLE to route GUI output to a virtual microphone for use in other applications (Discord, Zoom, etc.).
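The Block Time rule above (each block must take longer to play than to process) can be turned into a quick sanity check when tuning settings. The sketch below is illustrative only: the function names are hypothetical, and the sample numbers are the benchmark values from the table (0.18 s block time, 150 ms inference time).

```python
def realtime_headroom_ms(block_time_s, inference_time_ms):
    """Spare time per block in milliseconds; negative means audio will fall behind."""
    return block_time_s * 1000.0 - inference_time_ms


def realtime_factor(block_time_s, inference_time_ms):
    """Fraction of each block spent on inference; must stay below 1.0."""
    return inference_time_ms / (block_time_s * 1000.0)


def is_realtime(block_time_s, inference_time_ms):
    """True when the block time exceeds the per-block inference time."""
    return realtime_headroom_ms(block_time_s, inference_time_ms) > 0


if __name__ == "__main__":
    block_s, infer_ms = 0.18, 150.0  # benchmark row from the table
    print(f"real-time: {is_realtime(block_s, infer_ms)}")
    print(f"headroom: {realtime_headroom_ms(block_s, infer_ms):.0f} ms")
    print(f"real-time factor: {realtime_factor(block_s, infer_ms):.2f}")
```

If the headroom is small or negative on your hardware, reducing Diffusion Steps or increasing Block Time are the two levers the tips above point to.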
This project is a separated version of Seed-VC. All credit goes to the original authors.
- Seed-VC - Original Project
- Amphion for providing computational resources and inspiration!
- Vevo for the theoretical foundation of the V2 model
- MegaTTS3 for the multi-condition CFG inference implemented in the V2 model
- ASTRAL-quantization for the amazing speaker-disentangled speech tokenizer used by the V2 model
- RVC for laying the foundation of real-time voice conversion
- SEED-TTS for the initial idea