
Commit fc97394

update V2 model
1 parent 8695971 commit fc97394


60 files changed: +5133 -5660 lines

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -13,6 +13,7 @@ __pycache__/
 
 # IDE
 .vscode/
+.idea/
 
 # misc
 checkpoints/
@@ -21,3 +22,7 @@ reconstructed/
 .python-version
 ruff.log
 /configs/inuse/
+runs/
+/garbages/
+/flagged/
+/experimental/

README.md

Lines changed: 56 additions & 11 deletions
@@ -30,14 +30,14 @@ pip install -r requirements-mac.txt
 ```
 
 ## Usage🛠️
-We have released 3 models for different purposes:
+We have released 4 models for different purposes:
 
 | Version | Name | Purpose | Sampling Rate | Content Encoder | Vocoder | Hidden Dim | N Layers | Params | Remarks |
 |---------|------|---------|---------------|-----------------|---------|------------|----------|--------|---------|
 | v1.0 | seed-uvit-tat-xlsr-tiny ([🤗](https://huggingface.co/Plachta/Seed-VC/blob/main/DiT_uvit_tat_xlsr_ema.pth)[📄](configs/presets/config_dit_mel_seed_uvit_xlsr_tiny.yml)) | Voice Conversion (VC) | 22050 | XLSR-large | HIFT | 384 | 9 | 25M | suitable for real-time voice conversion |
 | v1.0 | seed-uvit-whisper-small-wavenet ([🤗](https://huggingface.co/Plachta/Seed-VC/blob/main/DiT_seed_v2_uvit_whisper_small_wavenet_bigvgan_pruned.pth)[📄](configs/presets/config_dit_mel_seed_uvit_whisper_small_wavenet.yml)) | Voice Conversion (VC) | 22050 | Whisper-small | BigVGAN | 512 | 13 | 98M | suitable for offline voice conversion |
 | v1.0 | seed-uvit-whisper-base ([🤗](https://huggingface.co/Plachta/Seed-VC/blob/main/DiT_seed_v2_uvit_whisper_base_f0_44k_bigvgan_pruned_ft_ema.pth)[📄](configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml)) | Singing Voice Conversion (SVC) | 44100 | Whisper-small | BigVGAN | 768 | 17 | 200M | strong zero-shot performance, singing voice conversion |
+| v2.0 | hubert-bsqvae-small ([🤗](https://huggingface.co/Plachta/Seed-VC/blob/main/v2)[📄](configs/v2/vc_wrapper.yaml)) | Voice & Accent Conversion (VC) | 22050 | [ASTRAL-Quantization](https://github.com/Plachtaa/ASTRAL-quantization) | BigVGAN | 512 | 13 | 67M | best at suppressing source speaker traits |
 Checkpoints of the latest model release are downloaded automatically when inference is first run.
 If you are unable to access huggingface for network reasons, try using the mirror by adding `HF_ENDPOINT=https://hf-mirror.com` before every command.
 
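For instance, the mirror workaround above can be prepended to any command in this commit; the audio file paths in this sketch are placeholders:

```bash
# route the automatic checkpoint download through hf-mirror for this one command
HF_ENDPOINT=https://hf-mirror.com python inference_v2.py --source source.wav --target reference.wav --output outputs/
```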

@@ -70,6 +70,25 @@ where:
 - `config` is the path to the model config if you have trained or fine-tuned your own model; leave blank to auto-download the default config from huggingface
 - `fp16` is the flag to use float16 inference, default is True
 
+Similarly, to use the V2 model, you can run:
+```bash
+python inference_v2.py --source <source-wav>
+--target <reference-wav>
+--output <output-dir>
+--diffusion-steps 25 # recommended 30~50 for singing voice conversion
+--length-adjust 1.0 # same as V1
+--intelligibility-cfg-rate 0.7 # controls how clear the output linguistic content is, recommended 0.0~1.0
+--similarity-cfg-rate 0.7 # controls how similar the output voice is to the reference voice, recommended 0.0~1.0
+--convert-style true # whether to use the AR model for accent & emotion conversion; setting this to false performs only timbre conversion, as in V1
+--anonymization-only false # set to true to ignore the reference audio and only anonymize the source speech to an "average voice"
+--top-p 0.9 # controls the diversity of the AR model output, recommended 0.5~1.0
+--temperature 1.0 # controls the randomness of the AR model output, recommended 0.7~1.2
+--repetition-penalty 1.0 # penalizes repetition in the AR model output, recommended 1.0~1.5
+--cfm-checkpoint-path <path-to-cfm-checkpoint> # path to the CFM model checkpoint; leave blank to auto-download the default model from huggingface
+--ar-checkpoint-path <path-to-ar-checkpoint> # path to the AR model checkpoint; leave blank to auto-download the default model from huggingface
+```
+
 Voice Conversion Web UI:
 ```bash
 python app_vc.py --checkpoint <path-to-checkpoint> --config <path-to-config> --fp16 True
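As a worked example of the V2 command above, here is one plausible invocation on concrete files; the paths are placeholders and the flag values mirror the recommended defaults from the listing:

```bash
# convert the voice in source.wav toward the timbre (and, via --convert-style,
# the accent/emotion) of reference.wav; backslashes wrap the long command line
python inference_v2.py --source examples/source.wav --target examples/reference.wav --output outputs/ \
    --diffusion-steps 25 --length-adjust 1.0 \
    --intelligibility-cfg-rate 0.7 --similarity-cfg-rate 0.7 \
    --convert-style true --top-p 0.9 --temperature 1.0 --repetition-penalty 1.0
```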
@@ -86,11 +105,18 @@ python app_svc.py --checkpoint <path-to-checkpoint> --config <path-to-config> --
 - `checkpoint` is the path to the model checkpoint if you have trained or fine-tuned your own model; leave blank to auto-download the default model from huggingface (`seed-uvit-whisper-base`)
 - `config` is the path to the model config if you have trained or fine-tuned your own model; leave blank to auto-download the default config from huggingface
 
+V2 model Web UI:
+```bash
+python app_vc_v2.py --cfm-checkpoint-path <path-to-cfm-checkpoint> --ar-checkpoint-path <path-to-ar-checkpoint>
+```
+- `cfm-checkpoint-path` is the path to the CFM model checkpoint; leave blank to auto-download the default model from huggingface
+- `ar-checkpoint-path` is the path to the AR model checkpoint; leave blank to auto-download the default model from huggingface
 Integrated Web UI:
 ```bash
-python app.py
+python app.py --enable-v1 --enable-v2
 ```
 This will only load pretrained models for zero-shot inference. To use custom checkpoints, please run `app_vc.py` or `app_svc.py` as above.
+If you have limited memory, remove `--enable-v2` or `--enable-v1` to load only one of the model sets.
 
 Real-time voice conversion GUI:
 ```bash
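For instance, following the memory note above, a machine with limited memory might launch the integrated UI with only the V2 model set:

```bash
# omit --enable-v1 so only the V2 checkpoints are downloaded and loaded
python app.py --enable-v2
```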
@@ -162,6 +188,19 @@ where:
 - `save-every` is the number of steps between model checkpoint saves
 - `num-workers` is the number of workers for data loading; set to 0 for Windows
 
+Similarly, to train the V2 model, you can run (note that the V2 training script supports multi-GPU training):
+```bash
+accelerate launch train_v2.py
+--dataset-dir <path-to-data>
+--run-name <run-name>
+--batch-size 2
+--max-steps 1000
+--max-epochs 1000
+--save-every 500
+--num-workers 0
+--train-cfm
+```
+
 4. If training accidentally stops, you can resume it by running the same command again; training will continue from the last checkpoint. (Make sure the `run-name` and `config` arguments are the same so that the latest checkpoint can be found.)
 
 5. After training, you can use the trained model for inference by specifying the path to the checkpoint and config file.
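A concrete sketch of the V2 training command from step 3 above; the dataset path and run name are placeholders, and the remaining values mirror the listed example:

```bash
# re-running this same command after an interruption resumes from the latest
# checkpoint, as described in step 4
accelerate launch train_v2.py --dataset-dir ./data/my-dataset --run-name my-v2-run \
    --batch-size 2 --max-steps 1000 --max-epochs 1000 \
    --save-every 500 --num-workers 0 --train-cfm
```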
@@ -191,15 +230,18 @@ where:
 - [ ] NSF vocoder for better singing voice conversion
 - [x] Fix real-time voice conversion artifact while not talking (done by adding a VAD model)
 - [x] Colab Notebook for fine-tuning example
-- [ ] Replace whisper with more advanced linguistic content extractor
+- [x] Replace whisper with more advanced linguistic content extractor
 - [ ] More to be added
 - [x] Add Apple Silicon support
+- [ ] Release paper, evaluations and demo page for V2 model
 
 ## Known Issues
 - On Mac, running `real-time-gui.py` might raise the error `ModuleNotFoundError: No module named '_tkinter'`; in this case a new Python version **with Tkinter support** should be installed. Refer to [this guide on Stack Overflow](https://stackoverflow.com/questions/76105218/why-does-tkinter-or-turtle-seem-to-be-missing-or-broken-shouldnt-it-be-part) for an explanation of the problem and a detailed fix.
 
 
 ## CHANGELOGS🗒️
+- 2025-04-16:
+    - Released V2 model for voice and accent conversion, with better anonymization of the source speaker
 - 2025-03-03:
     - Added Mac M Series (Apple Silicon) support
 - 2024-11-26:
@@ -231,5 +273,8 @@ where:
 
 ## Acknowledgements🙏
 - [Amphion](https://github.com/open-mmlab/Amphion) for providing computational resources and inspiration!
+- [Vevo](https://github.com/open-mmlab/Amphion/tree/main/models/vc/vevo) for the theoretical foundation of the V2 model
+- [MegaTTS3](https://github.com/bytedance/MegaTTS3) for the multi-condition CFG inference implemented in the V2 model
+- [ASTRAL-quantization](https://github.com/Plachtaa/ASTRAL-quantization) for the amazing speaker-disentangled speech tokenizer used by the V2 model
 - [RVC](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) for laying the foundation of real-time voice conversion
 - [SEED-TTS](https://arxiv.org/abs/2406.02430) for the initial idea
