Checkpoints of the latest model release will be downloaded automatically when inference is first run.
If you are unable to access Hugging Face for network reasons, try using a mirror by adding `HF_ENDPOINT=https://hf-mirror.com` before every command.
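For example, a minimal sketch of routing downloads through the mirror by prefixing the V2 inference command described below:
```bash
# Route Hugging Face downloads through the mirror for a single command
HF_ENDPOINT=https://hf-mirror.com python inference_v2.py --source <source-wav> --target <reference-wav> --output <output-dir>
```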
where:
- `config` is the path to the model config if you have trained or fine-tuned your own model; leave blank to auto-download the default config from huggingface
- `fp16` is the flag to use float16 inference, default is `True`
Similarly, to use the V2 model, you can run:
```bash
python inference_v2.py --source <source-wav>
--target <reference-wav>
--output <output-dir>
--diffusion-steps 25 # recommended 30~50 for singing voice conversion
--length-adjust 1.0 # same as V1
--intelligibility-cfg-rate 0.7 # controls how clear the output linguistic content is, recommended 0.0~1.0
--similarity-cfg-rate 0.7 # controls how similar the output voice is to the reference voice, recommended 0.0~1.0
--convert-style true # whether to use the AR model for accent & emotion conversion; set to false to only perform timbre conversion, similar to V1
--anonymization-only false # set to true to ignore the reference audio and only anonymize the source speech to an "average voice"
--top-p 0.9 # controls the diversity of the AR model output, recommended 0.5~1.0
--temperature 1.0 # controls the randomness of the AR model output, recommended 0.7~1.2
--repetition-penalty 1.0 # penalizes repetition in the AR model output, recommended 1.0~1.5
--cfm-checkpoint-path <path-to-cfm-checkpoint> # path to the checkpoint of the CFM model, leave blank to auto-download the default model from huggingface
--ar-checkpoint-path <path-to-ar-checkpoint> # path to the checkpoint of the AR model, leave blank to auto-download the default model from huggingface
```
- `checkpoint` is the path to the model checkpoint if you have trained or fine-tuned your own model; leave blank to auto-download the default model from huggingface (`seed-uvit-whisper-base`)
- `config` is the path to the model config if you have trained or fine-tuned your own model; leave blank to auto-download the default config from huggingface
- `cfm-checkpoint-path` is the path to the checkpoint of the CFM model; leave blank to auto-download the default model from huggingface
- `ar-checkpoint-path` is the path to the checkpoint of the AR model; leave blank to auto-download the default model from huggingface
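For instance, a runnable sketch of V2 inference pointing at locally fine-tuned checkpoints instead of the auto-downloaded ones (all paths are placeholders):
```bash
# Use local CFM and AR checkpoints for V2 inference
python inference_v2.py --source <source-wav> --target <reference-wav> --output <output-dir> \
  --diffusion-steps 30 --convert-style true \
  --cfm-checkpoint-path <path-to-cfm-checkpoint> --ar-checkpoint-path <path-to-ar-checkpoint>
```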
Integrated Web UI:
```bash
python app.py --enable-v1 --enable-v2
```
This will only load pretrained models for zero-shot inference. To use custom checkpoints, please run `app_vc.py` or `app_svc.py` as above.
If you have limited memory, remove `--enable-v2` or `--enable-v1` to only load one of the model sets.
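For example, a minimal launch loading only the V2 model set to save memory, per the note above:
```bash
# Load only the V2 models to reduce memory usage
python app.py --enable-v2
```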
Real-time voice conversion GUI:
```bash
python real-time-gui.py
```
- `save-every` is the number of steps between saving model checkpoints
- `num-workers` is the number of workers for data loading; set to 0 on Windows
Similarly, to train the V2 model, you can run the following (note that the V2 training script supports multi-GPU training):
```bash
accelerate launch train_v2.py
--dataset-dir <path-to-data>
--run-name <run-name>
--batch-size 2
--max-steps 1000
--max-epochs 1000
--save-every 500
--num-workers 0
--train-cfm
```
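Since the note above mentions multi-GPU support, here is a hedged sketch of a two-GPU launch, assuming your `accelerate` environment is already configured (for example via `accelerate config`):
```bash
# Illustrative two-GPU launch of the V2 training script; values are placeholders
accelerate launch --num_processes 2 train_v2.py \
  --dataset-dir <path-to-data> \
  --run-name <run-name> \
  --batch-size 2 \
  --max-steps 1000 \
  --max-epochs 1000 \
  --save-every 500 \
  --num-workers 0 \
  --train-cfm
```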
4. If training accidentally stops, you can resume it by running the same command again; training will continue from the last checkpoint. (Make sure the `run-name` and `config` arguments are the same so that the latest checkpoint can be found.)
5. After training, you can use the trained model for inference by specifying the path to the checkpoint and config file.
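For example, a sketch of serving a fine-tuned model in the Web UI, assuming `app_svc.py` accepts the same `--checkpoint`, `--config`, and `--fp16` flags described earlier:
```bash
# Hypothetical: launch the singing voice conversion Web UI with your fine-tuned model
python app_svc.py --checkpoint <path-to-your-checkpoint> --config <path-to-your-config> --fp16 True
```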
## TODO
- [ ] NSF vocoder for better singing voice conversion
- [x] Fix real-time voice conversion artifact while not talking (done by adding a VAD model)
- [x] Colab Notebook for fine-tuning example
- [x] Replace whisper with more advanced linguistic content extractor
- [ ] More to be added
- [x] Add Apple Silicon support
- [ ] Release paper, evaluations and demo page for V2 model
## Known Issues
- On Mac, running `real-time-gui.py` might raise an error `ModuleNotFoundError: No module named '_tkinter'`; in this case a new Python version **with Tkinter support** should be installed. Refer to [this guide on Stack Overflow](https://stackoverflow.com/questions/76105218/why-does-tkinter-or-turtle-seem-to-be-missing-or-broken-shouldnt-it-be-part) for an explanation of the problem and a detailed fix.
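One possible fix, offered as an assumption rather than taken from the linked guide: if your Python was installed via Homebrew, installing the matching Tk bindings may resolve the missing module, e.g.:
```bash
# Hypothetical fix for a Homebrew-installed Python; match the version to your interpreter
brew install python-tk@3.12
```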
## CHANGELOGS🗒️
- 2025-04-16:
  - Released V2 model for voice and accent conversion, with better anonymization of the source speaker
- 2025-03-03:
  - Added Mac M Series (Apple Silicon) support
- 2024-11-26:
## Acknowledgements🙏
- [Amphion](https://github.com/open-mmlab/Amphion) for providing computational resources and inspiration!
- [Vevo](https://github.com/open-mmlab/Amphion/tree/main/models/vc/vevo) for the theoretical foundation of the V2 model
- [MegaTTS3](https://github.com/bytedance/MegaTTS3) for the multi-condition CFG inference implemented in the V2 model
- [ASTRAL-quantization](https://github.com/Plachtaa/ASTRAL-quantization) for the amazing speaker-disentangled speech tokenizer used by the V2 model
- [RVC](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) for laying the foundation of real-time voice conversion
- [SEED-TTS](https://arxiv.org/abs/2406.02430) for the initial idea