README.md
Lines changed: 73 additions & 2 deletions
@@ -9,6 +9,7 @@ To find a list of demos and comparisons with previous voice conversion models, p
9
9
We are continuously improving the model quality and adding more features.
10
10
11
11
## Evaluation📊
12
+
### Zero-shot voice conversion🎙
12
13
We have performed a series of objective evaluations of Seed-VC's voice conversion capabilities.
13
14
For ease of reproduction, source audios are 100 random utterances from LibriTTS-test-clean, and reference audios are 12 randomly picked in-the-wild voices with unique characteristics. <br>
14
15
@@ -44,7 +45,7 @@ However, this may vary a lot depending on the SoVITS model quality. PR or Issue
44
45
(Matikane Tannhuaser model from [zomehwh/sovits-tannhauser](https://huggingface.co/spaces/zomehwh/sovits-tannhauser))
45
46
(Milky Green model from [sparanoid/milky-green-sovits-4](https://huggingface.co/spaces/sparanoid/milky-green-sovits-4))
46
47
47
-
*ASR result computed by [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft) model*
48
+
*English ASR result computed by [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft) model*
48
49
*Speaker embedding computed by [resemblyzer](https://github.com/resemble-ai/Resemblyzer) model* <br>
49
50
50
51
You can reproduce the evaluation by running the `eval.py` script.
@@ -62,6 +63,73 @@ python eval.py
62
63
```
63
64
Before that, make sure the OpenVoice and CosyVoice repos are correctly installed at `../OpenVoice/` and `../CosyVoice/` if you would like to run the baseline evaluation.
64
65
66
+
### Zero-shot singing voice conversion🎤🎶
67
+
68
+
Additional singing voice conversion evaluation is done on the [M4Singer](https://github.com/M4Singer/M4Singer) dataset, with 4 target speakers whose audio data is available [here](https://huggingface.co/datasets/XzJosh/audiodataset).
69
+
Speaker similarity is calculated by averaging the cosine similarities between the conversion result and all available samples in the respective character dataset.
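The averaging described above can be sketched roughly as follows. This is a minimal illustration and not the code in `eval.py`: the function names `cosine_similarity` and `mean_speaker_similarity` are hypothetical, and the embeddings are assumed to be plain sequences of floats (for example, speaker embeddings already extracted by resemblyzer).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_speaker_similarity(converted_embedding, reference_embeddings):
    """Average similarity between one conversion result and every
    available reference sample of the target character (hypothetical helper)."""
    sims = [cosine_similarity(converted_embedding, ref)
            for ref in reference_embeddings]
    return sum(sims) / len(sims)

# Toy 2-D vectors stand in for real speaker embeddings.
score = mean_speaker_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(score)
```

Averaging over all reference samples, rather than comparing against the single prompt utterance, makes the score less sensitive to which utterance happened to be picked as the prompt.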
70
+
For each character, one random utterance is chosen as the prompt for zero-shot inference. For comparison, we trained a separate [RVCv2-f0-48k](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) model for each character as a baseline.
71
+
100 random utterances for each singer type are used as source audio.
Although Seed-VC is not trained on the target speakers and only one random utterance is used as the prompt, it still consistently outperforms speaker-specific RVCv2 models
123
+
in terms of speaker similarity (SECS) and intelligibility (CER), which demonstrates the superior voice cloning capability and robustness of Seed-VC.
124
+
125
+
However, it is observed that Seed-VC's audio quality (DNSMOS) is slightly lower than RVCv2. We take this drawback seriously and
126
+
will give high priority to improving the audio quality in the future.
127
+
PRs or issues are welcome if you find this comparison unfair or inaccurate.
128
+
129
+
*Chinese ASR result computed by [SenseVoiceSmall](https://github.com/FunAudioLLM/SenseVoice)*
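For readers unfamiliar with the CER metric used for intelligibility, the sketch below shows the standard definition: character-level edit distance between the ASR transcript and the reference text, divided by the reference length. It is an assumption that the evaluation follows this plain definition; any text normalization (such as punctuation removal) applied in `eval.py` is not shown, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

# One substituted character out of four.
print(cer("abcd", "abed"))
```

Working at the character level rather than the word level suits Chinese text, which has no whitespace word boundaries.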
130
+
*Speaker embedding computed by [resemblyzer](https://github.com/resemble-ai/Resemblyzer) model*
131
+
*We set a pitch shift of +12 semitones for male-to-female conversion and -12 semitones for female-to-male conversion, otherwise 0 pitch shift*
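The pitch shift rule above can be written as a small helper. This is only an illustrative sketch: the function names and the `"male"`/`"female"` labels are hypothetical and are not taken from the repository's code. The frequency ratio uses the standard equal-temperament relation, in which 12 semitones correspond to one octave.

```python
def pitch_shift_semitones(source_gender, target_gender):
    """Return the semitone shift for a source/target pair (hypothetical helper)."""
    if source_gender == "male" and target_gender == "female":
        return 12
    if source_gender == "female" and target_gender == "male":
        return -12
    return 0  # same-gender conversion: no pitch shift

def semitones_to_ratio(semitones):
    """Frequency multiplier for a shift in semitones (12 semitones = 1 octave)."""
    return 2.0 ** (semitones / 12.0)

shift = pitch_shift_semitones("male", "female")
print(shift, semitones_to_ratio(shift))
```

A shift of +12 semitones doubles the fundamental frequency, and -12 halves it.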
132
+
65
133
## Installation📥
66
134
Python 3.10 on Windows or Linux is recommended.
67
135
```bash
@@ -114,10 +182,13 @@ Then open the browser and go to `http://localhost:7860/` to use the web interfac
114
182
-[ ] Code for training on custom data
115
183
-[x] Changed to BigVGAN from NVIDIA for singing voice decoding
116
184
-[x] Whisper version model for singing voice conversion
117
-
-[ ] Objective evaluation and comparison with RVC/SoVITS for singing voice conversion
185
+
-[x] Objective evaluation and comparison with RVC/SoVITS for singing voice conversion
186
+
-[ ] Improved audio quality
118
187
-[ ] More to be added
119
188
120
189
## CHANGELOGS🗒️
190
+
- 2024-10-25:
191
+
- Added exhaustive evaluation results and comparisons with RVCv2 for singing voice conversion
121
192
- 2024-10-24:
122
193
- Updated 44kHz singing voice conversion model, with OpenAI Whisper as speech content input