Commit 3f719fe

Added exhaustive evaluation results and comparisons with RVCv2 for singing voice conversion
1 parent 831bfb6 commit 3f719fe

1 file changed

+73
-2
lines changed

README.md

Lines changed: 73 additions & 2 deletions
@@ -9,6 +9,7 @@ To find a list of demos and comparisons with previous voice conversion models, p
99
We are continuously improving the model quality and adding more features.
1010

1111
## Evaluation📊
12+
### Zero-shot voice conversion🎙
1213
We have performed a series of objective evaluations of Seed-VC's voice conversion capabilities.
1314
For ease of reproduction, source audios are 100 random utterances from LibriTTS-test-clean, and reference audios are 12 randomly picked in-the-wild voices with unique characteristics. <br>
1415

@@ -44,7 +45,7 @@ However, this may vary a lot depending on the SoVITS model quality. PR or Issue
4445
(Matikane Tannhauser model from [zomehwh/sovits-tannhauser](https://huggingface.co/spaces/zomehwh/sovits-tannhauser))
4546
(Milky Green model from [sparanoid/milky-green-sovits-4](https://huggingface.co/spaces/sparanoid/milky-green-sovits-4))
4647

47-
*ASR result computed by [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft) model*
48+
*English ASR result computed by [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft) model*
4849
*Speaker embedding computed by [resemblyzer](https://github.com/resemble-ai/Resemblyzer) model* <br>
4950

5051
You can reproduce the evaluation by running the `eval.py` script.
@@ -62,6 +63,73 @@ python eval.py
6263
```
6364
Before that, make sure the OpenVoice and CosyVoice repos are correctly installed at `../OpenVoice/` and `../CosyVoice/` if you would like to run the baseline evaluation.
6465

66+
### Zero-shot singing voice conversion🎤🎶
67+
68+
Additional singing voice conversion evaluation is done on the [M4Singer](https://github.com/M4Singer/M4Singer) dataset, with 4 target speakers whose audio data is available [here](https://huggingface.co/datasets/XzJosh/audiodataset).
Speaker similarity is calculated by averaging the cosine similarities between the conversion result and all available samples in the respective character dataset.
For each character, one random utterance is chosen as the prompt for zero-shot inference. For comparison, we trained a separate [RVCv2-f0-48k](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) model for each character as a baseline.
100 random utterances for each singer type are used as source audio.
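
The speaker-similarity averaging described above can be sketched as follows. This is a minimal illustration in plain Python, not the code in `eval.py`: the function names are hypothetical, and the toy 2-D vectors stand in for real speaker embeddings (which the README says come from resemblyzer).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_cosine_similarity(converted_emb, reference_embs):
    """Average similarity between one converted-audio embedding and
    every reference embedding of the target character (hypothetical helper)."""
    sims = [cosine_similarity(converted_emb, ref) for ref in reference_embs]
    return sum(sims) / len(sims)


# Toy embeddings: one reference matches exactly, the other is orthogonal.
converted = [1.0, 0.0]
references = [[1.0, 0.0], [0.0, 1.0]]
score = mean_cosine_similarity(converted, references)
```

Averaging over all available samples of a character, rather than comparing against the single prompt utterance, makes the score less sensitive to which utterance happened to be picked as the prompt.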
72+
73+
| Models\Metrics | F0CORR↑ | F0RMSE↓ | SECS↑      | CER↓      | SIG↑     | BAK↑     | OVRL↑    |
|----------------|---------|---------|------------|-----------|----------|----------|----------|
| RVCv2          | 0.9404  | 30.43   | 0.7264     | 28.46     | **3.41** | **4.05** | **3.12** |
| Seed-VC(Ours)  | 0.9375  | 33.35   | **0.7405** | **19.70** | 3.39     | 3.96     | 3.06     |
77+
78+
<details>
79+
<summary>Click to expand detailed evaluation results</summary>
80+
81+
| Source Singer Type | Characters  | Models\Metrics | F0CORR↑ | F0RMSE↓ | SECS↑      | CER↓      | SIG↑     | BAK↑     | OVRL↑    |
|--------------------|-------------|----------------|---------|---------|------------|-----------|----------|----------|----------|
| Alto (Female)      | ~           | Ground Truth   | 1.0000  | 0.00    | ~          | 8.16      | ~        | ~        | ~        |
|                    | Azuma       | RVCv2          | 0.9617  | 33.03   | **0.7352** | 24.70     | 3.36     | 4.07     | 3.07     |
|                    |             | Seed-VC(Ours)  | 0.9658  | 31.64   | 0.7341     | **15.23** | 3.37     | 4.02     | 3.07     |
|                    | Diana       | RVCv2          | 0.9626  | 32.56   | 0.7212     | 19.67     | 3.45     | 4.08     | **3.17** |
|                    |             | Seed-VC(Ours)  | 0.9648  | 31.94   | **0.7457** | **16.81** | 3.49     | 3.99     | 3.15     |
|                    | Ding Zhen   | RVCv2          | 0.9013  | 26.72   | 0.7221     | 18.53     | 3.37     | 4.03     | 3.06     |
|                    |             | Seed-VC(Ours)  | 0.9356  | 21.87   | **0.7513** | **15.63** | 3.44     | 3.94     | **3.09** |
|                    | Kobe Bryant | RVCv2          | 0.9215  | 23.90   | 0.7495     | 37.23     | 3.49     | 4.06     | **3.21** |
|                    |             | Seed-VC(Ours)  | 0.9248  | 23.40   | **0.7602** | **26.98** | 3.43     | 4.02     | 3.13     |
| Bass (Male)        | ~           | Ground Truth   | 1.0000  | 0.00    | ~          | 8.62      | ~        | ~        | ~        |
|                    | Azuma       | RVCv2          | 0.9288  | 32.62   | **0.7148** | 24.88     | 3.45     | 4.10     | **3.18** |
|                    |             | Seed-VC(Ours)  | 0.9383  | 31.57   | 0.6960     | **10.31** | 3.45     | 4.03     | 3.15     |
|                    | Diana       | RVCv2          | 0.9403  | 30.00   | 0.7010     | 14.54     | 3.53     | 4.15     | **3.27** |
|                    |             | Seed-VC(Ours)  | 0.9428  | 30.06   | **0.7299** | **9.66**  | 3.53     | 4.11     | 3.25     |
|                    | Ding Zhen   | RVCv2          | 0.9061  | 19.53   | 0.6922     | 25.99     | 3.36     | 4.09     | **3.08** |
|                    |             | Seed-VC(Ours)  | 0.9169  | 18.15   | **0.7260** | **14.13** | 3.38     | 3.98     | 3.07     |
|                    | Kobe Bryant | RVCv2          | 0.9302  | 16.37   | 0.7717     | 41.04     | 3.51     | 4.13     | **3.25** |
|                    |             | Seed-VC(Ours)  | 0.9176  | 17.93   | **0.7798** | **24.23** | 3.42     | 4.08     | 3.17     |
| Soprano (Female)   | ~           | Ground Truth   | 1.0000  | 0.00    | ~          | 27.92     | ~        | ~        | ~        |
|                    | Azuma       | RVCv2          | 0.9742  | 47.80   | 0.7104     | 38.70     | 3.14     | 3.85     | **2.83** |
|                    |             | Seed-VC(Ours)  | 0.9521  | 64.00   | **0.7177** | **33.10** | 3.15     | 3.86     | 2.81     |
|                    | Diana       | RVCv2          | 0.9754  | 46.59   | **0.7319** | 32.36     | 3.14     | 3.85     | **2.83** |
|                    |             | Seed-VC(Ours)  | 0.9573  | 59.70   | 0.7317     | **30.57** | 3.11     | 3.78     | 2.74     |
|                    | Ding Zhen   | RVCv2          | 0.9543  | 31.45   | 0.6792     | 40.80     | 3.41     | 4.08     | **3.14** |
|                    |             | Seed-VC(Ours)  | 0.9486  | 33.37   | **0.6979** | **34.45** | 3.41     | 3.97     | 3.10     |
|                    | Kobe Bryant | RVCv2          | 0.9691  | 25.50   | 0.6276     | 61.59     | 3.43     | 4.04     | **3.15** |
|                    |             | Seed-VC(Ours)  | 0.9496  | 32.76   | **0.6683** | **39.82** | 3.32     | 3.98     | 3.04     |
| Tenor (Male)       | ~           | Ground Truth   | 1.0000  | 0.00    | ~          | 5.94      | ~        | ~        | ~        |
|                    | Azuma       | RVCv2          | 0.9333  | 42.09   | **0.7832** | 16.66     | 3.46     | 4.07     | **3.18** |
|                    |             | Seed-VC(Ours)  | 0.9162  | 48.06   | 0.7697     | **8.48**  | 3.38     | 3.89     | 3.01     |
|                    | Diana       | RVCv2          | 0.9467  | 36.65   | 0.7729     | 15.28     | 3.53     | 4.08     | **3.24** |
|                    |             | Seed-VC(Ours)  | 0.9360  | 41.49   | **0.7920** | **8.55**  | 3.49     | 3.93     | 3.13     |
|                    | Ding Zhen   | RVCv2          | 0.9197  | 22.82   | 0.7591     | **12.92** | 3.40     | 4.02     | **3.09** |
|                    |             | Seed-VC(Ours)  | 0.9247  | 22.77   | **0.7721** | 13.95     | 3.45     | 3.82     | 3.05     |
|                    | Kobe Bryant | RVCv2          | 0.9415  | 19.33   | 0.7507     | 30.52     | 3.48     | 4.02     | **3.19** |
|                    |             | Seed-VC(Ours)  | 0.9082  | 24.86   | **0.7764** | **13.35** | 3.39     | 3.93     | 3.07     |
119+
</details>
120+
121+
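
The README does not spell out how F0CORR and F0RMSE are computed. The Ground Truth rows (1.0000 and 0.00) suggest that the converted audio's F0 contour is compared against the source audio's. A minimal sketch under that assumption, using Pearson correlation and root-mean-square error over frames voiced in both tracks, might look like this. The function name, the convention that 0 marks an unvoiced frame, and the unit of the F0 values are all assumptions, and the sketch does not model any pitch shift applied before conversion.

```python
import math


def f0_metrics(f0_source, f0_converted):
    """Return (Pearson correlation, RMSE) over frames voiced in both tracks.

    Assumes frame-aligned contours in which 0 marks an unvoiced frame.
    """
    pairs = [(s, c) for s, c in zip(f0_source, f0_converted) if s > 0 and c > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two frames voiced in both tracks")

    n = len(pairs)
    mean_s = sum(s for s, _ in pairs) / n
    mean_c = sum(c for _, c in pairs) / n

    cov = sum((s - mean_s) * (c - mean_c) for s, c in pairs)
    var_s = sum((s - mean_s) ** 2 for s, _ in pairs)
    var_c = sum((c - mean_c) ** 2 for _, c in pairs)

    corr = cov / math.sqrt(var_s * var_c)
    rmse = math.sqrt(sum((s - c) ** 2 for s, c in pairs) / n)
    return corr, rmse


# Toy contours: the converted track follows the source with a constant offset of 5.
source_f0 = [220.0, 0.0, 230.0, 240.0]
converted_f0 = [225.0, 0.0, 235.0, 245.0]
corr, rmse = f0_metrics(source_f0, converted_f0)
```

In this toy case the correlation is 1 while the RMSE is 5, which shows why the two metrics are reported together: correlation captures whether the melody contour is preserved, and RMSE captures absolute deviation.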
122+
Although Seed-VC is not trained on the target speakers and only one random utterance is used as the prompt, on average it still outperforms the speaker-specific RVCv2 models
in terms of speaker similarity (SECS) and intelligibility (CER), as it does in most of the per-character comparisons, which demonstrates the superior voice cloning capability and robustness of Seed-VC.

However, Seed-VC's audio quality (DNSMOS) is slightly lower than RVCv2's. We take this drawback seriously and
will give high priority to improving audio quality in the future.
PRs or issues are welcome if you find this comparison unfair or inaccurate.
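
For readers unfamiliar with the intelligibility metric, CER is conventionally the character-level edit distance between the ASR transcript and the reference text, divided by the reference length. The sketch below illustrates that standard definition; it is not taken from `eval.py`, the function names are hypothetical, and reporting the value as a percentage is an assumption based on the magnitudes in the tables.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (character level)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_percent(reference, hypothesis):
    """Character error rate as a percentage of the reference length."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of four.
cer = cer_percent("abcd", "abxd")
```

Note that the Ground Truth rows have non-zero CER as well, so part of the reported error comes from the ASR model's difficulty with singing rather than from the conversion itself.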
128+
129+
*Chinese ASR result computed by [SenseVoiceSmall](https://github.com/FunAudioLLM/SenseVoice)*
*Speaker embedding computed by [resemblyzer](https://github.com/resemble-ai/Resemblyzer) model*
*We set a +12 semitone pitch shift for male-to-female conversion and a -12 semitone shift for female-to-male conversion; otherwise no pitch shift is applied*
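
To make the pitch-shift rule concrete: in equal temperament a shift of n semitones multiplies frequency by 2^(n/12), so ±12 semitones is exactly one octave up or down. The sketch below encodes the rule stated above. The function names and the `"male"`/`"female"` labels are hypothetical and are not the project's actual interface.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


def pitch_shift_semitones(source_gender, target_gender):
    """Shift used in the evaluation, as described in the note above."""
    if source_gender == "male" and target_gender == "female":
        return 12
    if source_gender == "female" and target_gender == "male":
        return -12
    return 0


# A 220 Hz note from a male source converted to a female target.
shift = pitch_shift_semitones("male", "female")
shifted_f0 = 220.0 * semitones_to_ratio(shift)
```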
132+
65133
## Installation📥
66134
Python 3.10 on Windows or Linux is suggested.
67135
```bash
@@ -114,10 +182,13 @@ Then open the browser and go to `http://localhost:7860/` to use the web interfac
114182
- [ ] Code for training on custom data
115183
- [x] Changed to BigVGAN from NVIDIA for singing voice decoding
116184
- [x] Whisper version model for singing voice conversion
117-
- [ ] Objective evaluation and comparison with RVC/SoVITS for singing voice conversion
185+
- [x] Objective evaluation and comparison with RVC/SoVITS for singing voice conversion
186+
- [ ] Improved audio quality
118187
- [ ] More to be added
119188

120189
## CHANGELOGS🗒️
190+
- 2024-10-25:
191+
- Added exhaustive evaluation results and comparisons with RVCv2 for singing voice conversion
121192
- 2024-10-24:
122193
- Updated 44kHz singing voice conversion model, with OpenAI Whisper as speech content input
123194
- 2024-10-07:
