Added exhaustive evaluation results and comparisons with RVCv2 for singing voice conversion

Plachtaa · Plachtaa · commit 3f719fe3de6d · 2024-10-25T21:00:27.000+08:00
diff --git a/README.md b/README.md
@@ -9,6 +9,7 @@ To find a list of demos and comparisons with previous voice conversion models, p
 We are keeping on improving the model quality and adding more features.
 
 ## Evaluation📊
+### Zero-shot voice conversion🎙
 We have performed a series of objective evaluations on our Seed-VC's voice conversion capabilities. 
 For ease of reproduction, source audios are 100 random utterances from LibriTTS-test-clean, and reference audios are 12 randomly picked in-the-wild voices with unique characteristics. <br>  
 
@@ -44,7 +45,7 @@ However, this may vary a lot depending on the SoVITS model quality. PR or Issue
 (Matikane Tannhuaser model from [zomehwh/sovits-tannhauser](https://huggingface.co/spaces/zomehwh/sovits-tannhauser))  
 (Milky Green model from [sparanoid/milky-green-sovits-4](https://huggingface.co/spaces/sparanoid/milky-green-sovits-4))  
 
-*ASR result computed by [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft) model*   
+*English ASR result computed by [facebook/hubert-large-ls960-ft](https://huggingface.co/facebook/hubert-large-ls960-ft) model*  
 *Speaker embedding computed by [resemblyzer](https://github.com/resemble-ai/Resemblyzer) model* <br>  
 
 You can reproduce the evaluation by running `eval.py` script.  
@@ -62,6 +63,73 @@ python eval.py
 ```
 Before that, make sure you have openvoice and cosyvoice repo correctly installed on `../OpenVoice/` and `../CosyVoice/` if you would like to run baseline evaluation.
 
+### Zero-shot singing voice conversion🎤🎶
+
+Additional singing voice conversion evaluation is done on the [M4Singer](https://github.com/M4Singer/M4Singer) dataset, with 4 target speakers whose audio data is available [here](https://huggingface.co/datasets/XzJosh/audiodataset).  
+Speaker similarity is calculated by averaging the cosine similarities between the conversion result and all available samples in the respective character dataset.  
+For each character, one random utterance is chosen as the prompt for zero-shot inference. For comparison, we trained a separate [RVCv2-f0-48k](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI) model for each character as a baseline.  
+100 random utterances for each singer type are used as source audio.
+
+| Models\Metrics | F0CORR↑ | F0RMSE↓ | SECS↑      | CER↓      | SIG↑     | BAK↑     | OVRL↑    |
+|----------------|---------|---------|------------|-----------|----------|----------|----------|
+| RVCv2          | 0.9404  | 30.43   | 0.7264     | 28.46     | **3.41** | **4.05** | **3.12** |
+| Seed-VC(Ours)  | 0.9375  | 33.35   | **0.7405** | **19.70** | 3.39     | 3.96     | 3.06     |
+
+<details>
+<summary>Click to expand detailed evaluation results</summary>
+
+| Source Singer Type | Characters  | Models\Metrics | F0CORR↑ | F0RMSE↓ | SECS↑      | CER↓      | SIG↑     | BAK↑     | OVRL↑    |
+|--------------------|-------------|----------------|---------|---------|------------|-----------|----------|----------|----------|
+| Alto (Female)      | ~           | Ground Truth   | 1.0000  | 0.00    | ~          | 8.16      | ~        | ~        | ~        |
+|                    | Azuma       | RVCv2          | 0.9617  | 33.03   | **0.7352** | 24.70     | 3.36     | 4.07     | 3.07     |
+|                    |             | Seed-VC(Ours)  | 0.9658  | 31.64   | 0.7341     | **15.23** | 3.37     | 4.02     | 3.07     |
+|                    | Diana       | RVCv2          | 0.9626  | 32.56   | 0.7212     | 19.67     | 3.45     | 4.08     | **3.17** |
+|                    |             | Seed-VC(Ours)  | 0.9648  | 31.94   | **0.7457** | **16.81** | 3.49     | 3.99     | 3.15     |
+|                    | Ding Zhen   | RVCv2          | 0.9013  | 26.72   | 0.7221     | 18.53     | 3.37     | 4.03     | 3.06     |
+|                    |             | Seed-VC(Ours)  | 0.9356  | 21.87   | **0.7513** | **15.63** | 3.44     | 3.94     | **3.09** |
+|                    | Kobe Bryant | RVCv2          | 0.9215  | 23.90   | 0.7495     | 37.23     | 3.49     | 4.06     | **3.21** |
+|                    |             | Seed-VC(Ours)  | 0.9248  | 23.40   | **0.7602** | **26.98** | 3.43     | 4.02     | 3.13     |
+| Bass (Male)        | ~           | Ground Truth   | 1.0000  | 0.00    | ~          | 8.62      | ~        | ~        | ~        |
+|                    | Azuma       | RVCv2          | 0.9288  | 32.62   | **0.7148** | 24.88     | 3.45     | 4.10     | **3.18** |
+|                    |             | Seed-VC(Ours)  | 0.9383  | 31.57   | 0.6960     | **10.31** | 3.45     | 4.03     | 3.15     |
+|                    | Diana       | RVCv2          | 0.9403  | 30.00   | 0.7010     | 14.54     | 3.53     | 4.15     | **3.27** |
+|                    |             | Seed-VC(Ours)  | 0.9428  | 30.06   | **0.7299** | **9.66**  | 3.53     | 4.11     | 3.25     |
+|                    | Ding Zhen   | RVCv2          | 0.9061  | 19.53   | 0.6922     | 25.99     | 3.36     | 4.09     | **3.08** |
+|                    |             | Seed-VC(Ours)  | 0.9169  | 18.15   | **0.7260** | **14.13** | 3.38     | 3.98     | 3.07     |
+|                    | Kobe Bryant | RVCv2          | 0.9302  | 16.37   | 0.7717     | 41.04     | 3.51     | 4.13     | **3.25** |
+|                    |             | Seed-VC(Ours)  | 0.9176  | 17.93   | **0.7798** | **24.23** | 3.42     | 4.08     | 3.17     |
+| Soprano (Female)   | ~           | Ground Truth   | 1.0000  | 0.00    | ~          | 27.92     | ~        | ~        | ~        |
+|                    | Azuma       | RVCv2          | 0.9742  | 47.80   | 0.7104     | 38.70     | 3.14     | 3.85     | **2.83** |
+|                    |             | Seed-VC(Ours)  | 0.9521  | 64.00   | **0.7177** | **33.10** | 3.15     | 3.86     | 2.81     |
+|                    | Diana       | RVCv2          | 0.9754  | 46.59   | **0.7319** | 32.36     | 3.14     | 3.85     | **2.83** |
+|                    |             | Seed-VC(Ours)  | 0.9573  | 59.70   | 0.7317     | **30.57** | 3.11     | 3.78     | 2.74     |
+|                    | Ding Zhen   | RVCv2          | 0.9543  | 31.45   | 0.6792     | 40.80     | 3.41     | 4.08     | **3.14** |
+|                    |             | Seed-VC(Ours)  | 0.9486  | 33.37   | **0.6979** | **34.45** | 3.41     | 3.97     | 3.10     |
+|                    | Kobe Bryant | RVCv2          | 0.9691  | 25.50   | 0.6276     | 61.59     | 3.43     | 4.04     | **3.15** |
+|                    |             | Seed-VC(Ours)  | 0.9496  | 32.76   | **0.6683** | **39.82** | 3.32     | 3.98     | 3.04     |
+| Tenor (Male)       | ~           | Ground Truth   | 1.0000  | 0.00    | ~          | 5.94      | ~        | ~        | ~        |
+|                    | Azuma       | RVCv2          | 0.9333  | 42.09   | **0.7832** | 16.66     | 3.46     | 4.07     | **3.18** |
+|                    |             | Seed-VC(Ours)  | 0.9162  | 48.06   | 0.7697     | **8.48**  | 3.38     | 3.89     | 3.01     |
+|                    | Diana       | RVCv2          | 0.9467  | 36.65   | 0.7729     | 15.28     | 3.53     | 4.08     | **3.24** |
+|                    |             | Seed-VC(Ours)  | 0.9360  | 41.49   | **0.7920** | **8.55**  | 3.49     | 3.93     | 3.13     |
+|                    | Ding Zhen   | RVCv2          | 0.9197  | 22.82   | 0.7591     | **12.92** | 3.40     | 4.02     | **3.09** |
+|                    |             | Seed-VC(Ours)  | 0.9247  | 22.77   | **0.7721** | 13.95     | 3.45     | 3.82     | 3.05     |
+|                    | Kobe Bryant | RVCv2          | 0.9415  | 19.33   | 0.7507     | 30.52     | 3.48     | 4.02     | **3.19** |
+|                    |             | Seed-VC(Ours)  | 0.9082  | 24.86   | **0.7764** | **13.35** | 3.39     | 3.93     | 3.07     |
+</details>
+  
+  
+Although Seed-VC is not trained on the target speakers and only one random utterance is used as the prompt, it still outperforms the speaker-specific RVCv2 models on average 
+in terms of speaker similarity (SECS) and intelligibility (CER), which demonstrates the superior voice cloning capability and robustness of Seed-VC.  
+
+However, Seed-VC's audio quality (DNSMOS) is slightly lower than RVCv2's. We take this drawback seriously and 
+will give high priority to improving the audio quality in the future.  
+PRs or issues are welcome if you find this comparison unfair or inaccurate.
+
+*Chinese ASR result computed by [SenseVoiceSmall](https://github.com/FunAudioLLM/SenseVoice)*  
+*Speaker embedding computed by [resemblyzer](https://github.com/resemble-ai/Resemblyzer) model*  
+*We apply a +12 semitone pitch shift for male-to-female conversion and a -12 semitone pitch shift for female-to-male conversion; otherwise no pitch shift is applied*
+
 ## Installation📥
 Suggested python 3.10 on Windows or Linux.
 ```bash
@@ -114,10 +182,13 @@ Then open the browser and go to `http://localhost:7860/` to use the web interfac
 - [ ] Code for training on custom data
 - [x] Changed to BigVGAN from NVIDIA for singing voice decoding
 - [x] Whisper version model for singing voice conversion
-- [ ] Objective evaluation and comparison with RVC/SoVITS for singing voice conversion
+- [x] Objective evaluation and comparison with RVC/SoVITS for singing voice conversion
+- [ ] Improved audio quality
 - [ ] More to be added
 
 ## CHANGELOGS🗒️
+- 2024-10-25:
+    - Added exhaustive evaluation results and comparisons with RVCv2 for singing voice conversion
 - 2024-10-24:
     - Updated 44kHz singing voice conversion model, with OpenAI Whisper as speech content input
 - 2024-10-07: