Transfer Learning from Speaker Verification to Zero-Shot Multispeaker Text-To-Speech Synthesis
Project goal: Option 2 - we are solving our own problem.
We aimed to build a TTS system that generates speech from text in the voice of an unseen speaker, given only a few seconds of that speaker's audio. For this task, we applied the model from the following paper:
Ye Jia et al. (2018). Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. NIPS.
Instead of using a single complete implementation, we retrieved separate implementations of the model's submodules and constructed the whole pipeline ourselves.
Before starting, install the required packages by running the following command in a Linux terminal:
pip install -r requirements.txt
Inference: You should be able to run this. (A sketch of the pipeline `run.py` wires together follows the steps below.)
- Download the model checkpoint file from https://drive.google.com/open?id=1L5z0SQO9E3m8mKx7cb-yiNqXrYJI7_I7 to any directory you want, and set the `checkpoint_path` variable in `run.py` to its location.
- Place the input text `input_text.txt` and the reference voice `input_voice.wav` in the `./input` folder.
- Run `run.py` with the Python interpreter:
python3 ./run.py
- Find the generated speech at `./output/generated.wav`.
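Conceptually, `run.py` chains the three modules described in the overview below. The following is only a minimal sketch of such a pipeline; the imports, function names, and checkpoint path are hypothetical placeholders, not the actual interfaces defined in this repository.

```python
# Hypothetical end-to-end pipeline sketch; the module interfaces below are
# illustrative placeholders, not the real functions in this repository.
from pathlib import Path

import numpy as np
import soundfile as sf  # assumed helper for reading/writing .wav files

from speaker_encoder import embed_speaker  # wav -> 256-dim speaker embedding (hypothetical)
from synthesizer import text_to_mel        # (text, embedding) -> mel spectrogram (hypothetical)
from vocoder import mel_to_wav             # mel spectrogram -> waveform (hypothetical)

checkpoint_path = "path/to/downloaded/checkpoint"  # set to the downloaded checkpoint file

text = Path("./input/input_text.txt").read_text(encoding="utf-8")
ref_wav, sample_rate = sf.read("./input/input_voice.wav")

embedding = embed_speaker(ref_wav, sample_rate)      # module 1: speaker embedding
mel = text_to_mel(text, embedding, checkpoint_path)  # module 2: text + embedding -> mel
wav = mel_to_wav(mel)                                # module 3: mel -> audio

sf.write("./output/generated.wav", np.asarray(wav), sample_rate)
```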
Submodule training and evaluation: The dataset is not included, so these commands will not run as-is.
Synthesizer training
python3 ./train_tacotron2.py
Vocoder fine-tuning
python3 ./vocoder_preprocess.py --path=YOUR_DATASET_DIRECTORY
python3 ./train_wavernn.py
Speaker encoder evaluation (PCA and t-SNE)
python3 ./tacotron2/speaker_embed/embedding_evaluation.py
python3 ./tacotron2/speaker_embed_unsupervised/embedding_evaluation.py
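For reference, the sketch below shows one way such an evaluation can be done, assuming the utterance embeddings and integer speaker labels have already been saved as NumPy arrays (`embeddings.npy` and `labels.npy` are hypothetical file names; the scripts above do their own loading and plotting).

```python
# Illustrative PCA / t-SNE projection of speaker embeddings; file names are hypothetical.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

embeddings = np.load("embeddings.npy")  # shape (N, 256)
labels = np.load("labels.npy")          # shape (N,), integer speaker IDs

# Linear 2-D projection.
pca_2d = PCA(n_components=2).fit_transform(embeddings)

# Nonlinear 2-D projection; reducing to 50 dims with PCA first speeds up t-SNE.
tsne_2d = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(
    PCA(n_components=50).fit_transform(embeddings)
)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(pca_2d[:, 0], pca_2d[:, 1], c=labels, s=5, cmap="tab20")
ax1.set_title("PCA")
ax2.scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=labels, s=5, cmap="tab20")
ax2.set_title("t-SNE")
fig.savefig("embedding_projection.png", dpi=150)
```

If the embeddings capture speaker identity well, points belonging to the same speaker should form tight, well-separated clusters in these projections.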
Brief overview
project
│   README.md
│   team18_final_report.pdf
│   requirements.txt
│   run.py
│   train_tacotron2.py
│   train_wavernn.py
│
└───documents
│   │   proposal.pdf
│   └───references
│
└───input
│   │   input_text.txt
│   │   input_voice.wav
│
└───output
│   │   mel_from_tacotron2.wav
│
└───preprocessing
│
└───tacotron2
│   └───speaker_embed
│   │   │   embedding_evaluation.py
│   │
│   └───speaker_embed_unsupervised
│   │   │   embedding_evaluation.py
│   │
│   └───text
│   │   │   phoneme.py
│   │
│   └───train_output
│
└───vocoder
- `./input`: Contains the user query files (input text / reference voice).
- `./output`: Generated outputs are stored here.
- `./preprocessing`: Contains code for text parsing and preprocessing.
- `./tacotron2`: Directory for the Tacotron 2-based text-to-spectrogram synthesizer. It uses the RNN-based speaker embedding generator at `./tacotron2/speaker_embed`, pretrained on an English speaker verification task. `./tacotron2/speaker_embed_unsupervised` contains an autoencoder-based, unsupervised speaker embedding generator, which is not used.
- `./vocoder`: Directory for the WaveRNN-based mel-to-audio vocoder.
Our TTS model consists of three independently trained modules (their locations in the repo are described above). We constructed a merged pipeline of modules 1, 2, and 3, so running `run.py` should be sufficient for inference.
1. Speaker embedding generator (LSTM-based, trained on an English speaker verification task)
- From a voice sample of several seconds in `.wav` format, generates a 256-dim vector encoding the speaker identity. (See the usage sketch after this list.)
- Source: GE2E (https://github.com/CorentinJ/Real-Time-Voice-Cloning)
2. Mel spectrogram synthesizer (Tacotron 2-based, trained from scratch by us)
- Given the input text and the speaker embedding generated by module 1, generates a mel spectrogram of synthesized speech in the voice of the reference speaker. (See the conditioning sketch after this list.)
- Source: Tacotron 2 (https://github.com/NVIDIA/tacotron2)
3. Vocoder (WaveRNN-based, pretrained on English audio and fine-tuned by us; the original paper uses WaveNet)
- Given a mel spectrogram, reconstructs the speech audio in `.wav` format. (A mel-inversion sanity-check sketch follows after this list.)
- Source: WaveRNN (https://github.com/fatchord/WaveRNN)
4. [unused] Speaker embedding generator (autoencoder-based, implemented by us, trained in an unsupervised manner; details in the report)
- From a voice sample of several seconds in `.wav` format, generates a 256-dim vector encoding the speaker identity. (A toy sketch follows after this list.)
- Turned out to be less effective than training module 1 on an English dataset, so it is not used.
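Sketch for module 1: how a GE2E-style encoder is typically invoked, following the interface of the linked Real-Time-Voice-Cloning repository. The import path and checkpoint location below are assumptions; the module layout inside `./tacotron2/speaker_embed` may differ.

```python
# Module 1 sketch: extract a 256-dim speaker embedding from a reference .wav.
# Follows the Real-Time-Voice-Cloning encoder interface; the import path and
# checkpoint path are assumptions about how this project wires it up.
from encoder import inference as speaker_encoder

speaker_encoder.load_model("path/to/encoder_checkpoint.pt")

# preprocess_wav resamples the audio, trims silence, and normalizes loudness.
wav = speaker_encoder.preprocess_wav("./input/input_voice.wav")

# embed_utterance runs the LSTM over partial utterances and L2-normalizes the result.
embedding = speaker_encoder.embed_utterance(wav)  # numpy array of shape (256,)
```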
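Sketch for module 2: following Jia et al. (2018), the synthesizer conditions on the speaker by concatenating the embedding onto every encoder time step before attention. A minimal PyTorch illustration of that conditioning step; tensor names are illustrative, not the repo's actual variables.

```python
# Module 2 sketch: concatenate the speaker embedding onto every Tacotron 2
# encoder time step before attention/decoding. Names are illustrative only.
import torch


def condition_on_speaker(encoder_outputs: torch.Tensor,
                         speaker_embedding: torch.Tensor) -> torch.Tensor:
    """encoder_outputs:   (batch, time, enc_dim) Tacotron 2 encoder states
    speaker_embedding: (batch, 256)           output of module 1
    returns:           (batch, time, enc_dim + 256)
    """
    batch, time, _ = encoder_outputs.shape
    expanded = speaker_embedding.unsqueeze(1).expand(batch, time, -1)
    return torch.cat([encoder_outputs, expanded], dim=-1)


# Example shapes: batch of 2 utterances, 100 encoder steps, 512-dim encoder states.
conditioned = condition_on_speaker(torch.randn(2, 100, 512), torch.randn(2, 256))
print(conditioned.shape)  # torch.Size([2, 100, 768])
```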
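Sketch for module 3: the project's vocoder is the WaveRNN model in `./vocoder`, and its exact Python interface is not reproduced here. As a rough, vocoder-free sanity check of a synthesized mel spectrogram, a Griffin-Lim inversion with librosa can be used instead; this is not part of the project pipeline, and the mel parameters and file name below are assumptions that must match the synthesizer's config.

```python
# NOT the project's WaveRNN vocoder: a quick Griffin-Lim inversion of a mel
# spectrogram, useful only as a rough audio sanity check of module 2's output.
import librosa
import numpy as np
import soundfile as sf

# Assumed Tacotron 2-style parameters; adjust to match the synthesizer config.
sr, n_fft, hop_length = 22050, 1024, 256

mel = np.load("./output/mel.npy")  # hypothetical dump of the mel, shape (n_mels, frames)

# Invert the mel filterbank to a linear spectrogram, then run Griffin-Lim.
linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft)
wav = librosa.griffinlim(linear, hop_length=hop_length)

sf.write("./output/griffinlim_check.wav", wav, sr)
```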
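Sketch for module 4: the general idea of an unsupervised, autoencoder-based speaker encoder is to compress an utterance's mel frames into a single 256-dim bottleneck and reconstruct the frames from it; only the bottleneck is kept as the speaker vector. The layers below are purely illustrative, not the architecture described in the report.

```python
# Module 4 sketch: toy autoencoder that compresses mel frames into a 256-dim
# bottleneck used as a speaker vector. Illustrative only; see the report for
# the actual architecture and training details.
import torch
import torch.nn as nn


class SpeakerAutoencoder(nn.Module):
    def __init__(self, n_mels: int = 80, embed_dim: int = 256):
        super().__init__()
        self.encoder = nn.GRU(n_mels, embed_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, n_mels, batch_first=True)

    def embed(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, n_mels); the final GRU state is the speaker vector.
        _, h = self.encoder(mels)
        return h.squeeze(0)  # (batch, 256)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        emb = self.embed(mels)                                   # (batch, 256)
        dec_in = emb.unsqueeze(1).expand(-1, mels.size(1), -1)   # broadcast over time
        recon, _ = self.decoder(dec_in)
        return recon                                             # (batch, frames, n_mels)


# Training minimizes reconstruction error; only embed() is used downstream.
model = SpeakerAutoencoder()
mels = torch.randn(4, 200, 80)
loss = nn.functional.mse_loss(model(mels), mels)
```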