
Team 18 CS470 Final Project

Transfer Learning from Speaker Verification to Zero-Shot Multispeaker Text-To-Speech Synthesis

What is this?

Project goal: Option 2 - we are solving our own problem.
We aimed to build a TTS system that generates speech from text in the voice of an unseen speaker, given only a few seconds of that speaker's audio. For this task, we applied the model from the following paper:

Ye Jia et al. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. NeurIPS 2018.

Rather than using an existing complete implementation, we collected separate implementations of the model's submodules and assembled the whole pipeline ourselves.

How to use

Before starting, install the required packages by running the following command in a Linux terminal.

pip install -r requirements.txt

Inference: you should be able to run this out of the box.

  1. Download the model checkpoint file from https://drive.google.com/open?id=1L5z0SQO9E3m8mKx7cb-yiNqXrYJI7_I7 to any directory you like, and point the checkpoint_path variable in run.py to it.
  2. Place the input text input_text.txt and the reference voice input_voice.wav in the ./input folder. (A small sanity-check sketch follows these steps.)
  3. Run run.py with the Python interpreter:
python3 ./run.py
  4. Find the generated speech at ./output/generated.wav.
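Before running, it may help to verify that the checkpoint and the two input files are where run.py expects them. The snippet below is a minimal, hypothetical pre-flight check, not part of this repository; set checkpoint_path to wherever you saved the download in step 1.

# preflight.py - hypothetical sanity check, not part of this repository.
from pathlib import Path

checkpoint_path = "model.ckpt"  # assumed name: use wherever you saved the checkpoint (step 1)
required = [
    Path(checkpoint_path),
    Path("input/input_text.txt"),
    Path("input/input_voice.wav"),
]
missing = [str(p) for p in required if not p.exists()]
if missing:
    raise FileNotFoundError("Missing: " + ", ".join(missing))
print("All files in place; now run: python3 ./run.py")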

Submodule training and evaluation: the datasets are not included, so the following scripts will not run as-is.

Synthesizer training

python3 ./train_tacotron2.py

Vocoder fine-tuning

python3 ./vocoder_preprocess.py --path=YOUR_DATASET_DIRECTORY
python3 ./train_wavernn.py

Speaker encoder evaluation (PCA and t-SNE)

python3 ./tacotron2/speaker_embed/embedding_evaluation.py
python3 ./tacotron2/speaker_embed_unsupervised/embedding_evaluation.py
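As an illustration of what these scripts do, here is a minimal sketch of the PCA / t-SNE evaluation idea. It assumes the embeddings and speaker labels have already been exported as NumPy arrays; embeddings.npy and labels.npy are hypothetical file names, not outputs of the scripts above.

# embedding_viz.py - illustrative sketch; file names below are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

embeddings = np.load("embeddings.npy")  # hypothetical: (N, 256) speaker embeddings
labels = np.load("labels.npy")          # hypothetical: N integer speaker ids

# PCA: linear 2-D projection; t-SNE: nonlinear, neighborhood-preserving projection.
pca_2d = PCA(n_components=2).fit_transform(embeddings)
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, proj, title in [(axes[0], pca_2d, "PCA"), (axes[1], tsne_2d, "t-SNE")]:
    ax.scatter(proj[:, 0], proj[:, 1], c=labels, cmap="tab10", s=8)
    ax.set_title(title)
fig.tight_layout()
fig.savefig("embedding_projections.png")

One well-separated cluster per speaker in these plots indicates that the encoder captures speaker identity.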

Repository structure

Brief overview
.
│   README.md
│   team18_final_report.pdf
│   requirements.txt
│   run.py
│   train_tacotron2.py
│   train_wavernn.py
│
└───documents
│   │   proposal.pdf
│   └───references
│
└───input
│   │   input_text.txt
│   │   input_voice.wav
│
└───output
│   │   mel_from_tacotron2.wav
│
└───preprocessing
│
└───tacotron2
│   └───speaker_embed
│   │   │   embedding_evaluation.py
│   │
│   └───speaker_embed_unsupervised
│   │   │   embedding_evaluation.py
│   │
│   └───text
│   │   │   phoneme.py
│   │
│   └───train_output
│
└───vocoder
./input: Contains the user query files (input text and reference voice).
./output: Generated output is stored here.
./preprocessing: Contains code for text parsing and preprocessing.
./tacotron2: Directory for the Tacotron 2-based text-to-spectrogram synthesizer. It uses the RNN-based speaker embedding generator in ./tacotron2/speaker_embed, pretrained on an English speaker verification task. ./tacotron2/speaker_embed_unsupervised contains an autoencoder-based, unsupervised speaker embedding generator, which is not used in the final pipeline.
./vocoder: Directory for the WaveRNN-based mel-to-audio vocoder.

Module description

Our TTS model consists of three independently trained modules (their locations in the repository are described above). We merged modules 1, 2, and 3 into a single pipeline, so running run.py is sufficient for inference.

1. Speaker embedding generator (LSTM-based, trained on an English speaker verification task)

2. Mel spectrogram synthesizer (Tacotron 2-based, trained from scratch by us)

  • Given input text and the speaker embedding from module 1, generates a mel spectrogram of synthesized speech in the voice of the reference speaker (a small sketch of this conditioning follows the module list).
  • Source: Tacotron 2 (https://github.com/NVIDIA/tacotron2)

3. Vocoder (WaveRNN-based, pretrained on English audio and fine-tuned by us)

4. [unused] Speaker embedding generator (autoencoder-based, implemented by us, trained without supervision; details in the report)

  • From a voice sample of several seconds in .wav format, generates a 256-dim vector encoding speaker identity.
  • Turned out to be less effective than module 1 trained on an English dataset, so it is not used in the final pipeline.
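To make the conditioning in module 2 concrete, here is a small, runnable sketch of the scheme described in the SV2TTS paper: the speaker embedding is broadcast across time and concatenated onto each Tacotron 2 encoder output frame before attention. The dimensions are toy values for illustration, not the repository's actual configuration.

# conditioning_sketch.py - illustrative only; toy sizes, not the repo's API.
import numpy as np

T_text, enc_dim, spk_dim = 40, 512, 256               # toy dimensions
encoder_outputs = np.random.randn(T_text, enc_dim)    # one encoder state per input token
speaker_embedding = np.random.randn(spk_dim)          # module 1 output for the reference voice

# Broadcast the embedding across time and concatenate feature-wise,
# so every encoder frame carries the speaker identity.
conditioned = np.concatenate(
    [encoder_outputs, np.tile(speaker_embedding, (T_text, 1))],
    axis=1,
)
print(conditioned.shape)  # (40, 768): what the attention/decoder consumes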
