Transfer Learning from Speaker Verification to Zero-Shot Multispeaker Text-To-Speech Synthesis
Project goal: Option 2 - we are solving our own problem.
We aimed to build a TTS system that generates speech from text in the voice of an unseen speaker, given only a few seconds of that speaker's audio. For this task, we applied the model from the following paper:
Ye Jia et al. (2018). Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. NIPS.
Instead of using a single complete implementation, we retrieved separate implementations of the model's submodules and constructed the whole pipeline ourselves.
Before starting, install the required packages by running the following command in a Linux terminal:
pip install -r requirements.txt
Inference: You should be able to run this. (A sketch of the pipeline `run.py` wires together follows the steps below.)
- Download the model checkpoint file from https://drive.google.com/open?id=1L5z0SQO9E3m8mKx7cb-yiNqXrYJI7_I7 to any directory you want, and set the `checkpoint_path` variable in `run.py` to its location.
- Place the input text `input_text.txt` and the reference voice `input_voice.wav` in the `./input` folder.
- Run `run.py` with the Python interpreter:
python3 ./run.py
- Find the generated speech at `./output/generated.wav`.
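Conceptually, `run.py` chains the three modules described in the overview below. The following is only a minimal sketch of such a pipeline; the imports, function names, and checkpoint path are hypothetical placeholders, not the actual interfaces defined in this repository.

```python
# Hypothetical end-to-end pipeline sketch; the module interfaces below are
# illustrative placeholders, not the real functions in this repository.
from pathlib import Path

import numpy as np
import soundfile as sf  # assumed helper for reading/writing .wav files

from speaker_encoder import embed_speaker  # wav -> 256-dim speaker embedding (hypothetical)
from synthesizer import text_to_mel        # (text, embedding) -> mel spectrogram (hypothetical)
from vocoder import mel_to_wav             # mel spectrogram -> waveform (hypothetical)

checkpoint_path = "path/to/downloaded/checkpoint"  # set to the downloaded checkpoint file

text = Path("./input/input_text.txt").read_text(encoding="utf-8")
ref_wav, sample_rate = sf.read("./input/input_voice.wav")

embedding = embed_speaker(ref_wav, sample_rate)      # module 1: speaker embedding
mel = text_to_mel(text, embedding, checkpoint_path)  # module 2: text + embedding -> mel
wav = mel_to_wav(mel)                                # module 3: mel -> audio

sf.write("./output/generated.wav", np.asarray(wav), sample_rate)
```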
Submodule training and evaluation: The dataset is not included, so these commands will not run as-is.
Synthesizer training
python3 ./train_tacotron2.py
Vocoder fine-tuning
python3 ./vocoder_preprocess.py --path=YOUR_DATASET_DIRECTORY
python3 ./train_wavernn.py
Speaker encoder evaluation (PCA and t-SNE)
python3 ./tacotron2/speaker_embed/embedding_evaluation.py
python3 ./tacotron2/speaker_embed_unsupervised/embedding_evaluation.py
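For reference, the sketch below shows one way such an evaluation can be done, assuming the utterance embeddings and integer speaker labels have already been saved as NumPy arrays (`embeddings.npy` and `labels.npy` are hypothetical file names; the scripts above do their own loading and plotting).

```python
# Illustrative PCA / t-SNE projection of speaker embeddings; file names are hypothetical.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

embeddings = np.load("embeddings.npy")  # shape (N, 256)
labels = np.load("labels.npy")          # shape (N,), integer speaker IDs

# Linear 2-D projection.
pca_2d = PCA(n_components=2).fit_transform(embeddings)

# Nonlinear 2-D projection; reducing to 50 dims with PCA first speeds up t-SNE.
tsne_2d = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(
    PCA(n_components=50).fit_transform(embeddings)
)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(pca_2d[:, 0], pca_2d[:, 1], c=labels, s=5, cmap="tab20")
ax1.set_title("PCA")
ax2.scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=labels, s=5, cmap="tab20")
ax2.set_title("t-SNE")
fig.savefig("embedding_projection.png", dpi=150)
```

If the embeddings capture speaker identity well, points belonging to the same speaker should form tight, well-separated clusters in these projections.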
Brief overview
project
│   README.md
│   team18_final_report.pdf
│   requirements.txt
│   run.py
│   train_tacotron2.py
│   train_wavernn.py
│
└───documents
│   │   proposal.pdf
│   └───references
│
└───input
│   │   input_text.txt
│   │   input_voice.wav
│
└───output
│   │   mel_from_tacotron2.wav
│
└───preprocessing
│
└───tacotron2
│   └───speaker_embed
│   │   │   embedding_evaluation.py
│   │
│   └───speaker_embed_unsupervised
│   │   │   embedding_evaluation.py
│   │
│   └───text
│   │   │   phoneme.py
│   │
│   └───train_output
│
└───vocoder
- `./input`: Contains the user query files (input text / reference voice).
- `./output`: Generated outputs are stored here.
- `./preprocessing`: Contains code for text parsing and preprocessing.
- `./tacotron2`: Directory for the Tacotron 2-based text-to-spectrogram synthesizer. It uses the RNN-based speaker embedding generator at `./tacotron2/speaker_embed`, pretrained on an English speaker verification task. `./tacotron2/speaker_embed_unsupervised` contains an autoencoder-based, unsupervised speaker embedding generator, which is not used.
- `./vocoder`: Directory for the WaveRNN-based mel-to-audio vocoder.
Our TTS model consists of three independently trained modules (their locations in the repo are described above). We constructed a merged pipeline of modules 1, 2, and 3, so running `run.py` should be sufficient for inference.
1. Speaker embedding generator (LSTM-based, trained on an English speaker verification task)
- From a voice sample of several seconds in `.wav` format, generates a 256-dim vector encoding the speaker identity. (See the usage sketch after this list.)
- Source: GE2E (https://github.com/CorentinJ/Real-Time-Voice-Cloning)
2. Mel spectrogram synthesizer (Tacotron 2-based, trained from scratch by us)
- Given the input text and the speaker embedding generated by module 1, generates a mel spectrogram of synthesized speech in the voice of the reference speaker. (See the conditioning sketch after this list.)
- Source: Tacotron 2 (https://github.com/NVIDIA/tacotron2)
3. Vocoder (WaveRNN-based, pretrained on English audio and fine-tuned by us; the original paper uses WaveNet)
- Given a mel spectrogram, reconstructs the speech audio in `.wav` format. (A mel-inversion sanity-check sketch follows after this list.)
- Source: WaveRNN (https://github.com/fatchord/WaveRNN)
4. [unused] Speaker embedding generator (autoencoder-based, implemented by us, trained in an unsupervised manner; details in the report)
- From a voice sample of several seconds in `.wav` format, generates a 256-dim vector encoding the speaker identity. (A toy sketch follows after this list.)
- Turned out to be less effective than training module 1 on an English dataset, so it is not used.
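Sketch for module 1: how a GE2E-style encoder is typically invoked, following the interface of the linked Real-Time-Voice-Cloning repository. The import path and checkpoint location below are assumptions; the module layout inside `./tacotron2/speaker_embed` may differ.

```python
# Module 1 sketch: extract a 256-dim speaker embedding from a reference .wav.
# Follows the Real-Time-Voice-Cloning encoder interface; the import path and
# checkpoint path are assumptions about how this project wires it up.
from encoder import inference as speaker_encoder

speaker_encoder.load_model("path/to/encoder_checkpoint.pt")

# preprocess_wav resamples the audio, trims silence, and normalizes loudness.
wav = speaker_encoder.preprocess_wav("./input/input_voice.wav")

# embed_utterance runs the LSTM over partial utterances and L2-normalizes the result.
embedding = speaker_encoder.embed_utterance(wav)  # numpy array of shape (256,)
```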
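Sketch for module 2: following Jia et al. (2018), the synthesizer conditions on the speaker by concatenating the embedding onto every encoder time step before attention. A minimal PyTorch illustration of that conditioning step; tensor names are illustrative, not the repo's actual variables.

```python
# Module 2 sketch: concatenate the speaker embedding onto every Tacotron 2
# encoder time step before attention/decoding. Names are illustrative only.
import torch


def condition_on_speaker(encoder_outputs: torch.Tensor,
                         speaker_embedding: torch.Tensor) -> torch.Tensor:
    """encoder_outputs:   (batch, time, enc_dim) Tacotron 2 encoder states
    speaker_embedding: (batch, 256)           output of module 1
    returns:           (batch, time, enc_dim + 256)
    """
    batch, time, _ = encoder_outputs.shape
    expanded = speaker_embedding.unsqueeze(1).expand(batch, time, -1)
    return torch.cat([encoder_outputs, expanded], dim=-1)


# Example shapes: batch of 2 utterances, 100 encoder steps, 512-dim encoder states.
conditioned = condition_on_speaker(torch.randn(2, 100, 512), torch.randn(2, 256))
print(conditioned.shape)  # torch.Size([2, 100, 768])
```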
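Sketch for module 3: the project's vocoder is the WaveRNN model in `./vocoder`, and its exact Python interface is not reproduced here. As a rough, vocoder-free sanity check of a synthesized mel spectrogram, a Griffin-Lim inversion with librosa can be used instead; this is not part of the project pipeline, and the mel parameters and file name below are assumptions that must match the synthesizer's config.

```python
# NOT the project's WaveRNN vocoder: a quick Griffin-Lim inversion of a mel
# spectrogram, useful only as a rough audio sanity check of module 2's output.
import librosa
import numpy as np
import soundfile as sf

# Assumed Tacotron 2-style parameters; adjust to match the synthesizer config.
sr, n_fft, hop_length = 22050, 1024, 256

mel = np.load("./output/mel.npy")  # hypothetical dump of the mel, shape (n_mels, frames)

# Invert the mel filterbank to a linear spectrogram, then run Griffin-Lim.
linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft)
wav = librosa.griffinlim(linear, hop_length=hop_length)

sf.write("./output/griffinlim_check.wav", wav, sr)
```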
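Sketch for module 4: the general idea of an unsupervised, autoencoder-based speaker encoder is to compress an utterance's mel frames into a single 256-dim bottleneck and reconstruct the frames from it; only the bottleneck is kept as the speaker vector. The layers below are purely illustrative, not the architecture described in the report.

```python
# Module 4 sketch: toy autoencoder that compresses mel frames into a 256-dim
# bottleneck used as a speaker vector. Illustrative only; see the report for
# the actual architecture and training details.
import torch
import torch.nn as nn


class SpeakerAutoencoder(nn.Module):
    def __init__(self, n_mels: int = 80, embed_dim: int = 256):
        super().__init__()
        self.encoder = nn.GRU(n_mels, embed_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, n_mels, batch_first=True)

    def embed(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, frames, n_mels); the final GRU state is the speaker vector.
        _, h = self.encoder(mels)
        return h.squeeze(0)  # (batch, 256)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        emb = self.embed(mels)                                   # (batch, 256)
        dec_in = emb.unsqueeze(1).expand(-1, mels.size(1), -1)   # broadcast over time
        recon, _ = self.decoder(dec_in)
        return recon                                             # (batch, frames, n_mels)


# Training minimizes reconstruction error; only embed() is used downstream.
model = SpeakerAutoencoder()
mels = torch.randn(4, 200, 80)
loss = nn.functional.mse_loss(model(mels), mels)
```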