Tacotron is an end-to-end speech synthesis model, first introduced in *Tacotron: Towards End-to-End Speech Synthesis*. It takes character-level text as input and predicts mel filterbanks and the linear spectrogram as targets. Although it is a generative model, I wanted to test how well it can be applied to the speech recognition task.
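As a rough illustration of what "character-level input" means here, below is a minimal text-to-index mapping. The vocabulary string and helper names are hypothetical, not the ones actually defined in this repo:

```python
# Hypothetical character vocabulary; the real one lives in hyperparams.py.
# "P" marks padding and "E" marks end-of-sentence, a common Tacotron convention.
vocab = "PE abcdefghijklmnopqrstuvwxyz'.?"
char2idx = {c: i for i, c in enumerate(vocab)}
idx2char = {i: c for i, c in enumerate(vocab)}

def text_to_ids(text):
    """Map a transcript to integer IDs, dropping out-of-vocabulary characters."""
    return [char2idx[c] for c in text.lower() if c in char2idx]

def ids_to_text(ids):
    """Inverse mapping, useful for reading model outputs during evaluation."""
    return "".join(idx2char[i] for i in ids)
```

For speech recognition these IDs become the decoder targets rather than the encoder inputs, the reverse of the original synthesis setup.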
## Requirements

- NumPy >= 1.11.1
- TensorFlow == 1.1
- librosa
## Data

I use the VCTK Corpus, one of the most popular speech corpora, for this experiment. Because it has no pre-defined train/evaluation split, 10 * (mini-batch size) samples that do not appear in the training set are reserved for evaluation.
## File description

- `hyperparams.py` includes all hyperparameters.
- `prepro.py` creates training and evaluation data in the `data/` folder.
- `data_load.py` loads data and puts it in queues so multiple mini-batches are generated in parallel.
- `utils.py` has some operational functions.
- `modules.py` contains building blocks for the encoding and decoding networks.
- `networks.py` defines the encoding and decoding networks.
- `train.py` executes training.
- `eval.py` executes evaluation.
## Training

- STEP 1. Download and extract the VCTK Corpus, and adjust the value of `vctk` in `hyperparams.py`.
- STEP 2. Adjust other hyperparameters in `hyperparams.py` if necessary.
- STEP 3. Run `train_multiple_gpus.py` if you want to use more than one GPU, otherwise `train.py`.
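To make STEP 1 concrete, `hyperparams.py` holds settings like the corpus path referenced above. The sketch below is purely illustrative; apart from `vctk`, the attribute names and values are hypothetical, not the repo's actual ones:

```python
# Hypothetical sketch of the kind of module hyperparams.py is;
# only the `vctk` name is taken from the instructions above.
class Hyperparams:
    vctk = "/path/to/VCTK-Corpus"  # STEP 1: point this at the extracted corpus
    batch_size = 32                # illustrative value
    lr = 0.001                     # illustrative value
```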
## Evaluation

- Run `eval.py` to get speech recognition results for the test set.
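A common way to score such recognition results is the character error rate (CER), i.e. edit distance divided by reference length. This is a minimal, self-contained sketch, not the repo's actual scoring code:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, via dynamic programming
    over a single row of the edit matrix."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = dp[0]
        dp[0] = i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,         # deletion
                        dp[j - 1] + 1,     # insertion
                        prev + (r != h))   # substitution (free if chars match)
            prev = cur
    return dp[-1]

def cer(ref, hyp):
    """Character error rate: edits needed to turn hyp into ref, per ref char."""
    return edit_distance(ref, hyp) / max(len(ref), 1)
```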