TensorFlow implementation of:
- Deep Voice 2: Multi-Speaker Neural Text-to-Speech
- Listening while Speaking: Speech Chain by Deep Learning
- Tacotron: Towards End-to-End Speech Synthesis
Sample audio (in Korean) can be found here.
- Python 3.6+
- FFmpeg
- TensorFlow 1.3
After installing TensorFlow, install the prerequisites with:
pip3 install -r requirements.txt
python -c "import nltk; nltk.download('punkt')"
If you want to synthesize speech in Korean directly, follow 2-3. Download pre-trained models.
The datasets directory should look like:
datasets
├── son
│   ├── alignment.json
│   └── audio
│       ├── 1.mp3
│       ├── 2.mp3
│       ├── 3.mp3
│       └── ...
└── YOUR_DATASET
    ├── alignment.json
    └── audio
        ├── 1.mp3
        ├── 2.mp3
        ├── 3.mp3
        └── ...
and YOUR_DATASET/alignment.json should look like:
{
    "./datasets/YOUR_DATASET/audio/001.mp3": "My name is Taehoon Kim.",
    "./datasets/YOUR_DATASET/audio/002.mp3": "The buses aren't the problem.",
    "./datasets/YOUR_DATASET/audio/003.mp3": "They have discovered a new particle."
}
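If you already have transcripts for your own recordings, an alignment.json in this format can be written with a few lines of Python. Below is a minimal sketch; the tab-separated transcripts.txt and its layout are assumptions for illustration, not files this repository provides:

    # Minimal sketch: build alignment.json from a hypothetical tab-separated
    # transcripts.txt with lines like "001.mp3<TAB>My name is Taehoon Kim."
    import json
    import os

    alignment = {}
    with open("transcripts.txt", encoding="utf-8") as f:
        for line in f:
            filename, text = line.rstrip("\n").split("\t", 1)
            audio_path = os.path.join("./datasets/YOUR_DATASET/audio", filename)
            alignment[audio_path] = text

    with open("./datasets/YOUR_DATASET/alignment.json", "w", encoding="utf-8") as f:
        json.dump(alignment, f, ensure_ascii=False, indent=2)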
After preparing the dataset as described, generate the preprocessed data with:
python3 -m datasets.generate_data ./datasets/YOUR_DATASET/alignment.json
Follow the commands below (illustrated with the son dataset):
- To automate alignment between audio and text, prepare GOOGLE_APPLICATION_CREDENTIALS to use the Google Speech Recognition API. To get credentials, read this.
  export GOOGLE_APPLICATION_CREDENTIALS="YOUR-GOOGLE.CREDENTIALS.json"
- Download speech (or video) and text.
  python3 -m datasets.son.download
- Segment all audio files on silence (illustrated in the pydub sketch after this list).
  python3 -m audio.silence --audio_pattern "./datasets/son/audio/*.wav" --method=pydub
- Using the Google Speech Recognition API, predict sentences for all segmented audio files.
  python3 -m recognition.google --audio_pattern "./datasets/son/audio/*.*.wav"
- By comparing the original text and the recognized text, save audio<->text pair information into ./datasets/son/alignment.json (a rough scoring sketch is shown below).
  python3 -m recognition.alignment --recognition_path "./datasets/son/recognition.json" --score_threshold=0.5
- Finally, generate numpy files which will be used in training.
  python3 -m datasets.generate_data ./datasets/son/alignment.json
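The silence segmentation step above can be pictured with a short pydub sketch. The thresholds and paths below are illustrative assumptions, not the exact settings used by audio.silence:

    # Illustrative silence-based segmentation with pydub.
    # Threshold values are assumptions; audio.silence may use different ones.
    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    audio = AudioSegment.from_wav("./datasets/son/audio/1.wav")
    chunks = split_on_silence(
        audio,
        min_silence_len=500,   # ms of silence that counts as a break
        silence_thresh=-40,    # dBFS below which audio is treated as silence
        keep_silence=200,      # ms of padding kept around each chunk
    )
    for i, chunk in enumerate(chunks):
        chunk.export("./datasets/son/audio/1.{:04d}.wav".format(i), format="wav")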
Because the automatic generation is extremely naive, the dataset will be noisy. However, if you have enough data (20+ hours with random initialization, or 5+ hours when initializing from a pretrained model), you can expect acceptable audio synthesis quality.
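The --score_threshold above is a similarity cutoff between the original sentence and the recognized sentence. A rough idea of such a comparison, using Python's difflib rather than the repository's own scoring code, looks like this:

    # Rough sketch of scoring recognized text against the original transcript.
    # difflib is used for illustration; recognition.alignment may score differently.
    import difflib

    def similarity(original, recognized):
        # Ratio in [0, 1]; 1.0 means the two strings match exactly.
        return difflib.SequenceMatcher(None, original, recognized).ratio()

    original = "The buses aren't the problem."
    recognized = "the buses are not the problem"
    if similarity(original.lower(), recognized.lower()) >= 0.5:
        print("keep this audio<->text pair")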
- Download the speech dataset from https://keithito.com/LJ-Speech-Dataset/
- Convert the metadata CSV file to a JSON file (arguments are available for changing preferences; see the sketch after this list).
  python3 -m datasets.LJSpeech_1_0.prepare
- Finally, generate numpy files which will be used in training.
  python3 -m datasets.generate_data ./datasets/LJSpeech_1_0
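Conceptually, the CSV-to-JSON step maps each wav file to its (normalized) transcript. The sketch below assumes the standard LJ Speech metadata.csv layout (id|transcript|normalized transcript) and ignores the actual options of datasets.LJSpeech_1_0.prepare:

    # Minimal sketch: build an alignment-style JSON from LJ Speech's metadata.csv.
    # Paths and the output file name are assumptions for illustration.
    import json

    entries = {}
    with open("./datasets/LJSpeech_1_0/metadata.csv", encoding="utf-8") as f:
        for line in f:
            file_id, _, normalized = line.rstrip("\n").split("|", 2)
            entries["./datasets/LJSpeech_1_0/wavs/{}.wav".format(file_id)] = normalized

    with open("./datasets/LJSpeech_1_0/alignment.json", "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)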
The important hyperparameters for the models are defined in hparams.py. (Change cleaners in hparams.py from korean_cleaners to english_cleaners to train with an English dataset.)
To train a single-speaker model:
python3 train.py --data_path=datasets/son
python3 train.py --data_path=datasets/son --initialize_path=PATH_TO_CHECKPOINT
To train a multi-speaker model:
# after changing `model_type` in `hparams.py` to `deepvoice` or `simple`
python3 train.py --data_path=datasets/son1,datasets/son2
To restart training from a previous experiment such as logs/son-20171015:
python3 train.py --data_path=datasets/son --load_path logs/son-20171015
If you don't have a good and large enough dataset (10+ hours), it is better to use --initialize_path to start from the parameters of a well-trained model.
You can run the demo app with a trained model:
python3 app.py --load_path logs/son-20171015 --num_speakers=1
or generate audio directly with:
python3 synthesizer.py --load_path logs/son-20171015 --text "이거 실화냐?"
To generate non-Korean audio, you must pass the argument --is_korean=False.
python3 app.py --load_path logs/LJSpeech_1_0-20180108 --num_speakers=1 --is_korean=False
python3 synthesizer.py --load_path logs/LJSpeech_1_0-20180108 --text="Winter is coming." --is_korean=False
Training attention on a single-speaker model:
Training attention on a multi-speaker model: