
Multi-Speaker Tacotron in TensorFlow

TensorFlow implementation of a multi-speaker Tacotron text-to-speech model.

Audio samples (in Korean) can be found here.


Prerequisites

Usage

1. Install prerequisites

After installing TensorFlow, install the remaining prerequisites with:

pip3 install -r requirements.txt
python -c "import nltk; nltk.download('punkt')"

If you want to synthesize Korean speech directly, skip the dataset steps below and download a pre-trained model instead.

2-1. Generate custom datasets

The datasets directory should look like:

datasets
├── son
│   ├── alignment.json
│   └── audio
│       ├── 1.mp3
│       ├── 2.mp3
│       ├── 3.mp3
│       └── ...
└── YOUR_DATASET
    ├── alignment.json
    └── audio
        ├── 1.mp3
        ├── 2.mp3
        ├── 3.mp3
        └── ...

and YOUR_DATASET/alignment.json should look like:

{
    "./datasets/YOUR_DATASET/audio/1.mp3": "My name is Taehoon Kim.",
    "./datasets/YOUR_DATASET/audio/2.mp3": "The buses aren't the problem.",
    "./datasets/YOUR_DATASET/audio/3.mp3": "They have discovered a new particle."
}
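
If your transcripts live in a simple tab-separated file, a few lines of Python can emit this file. The snippet below is an illustrative sketch; transcripts.tsv and its one-pair-per-line layout are assumptions, not part of the repo:

    import json

    # Hypothetical input: one "<audio filename>\t<transcript>" line per clip.
    alignment = {}
    with open("transcripts.tsv", encoding="utf-8") as f:
        for line in f:
            name, text = line.rstrip("\n").split("\t", 1)
            alignment["./datasets/YOUR_DATASET/audio/" + name] = text

    with open("./datasets/YOUR_DATASET/alignment.json", "w", encoding="utf-8") as f:
        json.dump(alignment, f, ensure_ascii=False, indent=4)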

Once the dataset is prepared as described, generate the preprocessed data with:

python3 -m datasets.generate_data ./datasets/YOUR_DATASET/alignment.json

2-2. Generate Korean datasets

Follow the commands below (illustrated with the son dataset).

  1. To automate the alignment between audio and text, set GOOGLE_APPLICATION_CREDENTIALS so the Google Speech Recognition API can be used. To get credentials, read this.

    export GOOGLE_APPLICATION_CREDENTIALS="YOUR-GOOGLE.CREDENTIALS.json"
    
  2. Download the speech (or video) and the corresponding text.

    python3 -m datasets.son.download
    
  3. Split every audio file on silence (a standalone sketch of this step appears after this list).

    python3 -m audio.silence --audio_pattern "./datasets/son/audio/*.wav" --method=pydub
    
  4. Use the Google Speech Recognition API to predict a sentence for every segmented audio clip.

    python3 -m recognition.google --audio_pattern "./datasets/son/audio/*.*.wav"
    
  5. Compare the original text with the recognized text, and save the matching audio<->text pairs to ./datasets/son/alignment.json.

    python3 -m recognition.alignment --recognition_path "./datasets/son/recognition.json" --score_threshold=0.5
    
  6. Finally, generate the NumPy files that will be used in training.

    python3 -m datasets.generate_data ./datasets/son/alignment.json
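
Step 3 above relies on pydub's silence-based splitting. The sketch below shows the core idea in isolation, assuming pydub (and ffmpeg) are installed; the paths and threshold values are illustrative guesses, not the repo's defaults:

    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    # Load one source recording (assumed path; any format ffmpeg handles works).
    audio = AudioSegment.from_file("./datasets/son/audio/1.wav")

    # Cut wherever at least 500 ms stays below -40 dBFS, keeping a little
    # padding around each chunk so word onsets are not clipped.
    chunks = split_on_silence(
        audio,
        min_silence_len=500,   # ms of quiet that counts as a break
        silence_thresh=-40,    # dBFS level treated as silence
        keep_silence=100,      # ms of padding kept on each side
    )

    for i, chunk in enumerate(chunks):
        chunk.export("./datasets/son/audio/1.{:04d}.wav".format(i), format="wav")

Names like "1.0000.wav" match the "*.*.wav" pattern that the recognition step expects.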
    

Because the automatic generation is extremely naive, the resulting dataset is noisy. However, given enough data (20+ hours when training from random initialization, or 5+ hours when initializing from a pretrained model), you can expect acceptable audio synthesis quality.

2-3. Generate English datasets

  1. Download the LJ Speech dataset from https://keithito.com/LJ-Speech-Dataset/

  2. Convert the metadata CSV file to a JSON file (command-line arguments are available to change preferences; see the sketch after this list).

     python3 -m datasets.LJSpeech_1_0.prepare
    
  3. Finally, generate the NumPy files that will be used in training.

     python3 -m datasets.generate_data ./datasets/LJSpeech_1_0
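
For reference, LJ Speech's metadata.csv stores one pipe-delimited line per clip (id|raw transcript|normalized transcript). The repo's prepare script handles the conversion, but a minimal hand-rolled version would look roughly like this (the output path and format are assumptions):

    import csv
    import json

    alignment = {}
    with open("./datasets/LJSpeech_1_0/metadata.csv", encoding="utf-8") as f:
        # metadata.csv: id|raw transcript|normalized transcript, no header row
        for file_id, _raw, normalized in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            alignment["./datasets/LJSpeech_1_0/wavs/{}.wav".format(file_id)] = normalized

    with open("./datasets/LJSpeech_1_0/alignment.json", "w", encoding="utf-8") as f:
        json.dump(alignment, f, ensure_ascii=False, indent=4)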
    

3. Train a model

The important hyperparameters for the models are defined in hparams.py.

(To train with an English dataset, change cleaners in hparams.py from korean_cleaners to english_cleaners.)
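For orientation, the relevant entries in hparams.py look roughly like this (an illustrative excerpt, not the file verbatim):

    # hparams.py (illustrative excerpt)
    cleaners = 'korean_cleaners'  # switch to 'english_cleaners' for English data
    model_type = 'single'         # assumed default; 'deepvoice' or 'simple' for multi-speaker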

To train a single-speaker model:

python3 train.py --data_path=datasets/son
python3 train.py --data_path=datasets/son --initialize_path=PATH_TO_CHECKPOINT

To train a multi-speaker model:

# after changing `model_type` in `hparams.py` to `deepvoice` or `simple`
python3 train.py --data_path=datasets/son1,datasets/son2

To resume training from a previous experiment such as logs/son-20171015:

python3 train.py --data_path=datasets/son --load_path logs/son-20171015

If you don't have a good-quality dataset of sufficient size (10+ hours), it is better to pass --initialize_path so that a well-trained model supplies the initial parameters.

4. Synthesize audio

You can run the synthesis app with:

python3 app.py --load_path logs/son-20171015 --num_speakers=1

or generate audio directly with:

python3 synthesizer.py --load_path logs/son-20171015 --text "이거 실화냐?"

4-1. Synthesize non-Korean (English) audio

To generate non-Korean audio, pass the argument --is_korean=False:

python3 app.py --load_path logs/LJSpeech_1_0-20180108 --num_speakers=1 --is_korean=False
python3 synthesizer.py --load_path logs/LJSpeech_1_0-20180108 --text="Winter is coming." --is_korean=False

Results

Training attention on a single-speaker model:

[attention alignment plot]

Training attention on a multi-speaker model:

[attention alignment plot]
