A PyTorch implementation of Google's Tacotron end-to-end TTS system.
- 2018/09/15: Fix RNN feeding bug.
- 2018/11/04: Add attention mask and loss mask.
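As a rough sketch of what a loss mask does (the function name and shapes here are illustrative, not this repo's actual API): padded frames beyond each utterance's true length are zeroed out before the loss is averaged, so padding does not pull the loss down.

```python
import numpy as np

def masked_l1_loss(pred, target, lengths):
    """L1 loss that ignores zero-padded frames.

    pred, target: (batch, max_time, n_mels) arrays.
    lengths: true frame count of each utterance in the batch.
    """
    batch, max_time, _ = pred.shape
    # mask[b, t] is 1 for real frames, 0 for padding.
    mask = (np.arange(max_time)[None, :] < np.asarray(lengths)[:, None]).astype(pred.dtype)
    diff = np.abs(pred - target) * mask[:, :, None]
    # Average over real entries only, not over padding.
    return diff.sum() / (mask.sum() * pred.shape[2])
```

The same idea carries over to PyTorch tensors one-to-one.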
Install Python and PyTorch:
- python==3.6.5
- pytorch==0.4.1
The remaining packages below can be installed with `requirements.txt`.
```shell
# I recommend using a virtualenv.
$ pip install -r requirements.txt
```
- librosa
- numpy
- pandas
- scipy
- matplotlib
Data
Download LJSpeech provided by keithito. It contains 13,100 short audio clips from a single speaker, totaling approximately 24 hours.
Set the config.

Set `meta_path` and `wav_dir` in `hyperparams.py` to the paths of your downloaded LJSpeech metadata file and wav directory:

```python
meta_path = 'Data/LJSpeech-1.1/metadata.csv'
wav_dir = 'Data/LJSpeech-1.1/wavs'
```
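LJSpeech's `metadata.csv` is pipe-separated (`file id|raw transcript|normalized transcript`). A minimal sketch of turning it into (wav path, text) pairs (the function name is illustrative, not the repo's actual API):

```python
import os

def load_meta(meta_path, wav_dir):
    """Return (wav_path, normalized_text) pairs from LJSpeech's metadata.csv."""
    pairs = []
    with open(meta_path, encoding='utf-8') as f:
        for line in f:
            # Each line: file id | raw transcript | normalized transcript
            file_id, _, text = line.rstrip('\n').split('|')
            pairs.append((os.path.join(wav_dir, file_id + '.wav'), text))
    return pairs
```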
- Train

```shell
# If you have a pretrained model, add --ckpt <ckpt_path>.
$ python main.py --train --cuda
```
- Evaluate

```shell
# You can change the evaluation texts in hyperparams.py.
# ckpt files are saved in 'tmp/ckpt/' by default.
$ python main.py --eval --cuda --ckpt <ckpt_timestep.pth.tar>
```
The sample texts are based on the Harvard Sentences. See the samples in samples/, which were generated after 200k training steps. The model starts learning something at around 30k steps.
Differences from the original paper

- Data bucketing (the original Tacotron used a loss mask)
- Residual connection removed in the decoder CBHG
- Batch size set to 8
- Gradient clipping
- Noam-style learning rate decay (the schedule used in Attention Is All You Need)
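The Noam schedule warms the learning rate up linearly for a number of warmup steps, then decays it proportionally to the inverse square root of the step count. A standalone sketch (the constants here are assumptions for illustration, not this repo's actual settings):

```python
def noam_lr(step, d_model=256, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)"""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

In PyTorch this can be applied each step by writing the value into the optimizer's `param_group['lr']` entries.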
Other implementations

- (TensorFlow) Kyubyong's implementation
- (TensorFlow) acetylSv's implementation
- (PyTorch) soobinseo's implementation
Finally, I have to say this work is heavily based on Kyubyong's work, so if you are a TensorFlow user, you may want to check out his implementation. Also, feel free to give feedback!