A PyTorch implementation of Google's Tacotron end-to-end TTS system.
- 2018/09/15: Fix RNN feeding bug.
- 2018/11/04: Add attention mask and loss mask.
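As a rough sketch of what a loss mask does (the function name and shapes here are illustrative, not this repo's actual API): padded frames beyond each utterance's true length are zeroed out before the loss is averaged, so padding does not pull the loss down.

```python
import numpy as np

def masked_l1_loss(pred, target, lengths):
    """L1 loss that ignores zero-padded frames.

    pred, target: (batch, max_time, n_mels) arrays.
    lengths: true frame count of each utterance in the batch.
    """
    batch, max_time, _ = pred.shape
    # mask[b, t] is 1 for real frames, 0 for padding.
    mask = (np.arange(max_time)[None, :] < np.asarray(lengths)[:, None]).astype(pred.dtype)
    diff = np.abs(pred - target) * mask[:, :, None]
    # Average over real entries only, not over padding.
    return diff.sum() / (mask.sum() * pred.shape[2])
```

The same idea carries over to PyTorch tensors one-to-one.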
Install Python and PyTorch:
- python==3.6.5
- pytorch==0.4.1
The remaining packages below can be installed with `requirements.txt`.
```shell
# I recommend using a virtualenv.
$ pip install -r requirements.txt
```
- librosa
- numpy
- pandas
- scipy
- matplotlib
Data
Download LJSpeech provided by keithito. It contains 13,100 short audio clips from a single speaker, totaling approximately 24 hours.
Set the config.

Set `meta_path` and `wav_dir` in `hyperparams.py` to the paths of your downloaded LJSpeech metadata file and wav directory:

```python
meta_path = 'Data/LJSpeech-1.1/metadata.csv'
wav_dir = 'Data/LJSpeech-1.1/wavs'
```
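LJSpeech's `metadata.csv` is pipe-separated (`file id|raw transcript|normalized transcript`). A minimal sketch of turning it into (wav path, text) pairs (the function name is illustrative, not the repo's actual API):

```python
import os

def load_meta(meta_path, wav_dir):
    """Return (wav_path, normalized_text) pairs from LJSpeech's metadata.csv."""
    pairs = []
    with open(meta_path, encoding='utf-8') as f:
        for line in f:
            # Each line: file id | raw transcript | normalized transcript
            file_id, _, text = line.rstrip('\n').split('|')
            pairs.append((os.path.join(wav_dir, file_id + '.wav'), text))
    return pairs
```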
- Train

```shell
# If you have a pretrained model, add --ckpt <ckpt_path>.
$ python main.py --train --cuda
```
- Evaluate

```shell
# You can change the evaluation texts in hyperparams.py.
# ckpt files are saved in 'tmp/ckpt/' by default.
$ python main.py --eval --cuda --ckpt <ckpt_timestep.pth.tar>
```
The sample texts are based on the Harvard Sentences. See the samples in samples/, which were generated after 200k training steps. The model starts learning something at around 30k steps.
Differences from the original paper

- Data bucketing (the original Tacotron used a loss mask)
- Residual connection removed in the decoder CBHG
- Batch size set to 8
- Gradient clipping
- Noam-style learning rate decay (the schedule used in Attention Is All You Need)
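The Noam schedule warms the learning rate up linearly for a number of warmup steps, then decays it proportionally to the inverse square root of the step count. A standalone sketch (the constants here are assumptions for illustration, not this repo's actual settings):

```python
def noam_lr(step, d_model=256, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)"""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

In PyTorch this can be applied each step by writing the value into the optimizer's `param_group['lr']` entries.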
Other implementations

- (TensorFlow) Kyubyong's implementation
- (TensorFlow) acetylSv's implementation
- (PyTorch) soobinseo's implementation
Finally, I have to say this work is heavily based on Kyubyong's work, so if you are a TensorFlow user, you may want to check out his implementation. Also, feel free to give feedback!