Reproducing: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis (https://arxiv.org/pdf/1803.09017.pdf)
Python: '3.5.2'
numpy: '1.13.1'
tensorflow: '1.4'
Samples can be found here; two kinds of experiments were conducted:
- Conditioning on reference audio:
- BZ_440K.wav is an inference result from the model trained on Blizzard 2013 for 440K steps (batch_size=16); the conditioning reference audio is picked from its test set.
- LJ_448K.wav is an inference result from the model trained on LJ Speech for 448K steps (batch_size=16); the conditioning reference audio is also picked from its test set.
- Combinations of GSTs:
- normal.wav and slow.wav are two inference results from the model trained on LJ Speech; the only difference between them is the choice of style tokens used for the style embedding.
- high.wav and low.wav are another such pair.
Pretrained models for both datasets can be downloaded here (md5sum: f50940f500c35457cb3d8f8d041240fe). Note that the detailed settings differ and are listed below:
- pretrained_model_BZ:
- n_fft: 1024
- sample_rate: 16000
- pretrained_model_LJ:
- n_fft: 2048
- sample_rate: 22050
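For reference, a minimal sketch of what the corresponding settings in hyperparams.py might look like (the attribute names are assumptions; check the actual file):

```python
# Hypothetical excerpt of hyperparams.py -- names are assumptions, not the repo's exact ones.
# Settings matching pretrained_model_BZ (Blizzard 2013):
sample_rate = 16000   # audio sampling rate in Hz
n_fft = 1024          # FFT size for STFT / spectrogram extraction

# Settings matching pretrained_model_LJ (LJ Speech):
# sample_rate = 22050
# n_fft = 2048
```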
- Data Preprocess:
- Prepare wavs and transcription
- Example format:
Blizzard_2013|CA-MP3-17-138.wav|End of Mansfield Park by Jane Austen.
Blizzard_2013|CA-MP3-17-139.wav|Performed by Catherine Byers.
...
- Make TFrecords for faster data loading (a rough sketch of this step follows this list):
- Check parameters in hyperparams.py
- path information
- number of TFRecord partitions
- sample_rate
- fft points, hop length, window length
- number of mel filter banks
- Run:
python3 make_tfrecords.py
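The actual preprocessing lives in make_tfrecords.py; the sketch below only illustrates the general idea: parse the `dataset|wav|text` transcription lines, extract a log-mel spectrogram, and write TFRecords with the TF 1.x API. File names, feature keys, and signal parameters here are assumptions.

```python
import librosa
import numpy as np
import tensorflow as tf

SAMPLE_RATE, N_FFT, HOP_LENGTH, N_MELS = 16000, 1024, 256, 80  # assumed; see hyperparams.py

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def make_example(wav_path, text):
    """Build one tf.train.Example holding the text and a log-mel spectrogram."""
    wav, _ = librosa.load(wav_path, sr=SAMPLE_RATE)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS)
    mel = np.log(mel + 1e-5).astype(np.float32).T          # (frames, n_mels)
    return tf.train.Example(features=tf.train.Features(feature={
        'text': _bytes_feature(text.encode('utf-8')),
        'mel': _bytes_feature(mel.tobytes()),
        'mel_frames': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[mel.shape[0]])),
    }))

# Parse "dataset|wav_file|text" lines and write a single shard.
writer = tf.python_io.TFRecordWriter('train_00.tfrecord')
for line in open('transcription.txt'):
    dataset, wav_file, text = line.strip().split('|', 2)
    writer.write(make_example(wav_file, text).SerializeToString())
writer.close()
```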
- Train the whole network:
- Check the log directory and the model/summary settings in hyperparams.py
- Run:
python3 train.py
- Evaluation while training:
- (Currently evaluation is only run on the first batch_size examples)
- (The decoder RNN is manually run in "feed previous" mode now; see the sketch after this section)
- Run:
python3 eval.py
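Here "feed previous" means the decoder RNN consumes its own previous output frame instead of the ground-truth frame (teacher forcing). A framework-agnostic sketch of the idea; `decoder_step` and the zero <GO> frame are assumptions, not the repo's actual API:

```python
import numpy as np

def feed_previous_decode(decoder_step, init_state, n_mels=80, max_steps=800):
    """Greedy autoregressive decoding: feed back the predicted frame each step."""
    prev_frame = np.zeros(n_mels, dtype=np.float32)   # <GO> frame
    state = init_state
    outputs = []
    for _ in range(max_steps):
        # During training the decoder would receive the ground-truth previous
        # frame (teacher forcing); at eval time we feed back the prediction.
        frame, state, stop = decoder_step(prev_frame, state)
        outputs.append(frame)
        if stop:                                       # stop token fired
            break
        prev_frame = frame
    return np.stack(outputs)
```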
- Inference:
- Check the inference input text in hyperparams.py
- Pass the reference audio path as an argument
- Reference audio: an arbitrary .wav file
- To condition directly on a combination of GSTs instead, set the flag below in infer.py
condition_on_audio = False
and set the combination weights you like (a minimal sketch of forming such a combination is shown after this section)
- Run:
python3 infer.py [ref_audio_path]
- Inference input text example format:
0. Welcome to N. T. U. speech lab
1. Recognize speech
2. Wreck a nice beach
...
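With condition_on_audio = False, the style embedding comes from a manual combination of the learned tokens rather than from the reference encoder. A minimal sketch of such a weighted combination (the token count, the tanh activation, and the variable names are assumptions based on the paper, not necessarily the repo's code):

```python
import numpy as np

# Assume 10 learned tokens and a 256-dim style embedding (single attention head).
gst_tokens = np.random.randn(10, 256).astype(np.float32)   # stand-in for trained tokens

# Choose which tokens to use and how strongly, e.g. mostly token 3 plus a bit of token 7.
weights = np.zeros(10, dtype=np.float32)
weights[3], weights[7] = 0.8, 0.2

# Style embedding = weighted sum of (tanh-activated) tokens; downstream it is
# combined with every text-encoder state, as in the paper.
style_embedding = weights.dot(np.tanh(gst_tokens))          # shape: (256,)
```

With condition_on_audio = True, the same 256-dim vector would instead be produced by attention over the tokens driven by the reference encoder output.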
- In experiments 6-1 and 6-2, the paper points out that one could SELECT some tokens, scale them, and then feed the resulting style embedding into the text encoder. But in section 3.2.2, multi-head attention is used and each token is set to 256/h dims. If so, at inference time a selected token only has 256/h dims, whereas the text encoder should be fused with a 256-dim vector. Moreover, if one chooses multi-head attention, the style embedding becomes the concatenation of each head's attention result passed through a linear network, rather than a weighted sum of the GSTs. I do not really understand whether, in this case, one can simply SELECT some combination of GSTs to represent the style embedding (see the dimension sketch after these notes).
- Using phone sequences as input did not give better results or faster convergence.
- Dynamic bucketing with a feed_previous decoder RNN does not seem possible (tf.split, tf.unstack, tf.shape, get_shape().as_list(), slicing... none of these seem to work).
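To make the dimension question above concrete, here is a toy numpy check assuming h = 4 heads and a 256-dim conditioning vector (this simplification attends over the raw tokens per head and ignores per-head key/value projections):

```python
import numpy as np

h, d_model, n_tokens = 4, 256, 10
d_token = d_model // h                         # each token is 256/h = 64-dim

tokens = np.random.randn(n_tokens, d_token)    # (10, 64) learned GSTs

# A single selected token is only 64-dim, but the text encoder expects 256-dim.
print(tokens[3].shape)                         # (64,)

# With multi-head attention, each head attends over the tokens and the h
# head outputs are concatenated, recovering a 256-dim style embedding.
head_weights = np.random.rand(h, n_tokens)
head_weights /= head_weights.sum(axis=1, keepdims=True)   # per-head attention weights
style_embedding = np.concatenate([w.dot(tokens) for w in head_weights])
print(style_embedding.shape)                   # (256,)
```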
- Find an adequate dataset (Blizzard 2013)
- (Failed) Implement feed_previous function in decoder RNN
- Input phone seqs instead of character seqs
- WaveNet vocoder