Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation


Implementation Status and planned TODOs

Rayhane-mamah opened this issue

This umbrella issue tracks my current progress and discusses the priority of planned TODOs. It has been closed since all objectives have been hit.

Goal

  • achieve a high quality human-like text to speech synthesizer based on DeepMind's paper
  • provide a pre-trained Tacotron-2 model (Training.. checking this still)

Model

Feature Prediction Model (Done)

  • Convolutional-RNN encoder block
  • Autoregressive decoder
  • Location Sensitive Attention (+ smoothing option)
  • Dynamic stop token prediction
  • LSTM + Zoneout
  • reduction factor (not used in the T2 paper)

Wavenet vocoder conditioned on Mel-Spectrogram (Done)

  • 1D dilated convolution
  • Local conditioning
  • Global conditioning
  • Upsampling network (by transposed convolutions)
  • Mixture of logistic distributions
  • Gaussian distribution for waveforms modeling
  • Exponential Moving Average (train + synthesis)

Scripts

  • Feature prediction model: training
  • Feature prediction model: natural synthesis
  • Feature prediction model: ground-truth aligned synthesis
  • Wavenet vocoder model: training (ground truth Mel-Spectrograms)
  • Wavenet vocoder model: training (ground truth aligned Mel-Spectrograms)
  • Wavenet vocoder model: waveforms synthesis
  • Global model: synthesis (from text to waveforms)

Extra (optional):

  • Griffin-Lim (as an alternative vocoder)
  • Reduction factor (speed up training, reduce model complexity + better alignment)
  • Curriculum-Learning for RNN Natural synthesis. paper
  • Post processing network for Linear Spectrogram mapping
  • Wavenet with Gaussian distribution (reference)

Notes:

All models in this repository will be implemented in Tensorflow in a first stage, so if you want to use a Wavenet vocoder implemented in Pytorch, you can refer to this repository, which shows very promising results.

Just putting down some notes about the last commit (7e67d8b) to explain the motivation behind such major changes, and to check with the rest of you that I didn't make any silly mistakes (as usual..)

This commit mainly had 3 goals: (other changes are minor)

  • Clean the code: I added some comments and changed the code architecture to use Tensorflow's attention wrapper, with the objective of reducing the number of files. Even though I tried getting rid of the "custom_decoder" and the custom "dynamic_decode" I'm currently using, after diving deep into Tensorflow's implementation I found it impossible to adapt my dynamic <stop_token> prediction to Tensorflow's ready-to-use "BasicDecoder" and "dynamic_decode" with my custom helpers.
  • Correct the Attention: Even though they call it "location sensitive attention" in the paper, they didn't mean the "location based attention" we know; instead, they were referring to the "hybrid" attention. For this hypothesis I'm relying on this part of the paper: "We use the location sensitive attention from [21], which extends the additive attention mechanism [22] to use cumulative attention weights from previous decoder time steps as an additional feature.", which suggests they took the original Bahdanau attention and added location features to it.
  • Added "map" (log) file at synthesis: Mainly, this will map each input sequence, to the corresponding real Mel-Spectrograms and generated ones.

I also want to bring attention to these few points (in case someone wants to argue them):

  • I impute finished sequences at decoding time to ensure the model doesn't have to learn to predict paddings (which would probably result in extra noise in the generated waveforms later)
  • If I'm not mistaken, the paper's authors used the projection to a scalar + sigmoid to explicitly predict a "<stop_token>" probability, since our feature prediction model isn't performing a classification task where it can choose to output a real <stop_token>. I like to think of it as a small binary classifier that decides when to stop decoding, since a vanilla decoder can't output a frame of all zeros. (A small sketch of such a projection follows this list.)
  • I am only using 512 LSTM units per decoder layer, as I supposed "The pre-net output and attention context vector are concatenated and passed through a stack of 2 uni-directional LSTM layers with 1024 units." means that the 1024 units are distributed across the 2 layers.
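As a rough sketch of that stop-token idea (TF 1.x style layers; the function name and the exact inputs are illustrative assumptions, not the repository's code):

import tensorflow as tf

def stop_token_projection(decoder_output_and_context):
    # Project the concatenated [decoder LSTM output; context vector] to a single
    # scalar per decoder step; a sigmoid turns it into a "stop" probability.
    logit = tf.layers.dense(decoder_output_and_context, units=1, name='stop_token_projection')
    stop_prob = tf.nn.sigmoid(logit)
    # Train on the logit with sigmoid cross-entropy against {0, 1} stop targets,
    # and stop decoding at synthesis time once stop_prob crosses a threshold (e.g. 0.5).
    return logit, stop_prob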

Hi @Rayhane-mamah, using 7e67d8b I got an error (in the end): you changed the call-parameter name from previous_alignments to state in attention.py:108.

Was that on purpose? AttentionWrapper from TF requires the parameter to be named previous_alignments. (Using TF 1.4)

Changing that back to previous_alignments results in other errors:

ValueError: Shapes must be equal rank, but are 2 and 1 for 'model/inference/decoder/while/BasicDecoderStep/decoder/output_projection_wrapper/output_projection_wrapper/concat_lstm_output_and_attention_wrapper/concat_lstm_output_and_attention_wrapper/multi_rnn_cell/cell_0/cell_0/concat_prenet_and_attention_wrapper/concat_prenet_and_attention_wrapper/attention_cell/MatMul' (op: 'BatchMatMul') with input shapes: [2,1,?,?], [?,?,512].

Any ideas?

Hi @imdatsolak, thanks for reaching out.

I encountered this problem on one of my machines, updating tensorflow to latest version solved the problem. (I changed the parameter to state according to latest tensorflow attention wrapper source code, I also want to point out that I am using TF 1.5 and confirm that attention wrapper works with "state" for this version and later).

Try updating tensorflow and keep me notified, I'll look into it if the problem persists.

@Rayhane-mamah, I tried with TF 1.5, which didn't work. Looking into TF 1.5, the parameter was still called previous_alignments. The parameter's name changed in TF 1.6 to state, so I installed TF 1.6 and it works now. Thanks!

Upgrading to TF1.6 (was 1.5) solved issue (TypeError: call() got an unexpected keyword argument 'previous_alignments') for me.

@imdatsolak, yes my bad. @danshirron is perfectly right. I checked that my version is 1.6 too (i don't remember updating it Oo)

Quick notes about the latest commit (7393fd5):

  • Corrected parameter initialization, which was causing gradient explosions in some cases (now using a Xavier initializer)
  • Added gradients norm visualization
  • Changed the learning rate decay to start from step 0 (instead of 50000) and added a visualization of the learning rate
  • Corrected typos in "hparams.py"
  • Changed alignment plot directories and added real + predicted Mel-Spectrogram plots (every 100 training steps)
  • Added a small Jupyter notebook where you can use Griffin-Lim to reconstruct phase and listen to the audio reconstructed from generated Mel-spectrograms (just to monitor the model's learning state without paying much attention to audio quality, as we will use Wavenet as a vocoder)
  • Started using a reduction factor (despite it not being used in Tacotron-2) as it speeds up the training process (faster computation) and allows for faster alignment learning (current: r=5, feel free to change it). A tiny illustration of the reduction factor follows this list.
  • Corrected typos in preprocessing (Make sure to restart the preprocessing before training your next model)
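A tiny illustration of the reduction factor mentioned above (toy numbers; outputs_per_step is the corresponding hparam, everything else here is illustrative):

import numpy as np

batch_size, decoder_steps, r, num_mels = 2, 6, 5, 80
# The decoder emits r mel frames per decoding step...
decoder_outputs = np.zeros((batch_size, decoder_steps, r * num_mels))
# ...which are reshaped back into the frame-level mel sequence.
mel_outputs = decoder_outputs.reshape(batch_size, decoder_steps * r, num_mels)
print(mel_outputs.shape)  # (2, 30, 80)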

Side notes:

  • Alignment should appear around step 15k; audio becomes quite audible at 4~5k steps (using a batch size of 32) but fully understandable around 8~10k steps.
  • Mel-spectrograms seem very blurry at the beginning, and despite the loss not decreasing much (you may even feel it's constant after 1k steps) the model will still learn to improve speech quality so be patient.

If there are any problems, please feel free to report them, I'll get to it as fast as possible

Quick review of the latest changes (919c96a):

  • Global code reorganization (for easier modifications and it's just cleaner now)
  • Network Architecture review: Since there are some unclear points in the paper, I am doing my best to collect enough information from all related works, and trying to put them all together to get reasonable results. The current architecture is the closest I got to the described T2 (I think.. ^^')
  • Pulled the <stop_token> prediction out of the decoder and got rid of the custom "dynamic_decode".
  • Reduced the model size and added new targets (stop token targets are now prepared in the feeder)
  • Adapted <stop_token> prediction to work properly with the reduction factor. (multiple <stop_token> predictions at each decoding step)
  • Doubled the number of LSTM units in the decoder and the number of neurons in the prenet. On the other hand, I removed the separate attention LSTM and started using the first decoder LSTM's hidden state as the query for the attention.

Side Notes:

  • Despite slightly reducing the memory usage of the model, the impact on training speed is still not clear. Forward propagation got slightly faster and back propagation slightly slower, but the overall speed seems the same.

If anyone tries to train the model, please think about providing us with some feedback. (especially if the model needs improvement)


Hi @Rayhane-mamah, thanks for sharing your work.

I cannot get a proper Mel-spectrogram prediction or audible waves from evaluation or natural synthesis (no GTA) at step 50k.
All hparams are the same as in your code (with the LJSpeech DB) and the waves are generated by mel prediction, mel_to_linear, and Griffin-Lim reconstruction.
GTA synthesis generates audible results.

Does it work in your experiments?

I attached some Mel-spectrogram plot samples for the following sentences.

1: "In Dallas, one of the nine agents was assigned to assist in security measures at Love Field, and four had protective assignments at the Trade Mart."

[Ground truth, GTA and natural (eval) Mel-spectrogram plots]

2: "The remaining four had key responsibilities as members of the complement of the follow-up car in the motorcade."

[Ground truth, GTA and natural (eval) Mel-spectrogram plots]

3: "Three of these agents occupied positions on the running boards of the car, and the fourth was seated in the car."

[Ground truth, GTA and natural (eval) Mel-spectrogram plots]

Hello @ohleo, thank you for trying our work and especially for sharing your results with us.

The problem you're reporting seems to be the same as the one @imdatsolak mentioned here.

There are two possible reasons I can think of right now:

  • Your model after 50k steps still has an ugly alignment (hopefully this commit takes care of that). That's the most probable reason, I think.
  • I am unknowingly and indefinitely passing the first frame to the decoder in my code. I will triple check this today (in case TacoTestHelper is the cause).
  • It can't possibly be doing a massive overfit on the first generated frame, can it? Oo The output looks the same for the three sentences!

The fact that GTA is working fine strongly suggests the problem is in the helper.. I will report back to you later tonight.
If your setup is powerful enough, you could try retraining the model using the latest commit, or wait for me to test it myself a bit later this week.

In all cases, thanks a lot for your contribution, and hopefully we get around this issue soon.

Hello, @Rayhane-mamah ,

did you get any further information running the latest code?

Hello @unwritten, thanks for reaching out.
I believe you asked about GTA as well? I'm just gonna answer it anyway in case anyone gets the same question.

GTA stands for Ground Truth Aligned. Synthesizing audio with GTA basically uses teacher forcing to help the model predict Mel-spectrograms. If you aim to use the generated spectrograms to train a vocoder like Wavenet, then this is probably how you want to generate your spectrograms for now. It is important to note, however, that in a fully end-to-end test case you won't be given the ground truth, so you will have to use "natural" synthesis, where the model simply looks at its last predicted frame to output the next one (i.e. with no teacher forcing).
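As a rough illustration of the difference (a sketch only; decoder_step, the frame format and the stop threshold are hypothetical placeholders, not the repository's helpers):

def synthesize(decoder_step, initial_frame, ground_truth_mels=None, max_steps=1000):
    # ground_truth_mels given  -> GTA synthesis (teacher forcing)
    # ground_truth_mels absent -> natural synthesis (feed back predictions)
    outputs, prev_frame = [], initial_frame
    for t in range(max_steps):
        frame, stop_prob = decoder_step(prev_frame, t)
        outputs.append(frame)
        if ground_truth_mels is not None:
            prev_frame = ground_truth_mels[t]      # feed the real previous frame
            if t + 1 >= len(ground_truth_mels):
                break                              # stop exactly with the ground truth
        else:
            prev_frame = frame                     # feed back the model's own output
            if stop_prob > 0.5:                    # dynamic <stop_token> prediction
                break
    return outputs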

Until my last commit, the model wasn't able to use natural synthesis properly, and I was mainly suspecting the attention mechanism because, well, how is the model supposed to generate correct frames if it doesn't attend to the input sequence correctly? Which brings us to your question.

So after a long weekend of debugging, it turned out that the attention mechanism is just fine, and that the problem might have been with some Tensorflow scopes or whatever (I'm not really sure what the problem was). Anyway, after going back through the entire architecture, trying some different preprocessing steps and replacing zoneout LSTMs with vanilla LSTMs, the problem seems to be solved. (I'm not 100% sure yet, as I have not trained the model very far, but things seem as they should be in the early stages of training.)

I will update the repository in a bit (right after doing some cleaning), and there will be several references to the papers the implementation was based on. These papers will be in pdf format in the "papers" folder, so it's easier to find them if you want an in-depth look at the model.

I will post some results (plots and griffin lim reconstructed audio) as soon as possible. Until then, if there is anything else I can assist you with, please let me know.

Notes:

  • It is possible for now to use the griffin lim algorithm (using the provided notebook) to do a basic inversion of the mel spectrogram to waveform. The quality won't be as good as Wavenet's but it's mainly for test and debug purposes for now.
  • Spectrograms generated using "synthesize.py" will be stored under the "output" folder. Depending on the synthesis mode you used, there will be many possible sub-folders.
  • I have not yet added the Wavenet vocoder to this repository as there are more important things at the moment, like ensuring good spectrogram generation. There are good Wavenet implementations out there that are conditioned on Mel-spectrograms, like r9y9/wavenet.

Hello again @unwritten.

As promised I pushed the commit that contains the rectifications (c5e48a0).

Results, samples and pretrained model will be coming shortly.

@Rayhane-mamah

Results, samples and pretrained model will be coming shortly.

Trying to understand "shortly", do you think they'll be out today, next week or next month?

@PetrochukM, I was thinking more like next year.. that still counts as "shortly" I guess..

Enough messing around, let's say it will take a couple of days.. or a couple of weeks :p But what's important, it will be here eventually.

Hi everybody, here is a new dataset that you can use to train Speech Recognition and Speech Synthesis: M-AILABS Speech Dataset. Have fun...

@Rayhane-mamah thanks for the work;
I have tried to train the latest commit, maybe before 81b657d; I pulled the code about 2 days ago. Currently it has run to about 4k, and the alignment doesn't look like it's there. I will try the newest code though:
step-45000-pred-mel-spectrogram
step-45000-real-mel-spectrogram

step-45000-align

Hi @imdatsolak, thank you very much for the notification. I will make sure to try it out as soon as possible.

@unwritten, I experienced the same issue with the commit you're reporting.

If you really don't want to waste your time and computation power on failed tests, you could wait a couple of days (at best) or a couple of weeks (at worst) until I post a 100% sure-to-work model, semi-pretrained, which you can train further for better quality (I don't have the luxury to train for many steps at the moment, unfortunately).

Thank you very much for your contribution. If there is anything I can help you with or if you notice any problems, feel free to report back.

@Rayhane-mamah thanks for the work;
why does the loss descend much more quickly than in Tacotron-1?

Hello @maozhiqiang, thank you for reaching out.

In comparison to Tacotron-1, which uses a simple summed L1 loss function (or MAE), in Tacotron-2 we use a summed L2 loss function (or MSE). (In both cases the sum is over the predictions before and after the postnet.) I won't pay much attention to the averaging along the batch here, for simplicity.

Let's take a look at both losses (h(x_i) stands for the model's estimate):

L1 = ∑_i |y_i − h(x_i)|
L2 = ∑_i (y_i − h(x_i))²

The L1 loss computes the residual between your model's predictions and the ground truth and returns its absolute value as is. The L2 loss, however, squares this error for each sample instead of simply returning the absolute difference.
Now consider that your model starts from an initial state t0 where weights and biases are initialized randomly. Naturally, the first model outputs will be essentially random, which results in a high L1 loss that is amplified even further by the square operation in L2 (assuming the initial loss is greater than 1).
After a few steps of training, the model should start emitting outputs that are in the range of the correct predictions (especially if your data is [0, 1] normalized like in our case; the model doesn't take long to start producing outputs in that range). This can be seen in the blurry, yet seemingly close to real, spectrograms the model provides every 100 steps.
At this stage, the L1 and L2 loss functions start showing very different values. Take a difference (y_i − h(x_i)) smaller than 1 and compute its square: naturally you get an even smaller value. So once the model starts giving outputs in the correct range, the L2 loss is already very low compared to the L1 loss, which does not square the error.
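A tiny numeric illustration of that argument (not from the repository):

residuals = [2.0, 0.5, 0.1]
for r in residuals:
    print(f"|r| = {r:.2f}   L1 term = {abs(r):.4f}   L2 term = {r**2:.4f}")
# |r| = 2.00   L1 term = 2.0000   L2 term = 4.0000   (amplified while the error is > 1)
# |r| = 0.50   L1 term = 0.5000   L2 term = 0.2500   (shrunk once the error is < 1)
# |r| = 0.10   L1 term = 0.1000   L2 term = 0.0100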

Note: Next the model will only have to improve the vocal patterns, which consist of small adjustments, which explains why the loss then starts decreasing very slowly.

So mainly, what I'm trying to point out here is that we are not using the same loss function as in Tacotron-1, which I believe is the main reason for such a difference. However, there are other factors, like the difference in model architecture, or even the difference in the target itself (in Tacotron-1, we predict both Mel-spectrograms and Linear spectrograms using the post-processing net).

I believe this answers your question? Thanks again for reaching out, if there is anything else I can assist you with, please let me know.

Hello @Rayhane-mamah, thanks for your detailed reply.
I started training with your code these days.
Here are my training figures:
step-27000-align
step-27000-pred-mel-spectrogram
step-27000-real-mel-spectrogram
When I run for more than one hundred thousand steps, the difference between the predicted mel and the real mel is still great, but the loss is around 0.03 or smaller.
Is there any problem with this?
Looking forward to your reply, thank you.

Here is empirical evidence for @Rayhane-mamah 's reasoning.

[Loss comparison plot]

The yellow line uses the loss function of Tacotron-1, the brown line uses the loss function of Tacotron-2. The loss of the brown line is roughly the square of the loss of the yellow one (and they intersect at 1.0!)

Hello.
I'm working on Tacotron-2, and my work is based on Keithito's implementation. Recently, I have been trying to move to your implementation for a few reasons.

There is one fundamental difference between @Rayhane-mamah's TacotronDecoderCell and tensorflow.contrib.seq2seq.AttentionWrapper, which Keithito used. AttentionWrapper uses the previous output (mel spectrogram) AND the previous attention (= context vector), but yours only uses the previous outputs.

My modified version of Keithito's impl can make a proper alignment, but yours cannot (or your impl just requires more steps to make a good alignment). I suspect the above-mentioned difference is responsible for this result.

(One strange behavior of your implementation is that the quality of synthesized samples on the test set is quite good, though their alignments are poor. With Keithito's implementation, without proper alignment, the test loss is really huge.)

Do you have any idea about this? (Which one is right, concatenating previous attention or not?)

hello @maozhiqiang and @a3626a , Thank you for your contribution.

@maozhiqiang, the loss you're reporting is perfectly normal; actually, the smaller the loss the better, which explains why the further you train your model, the better the predicted Mel-spectrograms become.

The only apparent problem, which is also reported by @a3626a, is that the current state of the repository (the current model) isn't able to capture a good alignment.

@maozhiqiang, alignments are supposed to look something like this:
step-25000-align

Now, @a3626a, about that repository comparison, I made these few charts to make sure we're on the same page, and to make it easier to explain (I'm bad with words T_T).

Please note that, for simplicity, the encoder outputs call, the <stop_token> prediction part and the recurrent call of previous alignments are not represented.
If you notice any mistakes, please feel free to correct me:

Here's my understanding on how keithito's Decoder works:
tacotron-1-decoder

The way I see it, he is using an extra stateful RNN cell to generate the query vector at each decoding step (I'm assuming this is based on T1 where 256-GRU is used for this purpose). He's using a 128-LSTM for this RNN.

As you stated, the last decoder step's outputs are indeed concatenated with the previous context vector before being fed to the prenet (this is automatically done inside Tensorflow's attention_wrapper).
Please also note that in the "hybrid" implementation keithito is using, he does not concatenate the current context vector with the decoder RNN output before doing the linear projection (just pointing out another difference in the architecture).

Now, here's what my decoder looks like:
tacotron-2-decoder

In this chart, the blue and red arrows (and terms in equations) represent two different implementations I tried separately for the context vector computation. Functions with the same name in both graphs represent the same layers (look at the end of the comment for a brief explanation about each symbol).

The actual state of the repository is the one represented in blue, i.e. I use the last decoder RNN output as the query vector for the context vector computation. I also concatenate the decoder RNN output and the computed context vector to form the projection layer input. (A small numpy sketch of this data flow is given right below.)
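A minimal numpy sketch of that wiring (the blue variant: decoder LSTM output as attention query, then [output; context] projected). All layers here are toy stand-ins with random weights, purely to show the data flow, not the repository's implementation:

import numpy as np

rng = np.random.default_rng(0)
enc_T, enc_dim, dec_dim, mel_dim = 12, 8, 8, 4   # enc_dim == dec_dim keeps the toy attention simple
W_p = rng.normal(size=(mel_dim, dec_dim))            # "prenet" p() (one linear layer here)
W_s = rng.normal(size=(dec_dim + enc_dim, dec_dim))  # "decoder LSTM" rec() stand-in
W_o = rng.normal(size=(dec_dim + enc_dim, mel_dim))  # output projection f()

def attend(query, memory):
    # Toy content-based attention (no location features here).
    scores = memory @ query                          # [enc_T]
    a = np.exp(scores - scores.max()); a /= a.sum()  # alignments a_i
    return a @ memory, a                             # context c_i, alignments a_i

def decoder_step(y_prev, c_prev, memory):
    p_y = np.tanh(y_prev @ W_p)                          # p(y_{i-1})
    s_i = np.tanh(np.concatenate([p_y, c_prev]) @ W_s)   # s_i = rec([p_y; c_{i-1}])
    c_i, a_i = attend(s_i, memory)                       # query = decoder output s_i (blue arrows)
    y_i = np.concatenate([s_i, c_i]) @ W_o               # y_i = f([s_i; c_i])
    return y_i, c_i, a_i

memory = rng.normal(size=(enc_T, enc_dim))  # encoder hidden states h
y, c = np.zeros(mel_dim), np.zeros(enc_dim)
for _ in range(3):                          # a few unrolled decoding steps
    y, c, a = decoder_step(y, c, memory)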

Now, after reading your comments (and thank you for your loss plot, by the way), two possible versions came to mind when thinking about your modified version of keithito's tacotron:

First and most likely one:
In case you used Tensorflow's attention_wrapper to wrap the entire decoder cell, then this chart should probably explain how your decoder is working:
tacotron-hypothesis-1-decoder

here I am supposing that you are using the previous context vector in the concatenation operations. (c_{i-1}) and then update your context vector at the end of the decoding step. This is what naturally happens if you wrap the entire TacotronDecoderCell (without the alignments and attention part) with Tensorflow's attention_wrapper.

Second but less likely one:
If however you did not make use of the attention_wrapper, and do the context vector computation right after the prenet, this is probably what your decoder is doing:
tacotron-hypothesis-2-decoder

This actually seems weird to me because we're using the prenet output as the query vector. Let's just say I'm used to providing RNN outputs as query vectors for attention computation.

Is any of these assumptions right? Or are you doing something I didn't think of? Please feel free to share your approach with us! (words should do, no need for charts x) )

So, to wrap things up (so many wrapping..), I am aware that generating the query vector using an additional LSTM gives a proper alignment, I am however trying to figure out a way that doesn't necessarily use an "Extra" recurrent layer since it wasn't explicitly mentioned in T2 paper. (and let's be honest, I don't want my hardware to come back haunting me when it gets tired of all this computation).

Sorry for the long comment, below are the symbols explained:

  • p() is a multi-layered non-linear function (prenet)
  • e_rec() stands for Extra Recurrency (attention LSTM)
  • Attend() is typically the attention network (refer to (content+location) attention paper for developed formulas)
  • rec() is the decoder Recurrency (decoder LSTM)
  • f() is a linear transformation
  • p_y_{i}, s_{i}, es_{i}, y_{i}, a_{i} and c_{i} are the prenet output, decoder RNN hidden state, attention RNN hidden state, decoder output, alignments and context vector respectively (all at the i-th step).
  • h is the encoder hidden states (encoder outputs)

Note:
About the quality of synthesized samples on the test set, I am guessing you're referring to GTA synthesis? It's somewhat predictable, since GTA is basically 100% teacher-forced synthesis (we provide the true frame instead of the last predicted frame at each decoding step). Otherwise (for natural synthesis), the quality is very poor without alignment.

Most of all, thank you for your reply with nice diagrams.

  1. About quality of samples on test set.
    Though I have not tested, you are probably right. Teacher forcing was enabled in my system.

  2. About my implementation
    My implementation's structure is almost identical to Keithito's. By 'modified' I mean adding more regularization methods, speaker embedding, and a different language with a different dataset.

  3. My future approach
    I will follow your direction, getting rid of the extra recurrent layer for the attention mechanism. In my opinion, the 2-layer decoder LSTMs can do the job of the extra recurrent layer. I think what to feed into _compute_attention is the key, which is not clear in the paper (like you did, with the red arrow and blue arrow).
    For a start, I will feed the 'previous cell state of the first decoder LSTM cell'. There are 2 reasons for this choice. First, I expect the first LSTM cell to work as an attention RNN. Second, it seems better to feed the cell state, not the hidden state (output), because it does not require unnecessary transformations of the information. In other words, the hidden state (output) of the LSTM cell would be more like a spectrogram, not phonemes, so it would have to be converted back into phoneme-like data to calculate the energy (or score). In contrast, the cell state can hold phoneme-like data which can easily be compared to the encoder outputs (phonemes). (A tiny illustration of this choice follows this list.)
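A tiny TF 1.x illustration of the choice discussed in point 3 (toy shapes; this only shows where the cell state c and the output h live, it is not either repository's attention code):

import tensorflow as tf

cell = tf.nn.rnn_cell.LSTMCell(num_units=16)
prenet_out = tf.zeros([2, 8])                      # [batch, prenet_dim], toy values
state = cell.zero_state(batch_size=2, dtype=tf.float32)
output, state = cell(prenet_out, state)            # state is an LSTMStateTuple(c, h)
query_from_cell_state = state.c                    # the proposal above: feed c to the attention
query_from_output = output                         # the usual choice (equals state.h)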

Hello again and thank you for your answers.

"Speaker embedding" sounds exciting. I'm looking forward to hearing some samples once you're done making it!

About the attention, this is actually a nice interpretation! I can't test it out right now but I will definitely do! If you do try it out please feel free to share your results with us.

Thanks again for your contributions!

I'm testing feeding 'previous cell state of first decoder LSTM cell', I will share the result after 1-2 days.

Thank you.

Wow, nice thread;) I will follow the discussion here and would like to look into your code. Thank you for sharing your work!

First, I attached the results below. In conclusion, "feeding the first LSTM's cell state" does not work. According to 'Attention-Based Models for Speech Recognition', one RNN can produce the output and also the context vector (or glimpse). Therefore, I think it is possible to get rid of the extra RNN in Keithito's implementation (or Tacotron-1).

For the next trials, 1) I will feed the last LSTM's cell state, and 2) I will set the initial states of the decoder LSTMs as trainable parameters, not zeros. This is mentioned in 'Attention-Based Models for Speech Recognition'.

  • The results below come from my modified versions of Keithito's Tacotron-2 and Rayhane-mamah's Tacotron-2. No guarantee of repeatable results with the original implementations.

  • All results are produced by models that use reduction_factor=5. I have not succeeded in training a nice alignment with reduction_factor=1, with any implementation. Unfortunately, the reduction factor seems very important for integration with the WaveNet vocoder: I succeeded in training a WaveNet vocoder with mel spectrograms from the target wavs, but failed with mel spectrograms generated by ground-truth-aligned synthesis from Tacotron-1 with reduction_factor=5. I concluded that the generated mel spectrograms could not be aligned well because of the high reduction_factor.

  • The dataset is Korean. The encoder/decoder axes are switched.

  1. Keithito's (+more regularization methods listed in the Tacotron2 paper)
    keithito

  2. Rayhane's
    rayhane

  3. Rayhane's with 'feeding first cell state'
    rayhane

@a3626a, will you share your modified taco2 repo? I agree that training with r=1 is hard on Keithito's repo. I assume you are training soft attention; did you ever try hard attention?

  1. I won't share my code, but structures, hyper-parameters, and generated samples can be shared.

  2. I'm focusing on reproducing Tacotron-2, so I am training soft attention like Tacotron-2. I have not tried hard attention on TTS.

Hello everyone, good news, it's working and it's faithful to the paper! (d3dce0e)

step-3400-align

First of all, sorry for taking too long, I have been dealing with a leak in my watercooling so I wasn't really able to do much work in the past few days.. (it's 31 March, 11:50 PM where I live so technically it's still not the end of the month yet.. So, I'm right on time :p )
Hello @r9y9 , thanks for joining us, I wonder where I got the idea of this open discussion from. :p

Anyway, as you can see in the previous plot, the attention, or rather the entire model, is working correctly now (hopefully). For the LJSpeech dataset, alignments start appearing at 2k steps and are practically learned at 3k steps. They are still, however, a little noisy for long sentences, but I expect them to become better with further training. You can even notice that, at early stages, the model pays more attention to areas around the second diagonal than to the rest of the matrix (refer to the drive down below).
As for the speech quality, we start understanding some words even before 1k steps but the overall audio isn't quite audible. At 3k4 steps, audio is pretty understandable, but outputs are still noisy, and more training is needed to get noise free outputs. (samples down below)

In the next few hours, I will release a document containing an in depth explanation of most of what is implemented in this repository. It's mainly for anyone who wants to have an in depth understanding of the network, or wants to know what's going on exactly in order to adapt this implementation to similar work or maybe build on it! (@a3626a it will cover the explanation about my current attention modifications based not only on Luong and Bahdanau papers, but also on Tensorflow attention tutorial.)

I do however want to draw your attention to some key points you need to be aware of before training your next model:

  • Mel data distribution: First, if you preprocess the ljspeech dataset using the last commit, you will notice that the data distribution has changed. In fact, I modified the normalization function in the preprocessing to rescale the targets differently. In hparams.py, you will notice symmetric_mels and max_abs_value, which let you choose how to rescale your targets. If symmetric is set to True, your targets will be distributed across [-max_abs_value, max_abs_value]; otherwise they will take values in [0, max_abs_value]. Defaults to [-4, 4]. I know such a choice may seem arbitrary (an in-depth explanation will be provided in the document), but this maneuver proved to speed up training at early stages by putting more penalization on the model's mel outputs, mainly due to the nature of the loss function. (Consult the mel spectrograms provided down below; a rough rescaling sketch is also given after this list.)

  • Wavenet integration: Due to the preceding point, taking the T2 tacotron output directly to some pretrained Wavenet model like r9y9/wavenet is impossible. There are possible solutions for that. One can simply rescale the mel outputs to fit the pretrained Wavenet: in the default case, shifting the data to [0, 8] and rescaling it to [0, 1] should do the trick, and I don't think it will affect the quality of the output in any way. If you do not have a pretrained Wavenet, I would suggest training the model directly on the T2 output; @r9y9 can confirm whether that's possible.

  • Training effectiveness: I also want to point out that the choice of the mel scaling is highly dependent on the data and especially on the language at hand. One can even choose to not normalize the data at all which I personally don't recommend unless you also change the regularization weights to avoid bias-ing (is that even a word..) your model. I also added a wav generation at each checkpoint step (stored under logs-Tacotron/wavs), I recommend listening to those and keep an eye on the alignments to know if your model is going the right way.

  • Batch size: We also experienced a case where the model was not able to capture attention when trained with a small batch size (8). This is most likely due to the noisy gradients that come with such small batch size, so we recommend that you train your models with a batch size of 32 or optimally 64.
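A rough sketch of the rescaling described in the first point above (assuming mel values already min-max normalized to [0, 1]; these are hypothetical helpers, not the repository's exact normalization code):

import numpy as np

def rescale_mels(mels01, max_abs_value=4.0, symmetric=True):
    # mels01: mel targets normalized to [0, 1].
    if symmetric:
        return (2.0 * mels01 - 1.0) * max_abs_value   # -> [-max_abs_value, max_abs_value]
    return mels01 * max_abs_value                      # -> [0, max_abs_value]

def rescale_for_pretrained_wavenet(mels_sym, max_abs_value=4.0):
    # Shift [-4, 4] -> [0, 8], then rescale to [0, 1], as suggested above.
    return (mels_sym + max_abs_value) / (2.0 * max_abs_value)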

Sorry for the long comment as always, I will simply finish by giving a heads up on what's coming up next:

Immediate:

  • Full integration and support for the M-AILABS dataset proposed by @imdatsolak. I have to admit that this is an amazing work you've done sir, and the female english data sounds awesome! Thank you very much for this large speech corpus!
  • Reorder the repository to prepare for Wavenet integration (code linking the two models, train and test pipelines, etc.)

Right After:

  • First attempt at integrating Wavenet as a vocoder for a human like TTS quality. First versions will mainly be an adaptation of r9y9/wavenet work to Tensorflow. This decision is due to the very promising results this repository has achieved so far.
  • Provide a pretrained Feature prediction Model.

Optional:

  • Add an optional post processing network that maps predicted Mel Spectrograms to Linear Spectrograms. The only purpose of adding those few layers would be to use the linear spectrograms as inputs to the griffin-lim inversion algorithm since it yields better quality than inverting mels due to their lossy nature. This will mainly be based on keithito's work.

Finally, here is the semi-trained T2 model (if we can call it that) used to generate the previous picture with all equivalent plots and wavs generated during training. I was not able to train the model much longer since my machine is actually disabled.. The model was trained on CPU so don't pay much attention to training time, it should go way faster with a GPU!
Logs folder provided should also give you the ability to consult some Tensorboard stats.

If you do train a model using my work, please feel free to share your plots, observations, samples and even trained models with any language you like!

In case you encounter any problems, please notify me, I will get right to it!

@Rayhane-mamah, did you try outputs_per_step = 1? I can get alignment when outputs_per_step = 5, but not with outputs_per_step = 1. Does this mean the longer sequence is too hard to train?

@unwritten, No I have not tried that out yet but I am most interested in knowing the answer.

If with the last commit, alignments are not learned with a reduction factor of 1, the first thing to suspect is indeed the sequence length. In that case, one could try to mask the input paddings and impute output paddings (the relevant hparams are provided).

As for the impact of the reduction factor on the wavenet quality, I am not sure if it is related or not. Working on the issue is a top priority so I will keep you informed. @r9y9, are you aware of anything like this?

In any case, thank you guys for reporting this; hopefully we find a way around it.

@a3626a, out of curiosity, for how long did you train the Tacotron-1 model whose outputs were used with the WaveNet?
And the failure case you're referring to: is it related to the WaveNet not being able to reproduce high-quality audio, or to the model losing all language/vocal information and emitting random high-quality voices?

  1. Steps of Tacotron-1
    Between 100,000~200,000 with batch size 32, 100 speakers, 1h for each speaker. Audio quality after Griffin-Lim was not bad, perfectly audible, but little noisy (like samples from paper).

  2. WaveNet loses all vocal information.
    However, feeding the mel spectrogram generated from the target waveform works well. (Everything except the input of the WaveNet was the same.)
    I am sure that ground-truth alignment (teacher forcing) was enabled during training.

@a3626a, I see, I will try to reproduce the issue as soon as possible and tell you how it goes.

EDIT:
@a3626a, you said teacher forcing was enabled during training. What about synthesis time? Did you visualize the predicted mels? Did you generate the outputs by feeding previously generated frames back to the decoder? If that's the case, is it possible to try synthesizing new mels with teacher forcing (like the GTA option in my repo)?

Quick observations sharing:

All results down below are generated using a T2 model from this repository trained on the LJSpeech dataset for 6k4 steps, and are prone to become much better with further training!
All results down below are generated in natural mode ("eval" mode) with no teacher forcing, on test sentences absent from the training data! (Check hparams.py.)
At this stage of training (which is still considered early), the model still has some pronunciation issues (e.g. the cases of "I" and "Y") (check the temporary audio samples).
The corresponding sentences are written inside the plots.

Let's start with something simple:

  • Sentence 1:
    ljspeech-alignment-00002
    ljspeech-mel-00002

Sentence 2:
ljspeech-alignment-00003
ljspeech-mel-00003

Despite being dependent on the previous cumulative alignments, the model managed to make a good alignment even with no ground truth feeding. Even without looking at the Mels or listening to the wav, one could deduce that the model is probably emitting a nice output.

Spectrogram plots generated during this evaluation are very similar to training spectrograms at 6k4 steps (which is used for the evaluation).

In the next examples, I want to bring your attention to the "extra" silence the model is emitting for no visible reason in the input sequence. This is probably due to the reading style in the dataset recordings:

  • Sentence 1:
    ljspeech-alignment-00006
    ljspeech-mel-00006

  • Sentence 2:
    ljspeech-alignment-00022
    ljspeech-mel-00022

Next, we evaluate the model on punctuation sensitivity:

  • Sentence 1:
    ljspeech-alignment-00012
    ljspeech-mel-00012

  • Sentence 2:
ljspeech-alignment-00013
ljspeech-mel-00013

It's pretty visible that the model simply adds some silence and attributes attention to the same token "," for multiple decoding steps when it is present.

Next, I wanted to check the scalability of the model to very long sequences (which explains why I extended the decoder's max_iters to 1000, just for safety in case of an infinite loop):
ljspeech-alignment-00030
ljspeech-mel-00030

The overall output is acceptable; you can however notice that at some point the model loses the attention and skips some fragments in "add this last". Training might solve the issue of attention for long sequences, but I am also thinking about implementing the attention windowing discussed here, which not only reduces computation (which accelerates the model), but also limits the number of input tokens the model attends to at each decoder step, making the attention task a little easier. In other words, this gives a very rough estimate of the desired alignments, thus bringing the model into the correct range faster. This has been discussed in depth in this speech recognition paper.
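A hedged sketch of that windowing idea (mask the attention energies outside a window around the previous alignment peak before the softmax; the window sizes below are arbitrary placeholders, not anything from the repository):

import numpy as np

def windowed_alignments(energies, prev_alignments, win_back=10, win_front=20):
    # Only encoder steps inside [center - win_back, center + win_front) keep
    # their energies; everything else is masked out before the softmax.
    center = int(np.argmax(prev_alignments))
    lo, hi = max(0, center - win_back), min(len(energies), center + win_front)
    masked = np.full_like(energies, -np.inf)
    masked[lo:hi] = energies[lo:hi]
    exp = np.exp(masked - energies[lo:hi].max())
    return exp / exp.sum()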

Finally, knowing that all presented results are raw outputs of the model and no clipping was done, we can notice that the model is doing very well at predicting when to stop generation. A small detail is the last predicted frame in the mel spectrograms: you can see that it looks a lot like the padding frames in the training mel spectrograms. The trick was to not impute finished sentences, allowing the model to learn when to output "padding-like" frames and thus predict the <stop_token> correctly. Imputing the finished decoder sequences might make the <stop_token> prediction a little more challenging.

Unfortunately, the model has not learned the difference between nouns and verbs or past and present (yet?). There's not much to see in the alignments or mel plots, actually, but you can notice the failure case when listening to the wavs. Whether the model will eventually learn it or not highly depends on the dataset. The same applies to capital vs. small letters.

So, as a conclusion, I just wanted to point out that the repetitive-frame output case (reported by @imdatsolak and @ohleo) is solved once the model knows where to "attend" when generating; the current results are very promising, and a fully trained model should do well.

EDIT:
@a3626a, you said teacher forcing was enabled during training. What about synthesis time? Did you visualize the predicted mels? Did you generate the outputs by feeding previously generated frames back to the decoder? If that's the case, is it possible to try synthesizing new mels with teacher forcing (like the GTA option in my repo)?

what about synthesis time?

  • GTA is not enabled, generated frames are fed back.

did you visualize the predicted mels?

  • No, I haven't, but I listened to the generated samples. 1) A WaveNet fed spectrograms from Tacotron-1 ignores all local conditions, so it sounds like a vanilla WaveNet that doesn't use local conditioning; it babbles. 2) A WaveNet fed spectrograms from the target audio sounds okay.

If that's the case, is it possible to try synthesizing new mels with teacher forcing

  • I think it is worth trying.

Hi @Rayhane-mamah, could you guide me on running the training script in GPU mode? Currently it is using just the CPU and not utilizing the GPUs.

@imdatsolak, after adding the M-AILABS speech corpus support, I noticed some missing wavs despite the presence of their titles inside the csv metadata (en_US version). I thought it might interest you to know that. Running the preprocessing script of (240ccf8) will give you all the missing file names (as a log in the terminal).
Also, if you find the time, I would appreciate it a lot if you could verify that the language codes I am supporting conform to the equivalent folder names in your corpus. Thank you very much in advance.

@a3626a, personally, I would suspect Tacotron-1 to be failing at synthesis time. Visualizing the predicted spectrograms is a great way of debugging this (I added this here). Other useful things we could try are GTA synthesis and some toy Griffin-Lim inversion. If the inverted spectrograms are audible but become babbling when used with WaveNet, then there is definitely some compatibility issue somewhere; otherwise, the spectrogram prediction model itself is having problems. Please keep me informed of any tests you make, in case I can be of any assistance!

@ferintphilipose Hello and thank you for reaching out!
By default, my implementation works on GPU if Tensorflow-gpu is installed correctly and drivers are in the correct version and working. Does your Tensorflow use GPU for other projects and not for this one?

If not, There is the installation tutorial for all supported OS, and you can follow up with this quick tutorial I made with pictures inside a Jupyter Notebook.

If you are 100% sure all your installations and CUDA drivers are on point, and you have installed your Tensorflow-gpu inside a virtual environment, please make sure to activate it when you want to run projects on GPU.

Other than that, I am not really sure what can be the problem, as the project works perfectly on my GPU.. If you however find any more information, feel free to share in case I can be of any assistance.

@Rayhane-mamah, I will check for the audio files as you mentioned. There may be, indeed, missing ones but we will also do additional QA on that.

Regarding language-codes: we always use lang_Country, e.g. ru_RU, uk_UK.

In the language codes list you are using, you would probably need to change these:
es-ES => es_ES
ru-RU => ru_RU
uk-UA => uk_UK
and so on.

@Rayhane-mamah, just pulled the latest commit. When I try to preprocess, I get the following error:
Traceback (most recent call last):
  File "/usr/lib/python3.5/concurrent/futures/process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/iso/Development/Tacotron-2/datasets/preprocessor.py", line 111, in _process_utterance
    assert time_steps >= T * audio.get_hop_size()
AssertionError
I know it has to do with the length and it seems some of the audio-files I'm using are short. But shouldn't the padding solve that automatically?

Thanks

@imdatsolak, oh god, I don't even know what I was thinking while typing that part. Sorry for the typo; I was using the mel dimension (80) instead of the number of mel frames (length). There you go, it should be good now (54593a0).

Also thanks for the language codes rectification, if there are further mistakes, I'll correct them when upcoming languages are released. Do you have a date for the French release?

@Rayhane-mamah, thanks for the bug-fix!

French: We are currently working on the French dataset. Probably within the next 3-4 weeks as our French-speaking resources are quite limited :-). But the data is ready, the text needs to be QA'd and then final QA done. Then it should be online... I'll let you know immediately.
BTW: Is there a big difference between Tunisian and Saudi in Arabic? Excuse my ignorance, but I'm not so well versed myself in Arabic/dialects.

@imdatsolak, Great I'm really excited for the French dataset!

As for Arabic, just like US English and UK English have different accents and sometimes different words, Saudi Arabia's Arabic and Tunisian Arabic also differ (in the locally spoken language). However, there is a "formal" Arabic that is common to all Arab countries and that we all understand. Since you don't have much experience with this language, I'm just gonna say that the Arabic version implemented in apps that have TTS (like "Siri") is indeed the "formal" Arabic.

So making a single "formal" Arabic version should be much less work, and more people can help do it. To conclude, I'm 99% sure that the data you're trying to align and clean is in the formal register, because Arabic is usually written in this common way and read likewise.

@Rayhane-mamah Hi , Thanks a lot for clarifying my query with regard to the GPU issue. I checked it out and found that my CUDNN path was not specified to the activation path of the virtual environment I was using. Now I am able to use the GPU for running the training script. Thanks once again. :)

@Rayhane-mamah Hi, thanks a lot for sharing your work. It will be very helpful for integrating Tacotron and WaveNet. I want to share some of our work on the vocoder (here). I'll do my best on my side and share anything if I get reasonable results.

Hello @twidddj and welcome!

I will make sure to look at your work. Hopefully we can help each other achieve something nice.

@Rayhane-mamah, French is going into QA tomorrow and will be available at latest next week. We have reduced it to 150hrs for now (v0.9). The problem is that the remaining 50+ hours is "Marcel Proust" and "Voltaire". Our QA-People "refuse" to work on Marcel Proust for now :D ... and Voltaire is more work than we anticipated. In any case, over the next few weeks, we will add 1.0;

V0.9 also will be without normalization/transliteration (original text only). But we are working on the transliterated version as well. I thought it might be helpful to have the "raw text" for now for experimenting purposes (and by the time, I can convince our QA-people re Marcel Proust-text, we can add more :DD)

150hrs is awesome for a start! One can start poking around and testing few things.

Hopefully the crew will continue with the remaining 50hrs :)

Awesome work @imdatsolak, really loving this corpus! By the way, en_UK dataset is just perfect, well done!

@Rayhane-mamah, hi. I would like to understand how this experiment runs. So, to clarify my understanding of the training process: during training, is each input trained over 100 times, and at the 100th iteration the prediction is saved as ljspeech-mel-prediction-step-(n*100).npy?

The predicted mel spectrograms have a higher number of frames compared to their corresponding ground truth. Could this be due to the difference in the data processing method I use for deriving the initial ground truth?

The predicted mel spectrograms also seem to have negative values, in contrast to the completely positive-valued ground truth mel spectrograms.

If you could shed some light into this, it would be great. Thanks a lot. :)

Hello @ferintphilipose, thanks for reaching out!

Actually, training batches are mostly random. Using a tensorflow feeder, we pick a set of random samples, create batches based on data length (to minimize padding) and feed the data to the model (all with shuffling).

So, every 100 steps, we have actually trained the model on 100 batches created randomly from the training data. The number of samples in each batch depends on the batch_size parameter (currently set to 64).

I used to save mel plots, alignments and the Griffin-Lim inverted wav of the first item in the batch at training step N, where N is a multiple of 100 (N = k * 100). In the latest commit (0330bd0) I changed N to be a multiple of 500, and only save summaries to tensorboard every 100 steps.

About the ground truth frames, I find it very weird that your predicted mels have a different number of frames from the ground truth wavs, considering the TacoTrainingHelper stops the decoding exactly when the ground truth is finished. But out of curiosity, are you using a different preprocessing than ours?
Or are you actually talking about a difference in the number of frames when doing natural synthesis? That would depend on how well your model learned to output the <stop_token>.

It is natural that the model sometimes outputs negative values, especially if you do not impute finished decoder steps. This is due to the fact that the model makes its predictions using a linear projection layer with no restriction on the outputs, so values can be negative.

Additionally, in our feeder.py we explicitly set the padding frames to -(hparams.max_abs_value + .1), so if you are using a different preprocessing and your lowest mel value is 0, then please set symmetric_mels = False in hparams.py (it is set to True by default); the padding value will then become -0.1.
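As a tiny illustration of that padding convention (a sketch using the values described above, not feeder.py itself):

import numpy as np

def pad_mel_targets(mel_list, max_abs_value=4.0, symmetric=True):
    # Pad every mel target in the batch to the longest length in that batch,
    # using the "padding-like" value described above.
    pad_value = -(max_abs_value + 0.1) if symmetric else -0.1
    max_len = max(m.shape[0] for m in mel_list)
    return np.stack([
        np.pad(m, ((0, max_len - m.shape[0]), (0, 0)), constant_values=pad_value)
        for m in mel_list])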

Hopefully this answered your questions? If there is anything else I can assist you with, please let me know.

@Rayhane-mamah
Awesome work! Just wanted to kindly ask when you'd be able to implement wavenet training and synthesis? :)
I have decent GPU resources available right now, and I'd love to train both the feature prediction model and the wavenet vocoder ...

Hello @MXGray, thanks for reaching out.

I am working on it; it shouldn't take longer than a week. I'll try to take care of it this weekend if I find the opportunity.

In the meantime, you can start by training the feature prediction model and doing GTA synthesis.

Thanks, @Rayhane-mamah!
Yeah, started with the feature prediction model a few days ago. Looking forward to wavenet vocoder training and synthesis. :)

Hello @Rayhane-mamah!,
Thanks for this, very helpful, with some minor issues, I was able to get it up and running on 1 GPU. It looks like it's not working on multiple GPUs, is that correct? Or did I miss something? I tried running on a 4 GPU machine and only 1 was being used. Thanks!

Hello @keremsozugecer, Thanks for reaching out!

Yeah I actually have not thought about adding a multi-GPU support :)
I will add it for the next commit so stay tuned. Keep in mind that I do not have multiple gpus so you'll have to provide feedback to let me know if things are working properly :)

@Rayhane-mamah!, thanks! we would be happy to provide feedback...

Quick notes about (d28cfa9):

  • Added a post-processing network to predict Linear Spectrograms in addition to mel ones. To activate this option, set predict_linear=True in hparams.py. I have not tested it yet, so I'm not sure about performance and results.
  • Fixed the learning rate decay: the new learning rate evolution should look something like this (a rough decay sketch is also given after this list):
    plot
  • Took off lowercasing from the cleaners; it's beneficial for English but might affect foreign languages negatively (I have no idea actually), so keep that in mind!!
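A possible shape for that decay, as a sketch (TF 1.x exponential decay starting at step 0; the initial rate, decay_steps and decay_rate below are placeholders, not the repository's hparams):

import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    learning_rate=1e-3,     # initial learning rate (assumed)
    global_step=global_step,
    decay_steps=50000,      # placeholder value
    decay_rate=0.4,         # placeholder value
    staircase=False)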

Please note that it is essential you restart the preprocessing in order to train a new model!


@Rayhane-mamah hi, thank you so much for sharing your work, and also thanks for the previous comments.
I am very eager to try Tacotron.
Unfortunately, I haven't got decent GPU resources available right now. Is it possible for you to provide a pretrained model?

Hello @atreyas313, thank you for reaching out!

I actually have a few pretrained models ready for upload locally; I just want to make sure I make the optimal one before releasing it to the public. It should take a couple more days, so stay tuned :)

First pretrained model will be trained on Lj speech dataset so I recommend you download the dataset and run the preprocessing to be able to generate GTA samples in case you want to train a Wavenet vocoder after this project.

If you encounter any problems, please let me know!

@Rayhane-mamah hello! I'm training on d28cfa9, but the learning rate becomes 0 and stays there:
screenshot from 2018-04-18 19-17-27

@Rayhane-mamah thanks you! I'm still in training!

@unwritten @Rayhane-mamah
I get alignment by iteration ~8000 with outputs_per_step=1. All other hyperparameters are the same as master except batch size=32, which was required to avoid out-of-memory on a 12GB GPU.
step-6000-align
step-7000-align
step-8000-align
step-9000-align

Hello @jyegerlehner, thanks for your contribution!

I am actually aware of this; I reported it here. I did, however, find out that the model tends to move forward a little faster than the ground truth (when synthesizing naturally, without teacher forcing), which makes the model read sentences faster than the ground truth. I am trying to figure out why this is happening with outputs_per_step=1.

I'll let you know how it goes :)

I am actually aware of this, I reported it here.

Oops, sorry, I missed that.


@Rayhane-mamah thanks for this repo, good to see it keeps getting better and better.

@jyegerlehner good to know you got an alignment at 8K steps, that was fast. A question: was this trained on LJ-Speech? What batch size did you use to get this running on your machine?

I am using the latest update from April 20 and set batch size to 16, so far no alignment after 25K.

@shaunmayberry Yes LJ dataset. My batch size was 32. With batch size 64 I got OOM errors (12GB GPU). Possibly those latest code changes broke something? I'll pull the latest master and go fire up an instance on the other machine starting from ground zero and see if/when I get good alignments.


@jyegerlehner thanks for your reply. I went back to earlier code dating from April 18 and was able to set the batch size to 32 without OOM issues. I now got an alignment at 9K steps.

So possibly something in the latest code from April 20 prevented obtaining an alignment, or I didn't wait long enough; I stopped it after 30K.

By the way, alignments do not seem to get learned with a batch size lower than 32..

I'm at 10K steps and there's no hint of alignments. This is current master with no changes. I suspect something was broken in the last set of changes.

Latest commit works fine, I am currently running it..

Depending on model initial states, alignments might show up a bit late (I once had them around 20k). I will think about adding some seeds for future versions :)

You are right. Alignments showed up in this run around 20K.

@Rayhane-mamah, do we have any data that compares the impact of different reduction factors on voice quality?
Or is a reduction factor of 1 better?

thanks

@Rayhane-mamah, thanks for this repo. I am currently training an r=1 model on a Korean dataset. I'm not sure why the model tends to move forward a little faster than the ground truth, but my model's issue seems to be solved by applying zoneout to the decoder RNN (LSTM layers) at inference time too.

Hello @Ondal90 thanks for reaching out!

Great catch! But doesn't setting zoneout at inference time cause prosody and audio quality to become much worse? Also, it doesn't seem right to use zoneout at inference, considering it's a regularization technique..

It is however nice to know that it is related to zoneout in some way, and will for sure help us improve the model. Thank you very much for this information!

My first thought is that zoneout at inference time causes the decoder to sometimes keep previous hidden and cell states, causing the RNN to make the same prediction a few times consecutively and thus slowing down the speech. That's a personal interpretation; I will have to look into it in depth.

@Rayhane-mamah why use max_abs_value for normalization, and why is max_abs_value 4?

@unwritten, it is mainly to widen the output distribution, which in my opinion gives the model more detail to work with.

Here is a deeper explanation of the reasoning (see the sketch below for the general idea). I'm planning on testing the model with the default normalization too, but hey, I only have one machine..
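For reference, a minimal sketch of the kind of symmetric normalization being discussed; the min_level_db value and the clipping are assumptions for illustration, so check hparams.py and audio.py for the actual settings:

import numpy as np

max_abs_value = 4.0     # assumed to match hparams.max_abs_value
min_level_db = -100.0   # assumed floor of the dB-scale mel values

def normalize(S_db):
	# Map dB-scale mels from [min_level_db, 0] to [-max_abs_value, max_abs_value]
	# instead of [0, 1], which widens the distribution the decoder regresses to.
	return np.clip(
		2 * max_abs_value * ((S_db - min_level_db) / (-min_level_db)) - max_abs_value,
		-max_abs_value, max_abs_value)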

@Rayhane-mamah
The author of the ZONEOUT paper released his code:
ZONEOUT: REGULARIZING RNNS BY RANDOMLY PRESERVING HIDDEN ACTIVATIONS
https://github.com/teganmaharaj/zoneout/blob/master/zoneout_tensorflow.py

In this code, he uses the previous state at inference time.

# Inference time
new_state = state_part_zoneout_prob * state_part + (1 - state_part_zoneout_prob) * new_state_part

I wonder why zoneout_LSTM.py uses c = c_temp, h = h_temp at inference time.
I think this code needs to be changed like this:

# Inference time
h = h_temp * (1 - self.zoneout_factor_output) + h_prev * self.zoneout_factor_output
c = c_temp * (1 - self.zoneout_factor_cell) + c_prev * self.zoneout_factor_cell

In my test, it works well... Please let me know if I'm wrong.
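To make this concrete, here is a minimal sketch of zoneout applied to a single state tensor, covering both the training branch and the inference-time expectation above (function and variable names are illustrative, not the repo's exact code):

import tensorflow as tf

def zoneout(prev_state, new_state, zoneout_prob, is_training):
	if is_training:
		# tf.nn.dropout scales kept values by 1/keep_prob, so the outer
		# (1 - zoneout_prob) factor cancels that scaling and leaves a plain
		# binary mix of previous and new state values.
		diff = tf.nn.dropout(new_state - prev_state, keep_prob=1.0 - zoneout_prob)
		return prev_state + (1.0 - zoneout_prob) * diff
	# Inference: use the expectation of the random mask instead of sampling it.
	return zoneout_prob * prev_state + (1.0 - zoneout_prob) * new_state

Applied to both c and h with their respective zoneout factors, the inference branch matches the two lines proposed above.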

@Ondal90 Interesting findings. Does your modified code work better than the current version? And after changing the code, do we need to re-train the model?

wavfile_modified.zip

@candlewill
In my experiment, the modified code eliminates the problem of reading sentences faster than ground truth.
No, it only changes inference time, so you can use existing trained models.

@Ondal90 Thanks. It really solved the speech speed problem when using the mel spectrogram. However, the waveform synthesized from the linear spectrogram is still very fast.

@candlewill
Congratulations. I don't know, since I only used the Mel spectrogram.
If you use WaveNet, it doesn't matter. However, I wonder why not..

@Ondal90, Once again, thanks a lot for these contributions. I suppose I somehow managed to misunderstand the paper :)

Your changes will be applied in a future repo version, thank you very much for your help!

@candlewill, don't forget to make the appropriate changes in the post-processing network as well. (We use the same architecture as the encoder for the post-processing net, so if my zoneout implementation is wrong, it should affect the linear outputs as well..) Let me know how it goes ;)

@Ondal90 and all,

So do you think the training time binary mask scheme that this project uses:

h = binary_mask_output * h_prev + binary_mask_output_complement * h_temp

is equivalent to the ZONEOUT author's code:

new_state = (1 - state_part_zoneout_prob) * tf.python.nn_ops.dropout( new_state_part - state_part, (1 - state_part_zoneout_prob), seed=self._seed) + state_part

To me at first glance it looks like there's at least a difference of a factor (1 - state_part_zoneout_prob). Though still trying to wrap my head around what each is doing exactly.

New samples (Chinese) based on the code @Ondal90 mentioned are here: https://goo.gl/YVDBdX

I find that if outputs_per_step < 3, it's hard to learn alignment. @candlewill is your outputs_per_step >= 3?

@neverjoe No. I use the default value outputs_per_step = 1.

@jyegerlehner, considering TensorFlow's dropout, zoneout in the author's code and in this project are exactly the same during training (dropout scales the kept inputs by 1/keep_prob, which the outer (1 - state_part_zoneout_prob) factor cancels out). The other "difference" in the author's code is that instead of making a mask and its complement, they use their tricky mathematical formulation, which gives the same results.

I believe the main thing @Ondal90 wanted to bring attention to is what is referred to by "As in dropout, we use the expectation of the random noise at test time" in the Zoneout paper. I actually never paid attention to that detail..

Oh another "mistake"(?) I was also doing, is that zoned out states are only meant to be propagated internally to the next state of RNN, in this project I am also using it as the RNNCell output.. with these in mind I will correct the zoneout.

Great work @Ondal90, thanks for sharing!
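To sanity-check that equivalence, here is a small NumPy sketch that applies both formulations with the same Bernoulli mask (the 1/keep_prob scaling of dropout is written out by hand):

import numpy as np

np.random.seed(0)
zoneout_prob = 0.1
h_prev = np.random.randn(5)
h_temp = np.random.randn(5)

# Binary-mask formulation: 1 means "keep the previous state unit".
mask = (np.random.rand(5) < zoneout_prob).astype(np.float64)
h_mask = mask * h_prev + (1 - mask) * h_temp

# Dropout-based formulation from the Zoneout author's code, driven by the same
# mask, with dropout's 1/keep_prob scaling made explicit (keep_prob = 1 - zoneout_prob).
dropped = (1 - mask) * (h_temp - h_prev) / (1 - zoneout_prob)
h_dropout = (1 - zoneout_prob) * dropped + h_prev

print(np.allclose(h_mask, h_dropout))  # True: both give the same zoned-out state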

@Rayhane-mamah

in your code:
the bw_cell and fw_cell passed to bidirectional_dynamic_rnn are the same self._cell?
Shouldn't there be 2 cells: a bw cell and a fw cell?

class EncoderRNN:
	.....
	#Create LSTM Cell
	self._cell = ZoneoutLSTMCell(size, is_training,
		zoneout_factor_cell=zoneout,
		zoneout_factor_output=zoneout)

	def __call__(self, inputs, input_lengths):
		with tf.variable_scope(self.scope):
			outputs, (fw_state, bw_state) = tf.nn.bidirectional_dynamic_rnn(
				self._cell,
				self._cell,
				inputs,
				sequence_length=input_lengths,
				dtype=tf.float32)

			return tf.concat(outputs, axis=2) # Concat and return forward + backward outputs

@unwritten I think they should be different cells.

@unwritten I also noticed that a few days ago; separating the cells didn't make any noticeable difference in memory, speed, or even training loss. Training a model with the original code and then separating the cells when loading the saved checkpoint makes TensorFlow raise a missing-parameters error. I am wondering how a single cell managed to build a bidirectional representation of the inputs up until now.. Actually, we are using two cells, but they share parameters.. If someone has seen something similar somewhere, I would love to know the explanation!

In any case, creating two different cells is usually how we perform a bidirectional reading of a sequence; Bahdanau also mentioned in his attention paper that the cells in both directions should be independent. So please make sure to create a _fw_cell and a _bw_cell separately, as in the sketch below. Thank you for your remark! :)
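A minimal sketch of the separated cells, reusing the ZoneoutLSTMCell constructor shown above (the constructor defaults, scope name, and __init__ signature are assumptions for illustration, not the repo's exact code):

import tensorflow as tf
# Assumes ZoneoutLSTMCell from the repo's zoneout_LSTM.py is in scope.

class EncoderRNN:
	def __init__(self, is_training, size=256, zoneout=0.1, scope='encoder_LSTM'):
		self.scope = scope
		# Two independent cells so the forward and backward directions
		# do not share parameters.
		self._fw_cell = ZoneoutLSTMCell(size, is_training,
			zoneout_factor_cell=zoneout,
			zoneout_factor_output=zoneout)
		self._bw_cell = ZoneoutLSTMCell(size, is_training,
			zoneout_factor_cell=zoneout,
			zoneout_factor_output=zoneout)

	def __call__(self, inputs, input_lengths):
		with tf.variable_scope(self.scope):
			outputs, (fw_state, bw_state) = tf.nn.bidirectional_dynamic_rnn(
				self._fw_cell,
				self._bw_cell,
				inputs,
				sequence_length=input_lengths,
				dtype=tf.float32)
			return tf.concat(outputs, axis=2)  # Concat forward + backward outputs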

Hi @Rayhane-mamah ,
I am conditioning my WaveNet on log mel values computed as follows:
import numpy as np
import scipy.signal
import librosa

def get_spectrograms(sound_file):
	y, sr = librosa.load(sound_file, sr=16000)
	stft = np.abs(librosa.stft(y, n_fft=2048, hop_length=200, win_length=800,
		window=scipy.signal.hanning, center=True))**2
	mel = librosa.feature.melspectrogram(S=stft, sr=16000, n_fft=2048, n_mels=80, fmin=125, fmax=7600)
	mel = np.log10(mel.T + 1)
	mel = mel.T.astype(np.float32)  # (T, n_mels)
	return mel
The data set I am using is from VCTK corpus.

The trained model works well enough during evaluation with the mel spectrogram computed in this way.
However, it does not work with the mel spectrogram computed as per the method used in your data processing script.
I had assumed this was due to normalization, but even after de-normalizing the mel values it doesn't work for the conditioning.
It would be really great if you could guide me on how to alter the data pre-processing part of your script so that the predicted log-mel values match those obtained with the above method. Thanks.

Hi @Rayhane-mamah,
After the model training is completed, I synthesize the same sentence multiple times using this model, but the result of each synthesis is not exactly the same. Although the synthesized speech sounds similar, there are some differences in the waveform.
Do you know why this happens? The Griffin-Lim algorithm may be one reason. Anything else?
Thanks!

@ferintphilipose Hi, from what I understood, I believe you are looking for a feature interpolation like the one used by r9y9? I also noticed that other parameters (fft_size, hop_size, etc.) are different in your preprocessing, which can also cause problems.

Wavenet cannot recreate speech correctly if it is trained on mels of one scale and tested on mels with a different distribution. My advice? Train the Tacotron on the same mels you used to train the WaveNet; things should go smoother.

@HallidayReadyOne hello!
The variation in synthesis is due to the use of pre-net dropout even at inference time. It can be modified within these lines of code.
I am confident that the following line in the T2 paper refers to using dropout at inference time:
"In order to introduce output variation at inference time, dropout with probability 0.5 is applied only to layers in the pre-net of the autoregressive decoder."
Let's say it's an "extra" that gives the model some sense of "creativity". A minimal sketch of the idea is below.
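The sketch is only an illustration of keeping dropout active at synthesis time; the layer sizes and names are assumptions, not the repo's exact code.

import tensorflow as tf

def prenet(inputs, layer_sizes=(256, 256), drop_rate=0.5, scope='prenet'):
	x = inputs
	with tf.variable_scope(scope):
		for i, size in enumerate(layer_sizes):
			x = tf.layers.dense(x, units=size, activation=tf.nn.relu,
				name='dense_{}'.format(i + 1))
			# training=True on purpose: dropout stays active at synthesis time,
			# which is what introduces the output variation between runs.
			x = tf.layers.dropout(x, rate=drop_rate, training=True,
				name='dropout_{}'.format(i + 1))
	return x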

@Rayhane-mamah, first and foremost, thanks for your input. I was looking for a way to train the Tacotron using log mel values computed via my method, but I am a bit confused about how to run your Tacotron script with them. For instance, when I tried to turn off silence trimming, audio re-scaling and signal normalization while preprocessing the data, and then used the computed dataset, I ran into an exploding-loss error. It would be great if you could shed some insight on this problem. Thanks once again.

@Rayhane-mamah Thank you! My problem was that I ignored this detail:

def _griffin_lim(S):
	...
	angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
	...

Could this be a reason too?

Hello again @ferintphilipose, sorry for the late reply. I personally recommend that you use our preprocessing and only change the parameters in hparams.py.
If however you want to use your own preprocessing, keep in mind that removing signal normalization will cause a loss explosion in this repository, because I set the maximum loss value to 100, which can be reached when the data is not normalized (mel values will range from -100 to 20, so the squared error becomes much bigger).

Audio re-scaling is mainly for Wavenet; I believe it is necessary to keep it so that the wavs stay in [-1, 1].

@HallidayReadyOne, sorry for the late reply. I believe that does not affect the variation in the output generation: since Griffin-Lim is an iterative algorithm that converges after a few iterations (60 in our case), the initial value will probably have no great effect. It's a little like gradient descent, where you pick random initial values but, when the function is convex, you usually converge to the same minimum. A minimal sketch of the iteration is below.
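For illustration only (the FFT parameters are assumptions, not necessarily the repo's hparams): the random initial phase is replaced on every iteration by the STFT phase of the reconstructed signal, so the starting point is largely forgotten after ~60 iterations.

import numpy as np
import librosa

def griffin_lim(S_mag, n_iter=60, n_fft=2048, hop_length=200, win_length=800):
	# Start from a random phase estimate.
	angles = np.exp(2j * np.pi * np.random.rand(*S_mag.shape))
	y = librosa.istft(S_mag * angles, hop_length=hop_length, win_length=win_length)
	for _ in range(n_iter):
		# Re-estimate the phase from the current reconstruction, keep the target magnitude.
		angles = np.exp(1j * np.angle(librosa.stft(y, n_fft=n_fft,
			hop_length=hop_length, win_length=win_length)))
		y = librosa.istft(S_mag * angles, hop_length=hop_length, win_length=win_length)
	return y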

Finally, @ferintphilipose, wavenet is coming in 10 mins, maybe seeing the entire Tacotron-2 project in one piece can help you solve your issue. If you need anything else, please let me know!