Model Release: Tacotron2 with Discrete Graves Attention - LJSpeech

erogol opened this issue 4 years ago · comments

Model Link: https://drive.google.com/drive/folders/12Ct0ztVWHpL7SrEbUammGMmDopOKL9X_?usp=sharing

This model is trained with discrete Graves attention and a BatchNorm prenet. It produces good samples with robust attention alignment, without any inference-time tricks. You can even hear breathing effects between pauses with this model.

You can also use this TTS model with the PWGAN or WaveRNN vocoders. PWGAN provides real-time voice synthesis; WaveRNN is slower but provides better quality.

https://github.com/erogol/ParallelWaveGAN
https://github.com/erogol/WaveRNN

(Ignore the small jiggle on the figures caused by TB)

Markus Toman · Answer 1 · Fri Feb 07 2020 00:37:54 GMT+0800 (China Standard Time)

Cool, overall do you prefer the forward attention model over this one?

el-tocino · Answer 2 · Fri Feb 07 2020 12:34:28 GMT+0800 (China Standard Time)

He mentioned "It is the best model so far trained." on the forward model post.

Markus Toman · Answer 3 · Fri Feb 07 2020 13:32:50 GMT+0800 (China Standard Time)

Yes, perhaps I should be more specific. I assume this might be more because of the specific training regimen (switching to batch norm, training longer...) and handholding and not necessarily because of the attention mechanism itself.

For example, in practice I got many more good WaveRNN 10-bit mu-law models, although overall I think MoL leads to better results.

But I assume that cannot really be answered before lots of experiments with different datasets etc.

Also, the "more natural-sounding" seemed to be a comparison to the forward attention model.

Eren Gölge · Answer 4 · Fri Feb 07 2020 19:14:06 GMT+0800 (China Standard Time)

My two cents are

Graves attention is easier to train on different datasets, and it sounds more natural, with better prosody.
Forward attention leads to a more robust attention alignment and is easier to integrate with a PWGAN trained on ground-truth spectrograms.

Disclaimer: I was about to release the graves model but then I removed the whole model by mistake.
Now retraining it :)

vcjob · Answer 5 · Mon Feb 10 2020 16:32:14 GMT+0800 (China Standard Time)

@erogol, why do you prefer PWGAN over MelGAN? MelGAN is faster, while the quality seems fine. By the way, https://github.com/kan-bayashi/ParallelWaveGAN now provides MelGAN as well. Any plans to try it and adapt it for TTS?
Also, the official paper states that PWGAN's MOS is considerably higher than WaveNet's MOS. Is it hard to get similar results, or did the authors (https://arxiv.org/pdf/1910.11480.pdf) prettify the numbers a little?

Markus Toman · Answer 6 · Mon Feb 10 2020 17:01:36 GMT+0800 (China Standard Time)

@vcjob Interesting. I find that even the official PWGAN samples of merely vocoded recordings already exhibit some artefacts. r9y9's taco-wavenet (MoL) samples definitely sound better.
Also, WaveRNN gave me better results than MelGAN, although with LJ they are pretty similar. On other speakers WaveRNN is definitely better, and it gives better results than those on the official MelGAN demo page.

I think the difference in the PWGAN paper is just because they used the espnet Gaussian WaveNet. I tried all their models and they are definitely not as good as r9y9's WaveNet.
No wonder considering how much effort went into that over the years...and of course, it's ultra-slow.

Also interesting how more or less nobody uses the original wavernn formulation. Even the amazon papers use a simple GRU followed by FCs predicting quantized output via softmax.

Well, in the end they're all annoying for some different reason ;)

EDIT: I just realized the main author of PWGAN is r9y9. It is even stranger that he didn't use his own WaveNet implementation for the comparison.

Eren Gölge · Answer 7 · Mon Feb 10 2020 19:27:49 GMT+0800 (China Standard Time)

@vcjob PWGAN is easier to adapt to TTS and the model is smaller. I am now also training a MelGAN-type generator, as the official repo suggests. But it'd be nice to try the original MelGAN with TTS if you are interested.

A paper is a paper :).

Had · Answer 8 · Mon Feb 10 2020 19:58:15 GMT+0800 (China Standard Time)

I'm trying your implementation of Graves attention with my fork of Nvidia tacotron2.
But sooner or later I get a gradient explosion. Could you advise how to deal with it?

Eren Gölge · Answer 9 · Tue Feb 11 2020 00:48:21 GMT+0800 (China Standard Time)

@hadaev8 is it the latest implementation?

Had · Answer 10 · Tue Feb 11 2020 04:38:46 GMT+0800 (China Standard Time)

@erogol
Latest from master.

Eren Gölge · Answer 11 · Tue Feb 11 2020 07:12:14 GMT+0800 (China Standard Time)

@hadaev8 try the one in dev branch

Had · Answer 12 · Thu Feb 20 2020 23:27:15 GMT+0800 (China Standard Time)

@erogol
This one works fine, but why is the max attention value 0.5?

Eren Gölge · Answer 13 · Fri Feb 21 2020 00:46:07 GMT+0800 (China Standard Time)

Because you are normalizing it. Actually, I guess this reduces the quality at inference time. If you have a solution for this, I'd like to know.

Had · Answer 14 · Fri Feb 21 2020 05:06:07 GMT+0800 (China Standard Time)

@erogol
Could you point to the exact line with the normalisation? I'm a bit lost in the math.

Eren Gölge · Answer 15 · Tue Feb 25 2020 19:30:55 GMT+0800 (China Standard Time)

@hadaev8 it is not an explicit normalization.

Since the values are bounded in [0, 1] even without discretization, they stay in the same range with discretization. And because we subtract between time steps, the effective range shrinks toward zero. In our case it is [0, ~0.4]. So we could look for a trick to expand this range.
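
To make the range argument concrete, here is a minimal pure-Python sketch. It is not the Mozilla TTS code: it assumes a plain logistic sigmoid as the per-component score, the `j - mu` orientation discussed later in this thread, and bin edges at `j + 0.5` as in the TensorFlow port posted below. The function names and the example parameters are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mixture_cdf(edge, mus, sigmas, weights):
    # Weighted sum of per-component sigmoids evaluated at one bin edge.
    return sum(w * sigmoid((edge - m) / s)
               for w, m, s in zip(weights, mus, sigmas))


def discretized_alignment(mus, sigmas, weights, seq_len):
    # Bin edges at 0.5, 1.5, ..., seq_len + 0.5 (seq_len + 1 edges).
    edges = [j + 0.5 for j in range(seq_len + 1)]
    cdf = [mixture_cdf(e, mus, sigmas, weights) for e in edges]
    # Each attention weight is the difference between adjacent edge values.
    return [hi - lo for lo, hi in zip(cdf[:-1], cdf[1:])]


# Hypothetical mixture with two components whose weights sum to 1.
alpha = discretized_alignment(mus=[3.0, 4.0], sigmas=[1.0, 2.0],
                              weights=[0.7, 0.3], seq_len=10)
print(max(alpha), sum(alpha))
```

Because the edge values lie in [0, 1] and increase with the position, every difference is non-negative and the whole row sums to at most 1. A single weight only approaches 1 when the scale is much smaller than one bin, which is one way to read the reduced range described above.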

Eren Gölge · Answer 16 · Tue Feb 25 2020 19:35:31 GMT+0800 (China Standard Time)

I finally released the model, with a couple of changes. This model uses a Batch Norm prenet from the beginning.

Eren Gölge · Answer 17 · Fri Feb 28 2020 20:11:46 GMT+0800 (China Standard Time)

One interesting problem with Graves attention is that, after the model converges, only one of the attention heads is actively used and it suppresses the other heads. This indicates that using only one head would also work fine, with a faster run-time.

Alternatively, dropout might be used to randomize the behavior of the heads during training, on the assumption that the other heads would then be learned as well.

Shikhar Dev Gupta · Answer 18 · Tue Apr 21 2020 06:56:31 GMT+0800 (China Standard Time)

Awesome work!
I was curious about one thing though. In your implementation of Graves GMM attention, is there a typo at line 179? https://github.com/mozilla/TTS/blob/dev/layers/common_layers.py#L179

Following https://arxiv.org/pdf/1906.01083.pdf , shouldn't it be
phi_t = g_t.unsqueeze(-1) * torch.sigmoid((mu_t.unsqueeze(-1) - j) / sig_t.unsqueeze(-1))

Eren Gölge · Answer 19 · Thu Apr 23 2020 06:13:38 GMT+0800 (China Standard Time)

That is actually true. Yet it worked? Thanks for the catch. I'll fix it and try again.

Shikhar Dev Gupta · Answer 20 · Thu Apr 23 2020 07:32:28 GMT+0800 (China Standard Time)

@erogol An unexpected but welcome surprise!
I've been trying to port your implementation to TensorFlow for my code, and for some reason the attention values very quickly die to values close to 0. Any suggestions on where I should look for the issue?

```
import numpy as np
import tensorflow as tf  # TF 1.x API (tf.layers, variable scopes)
from tensorflow.python.ops import variable_scope

# Two methods from the poster's attention-mechanism class. `num_units`, `dtype`,
# `self.seq_len`, `self._alignments_size`, `self._batch_size` and
# `_maybe_mask_score` are not defined in the posted snippet.

def __init__(self, memory_sequence_length=None, training=True, name="GravesAttention"):
    self.training = training
    with tf.name_scope(name, 'GmmAttentionMechanismInit'):
        self._mask_value = 1e-8
        self.maybe_mask_score = lambda x: _maybe_mask_score(x, memory_sequence_length, self._mask_value)
    # Number of gaussians in the mixture
    self.K = 5
    self.eps = 1e-5

    # Bias layout matches the (g, b, k) split in __call__: zeros, tens, ones
    bias_init = tf.constant_initializer(np.hstack([np.zeros(self.K), np.full(self.K, 10), np.ones(self.K)]))
    layer1 = tf.layers.Dense(units=num_units, activation="relu", name="graves_attention_denselayer1", trainable=True, dtype=dtype)
    layer2 = tf.layers.Dense(units=3 * self.K, bias_initializer=bias_init, name="graves_attention_denselayer2", trainable=True, dtype=dtype)
    self.dense_layer = lambda x: layer2(layer1(x))

    # Bin edges 0.5, 1.5, ... used for the discretization
    self.J = tf.cast(tf.range(self.seq_len + 2), dtype=tf.float32) + 0.5

def __call__(self, query, state):
    seq_length = self._alignments_size
    mu_prev = state
    with variable_scope.variable_scope(None, "graves_attention", [query]):
        j = tf.slice(self.J, [0], [seq_length + 1])

        gbk_t = self.dense_layer(query)
        g_t, b_t, k_t = tf.split(gbk_t, num_or_size_splits=3, axis=1)

        # Means only move forward; scales stay strictly positive
        mu_t = mu_prev + tf.math.softplus(k_t)
        sig_t = tf.math.softplus(b_t) + self.eps

        g_t = tf.layers.dropout(g_t, rate=0.5, training=self.training)
        g_t = tf.nn.softmax(g_t, axis=1) + self.eps

        x = (j - tf.expand_dims(mu_t, -1)) / tf.expand_dims(sig_t, -1)
        phi_t = tf.expand_dims(g_t, -1) * tf.nn.sigmoid(x)

        # Sum over the mixture components
        alpha_t = tf.reduce_sum(phi_t, 1)

        # discretize
        a = tf.slice(alpha_t, [0, 1], [self._batch_size, seq_length])
        b = tf.slice(alpha_t, [0, 0], [self._batch_size, seq_length])
        alpha_t = a - b

        alpha_t = self.maybe_mask_score(alpha_t)

    next_state = mu_t
    return alpha_t, next_state
```

Eren Gölge · Answer 21 · Sat Apr 25 2020 00:32:23 GMT+0800 (China Standard Time)

Not sure. Maybe you can try the broken version as in my code.

Eren Gölge · Answer 22 · Sat Apr 25 2020 00:58:27 GMT+0800 (China Standard Time)

If I use your version, the attention weights come out negative. It is weird.

Shikhar Dev Gupta · Answer 23 · Sun Apr 26 2020 01:15:22 GMT+0800 (China Standard Time)

I think I know what's happening. Your earlier implementation used a distribution that was monotonically decreasing, but your (mu_t - j) was flipped (possibly because you thought you were using exp instead of sigmoid), so it worked out just fine.
So, just change mu_t - j to j - mu_t, and your values should be positive again.
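
The sign effect described here can be checked with a small pure-Python sketch. It assumes a single component with a plain logistic sigmoid and bin edges at `j + 0.5`; `bin_weights` and the values of `mu` and `sigma` are illustrative and not taken from either implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bin_weights(score, seq_len):
    # Evaluate the score at the bin edges and subtract neighbours,
    # i.e. the "discretize" step: value at the next edge minus this edge.
    edges = [j + 0.5 for j in range(seq_len + 1)]
    values = [score(e) for e in edges]
    return [hi - lo for lo, hi in zip(values[:-1], values[1:])]


mu, sigma = 4.0, 1.0  # illustrative values only

# sigmoid((mu - j) / sigma) decreases as j grows, so differences are negative.
flipped = bin_weights(lambda j: sigmoid((mu - j) / sigma), seq_len=8)

# sigmoid((j - mu) / sigma) increases as j grows, so differences are positive.
corrected = bin_weights(lambda j: sigmoid((j - mu) / sigma), seq_len=8)

print(flipped)
print(corrected)
```

Since sigmoid(-x) = 1 - sigmoid(x), the two results are exact negatives of each other, so flipping the argument changes only the sign of the weights and not their magnitudes.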

Eren Gölge · Answer 24 · Sun Apr 26 2020 09:16:44 GMT+0800 (China Standard Time)

yeah that's a great return. I totally missed that.

Eren Gölge · Answer 25 · Tue Apr 28 2020 16:39:43 GMT+0800 (China Standard Time)

@Shikherneo2 I changed the implementation as you said and ran into the same problem. After 10K iterations all the alignments turn to zero.

Shikhar Dev Gupta · Answer 26 · Wed Apr 29 2020 01:05:37 GMT+0800 (China Standard Time)

@erogol That is very weird. I have tried a bunch of small tweaks, and the values always quickly go to zero. In my case they even go to zero with your earlier implementation.

Eren Gölge · Answer 27 · Wed Apr 29 2020 16:50:25 GMT+0800 (China Standard Time)

In my case, the network goes to zero sometimes after 10K and sometimes after 60K iterations. I checked the layer statistics throughout training but could not see anything that explains it.

Eren Gölge · Answer 28 · Wed Apr 29 2020 18:37:17 GMT+0800 (China Standard Time)

It is interesting. The function I used previously is a reverse sigmoid with a squashed range around 2/3. So mathematically it makes no sense but it worked.
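
A quick numeric check of that description, as a sketch only. It assumes the earlier function is the expression quoted further down in this thread, `1 / (1 + sigmoid(x))` with `x = (mu_t - j) / sig_t`; the name `earlier_score` is made up here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def earlier_score(x):
    # Assumed form of the earlier expression: 1 / (1 + sigmoid(x)).
    return 1.0 / (1.0 + sigmoid(x))


xs = [-20.0, -5.0, 0.0, 5.0, 20.0]
values = [earlier_score(x) for x in xs]
print(values)
```

Under that assumption the score stays between 0.5 and 1, equals 2/3 at x = 0, and decreases in x. With x = (mu_t - j) / sig_t it therefore increases with j, so the discretized differences are positive but their sum cannot exceed 0.5, which may be related to the maximum attention value of 0.5 reported earlier in the thread.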

Yunchao He · Answer 29 · Tue May 26 2020 20:43:10 GMT+0800 (China Standard Time)

What's the benefit of discretizing the attention weights? Why not use the original version directly?

Eren Gölge · Answer 30 · Thu May 28 2020 15:46:51 GMT+0800 (China Standard Time)

It mathematically makes more sense to me and it works better.

WhiteFu · Answer 31 · Fri Aug 28 2020 21:45:59 GMT+0800 (China Standard Time)

@erogol That is very weird. I have tried a bunch of small tweaks, and the values always quickly go to zero. In my case they even go to zero with your earlier implementation.
Did you manage to solve this? I have run into the same problem.

Shikhar Dev Gupta · Answer 32 · Fri Aug 28 2020 22:34:09 GMT+0800 (China Standard Time)

@WhiteFu No. I wasn't able to. When I looked at the statistics, I realized that the encoder gradients were going to zero after a few thousand iterations. So I added a highway network (like in Tacotron-1), which stabilized the training. But the weights still all go to zero.

WhiteFu · Answer 33 · Wed Sep 02 2020 14:11:22 GMT+0800 (China Standard Time)

@Shikherneo2 this is weird, I will follow up and let you know if there is any progress!

Eren Gölge · Answer 34 · Mon Sep 07 2020 17:20:14 GMT+0800 (China Standard Time)

Should I reopen the issue if anyone is working on it?

Liujingxiu23 · Answer 35 · Tue Sep 22 2020 09:54:51 GMT+0800 (China Standard Time)

@erogol
I am confused about the Graves attention.
The Graves attention code just uses sigmoid instead of exp as in the paper, right?

phi_t = g_t.unsqueeze(-1) * (1 / (1 + torch.sigmoid((mu_t.unsqueeze(-1) - j) / sig_t.unsqueeze(-1))))

This version is a good one, right? I tried Graves attention in my own TTS work (only adding a while loop to process all time steps) but alignment failed. I am trying to figure out the problem.
I used K=8. Is that too large?
The mask in "alpha_t.data.masked_fill_" should be like [False False False .... True True], to mask the padding, right?

LeoniusChen · Answer 36 · Tue Mar 30 2021 11:36:35 GMT+0800 (China Standard Time)

In Mozilla/TTS, Graves attention is discrete. You can now use the code in this repo to implement DCA or GMM attention.