mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)

Model Release: Tacotron2 with Discrete Graves Attention - LJSpeech

erogol opened this issue · comments

Model Link: https://drive.google.com/drive/folders/12Ct0ztVWHpL7SrEbUammGMmDopOKL9X_?usp=sharing

This model is trained with Discrete Graves attention and a BatchNorm prenet. It produces good examples with robust attention alignment, without any inference-time tricks. You can even hear breathing effects in the pauses.

You can also use this TTS model with the PWGAN or WaveRNN vocoders. PWGAN provides real-time voice synthesis; WaveRNN is slower but provides better quality.

https://github.com/erogol/ParallelWaveGAN
https://github.com/erogol/WaveRNN

(Ignore the small jiggle in the figures; it is caused by TensorBoard.)
image

image

Cool, overall do you prefer the forward attention model over this one?

He mentioned "It is the best model so far trained." on the forward model post.

Yes, perhaps I should be more specific. I assume this might be more because of the specific training regimen (switching to batch norm, training longer...) and handholding and not necessarily because of the attention mechanism itself.

For example, in practice I got many more good WaveRNN 10-bit mu-law models, although overall I think MoL leads to better results.

But I assume that can not really be answered before lots of experiments with different datasets etc.

Also, the "more natural-sounding" seemed to be a comparison to the forward attention model.

My two cents are

- Graves attention is easier to train on different datasets and sounds more natural, with better prosody.
- Forward attention leads to a more robust attention alignment and is easier to integrate with a PWGAN trained on ground-truth spectrograms.

Disclaimer: I was about to release the graves model but then I removed the whole model by mistake.
Now retraining it :)

@erogol, why do you prefer PWGAN over MelGAN? MelGAN is faster, and the quality seems fine. By the way, https://github.com/kan-bayashi/ParallelWaveGAN now provides MelGAN as well. Any plans to try it and adapt it for TTS?
Also, the official paper states that PWGAN's MOS is considerably higher than WaveNet's MOS. Is it hard to get similar results, or did the authors (https://arxiv.org/pdf/1910.11480.pdf) prettify it a little bit?

@vcjob interesting, I find even the official PWGAN samples of just vocoded recordings already exhibit some artefacts. r9y9's taco-wavenet (MoL) samples definitely sound better.
WaveRNN also gave me better results than MelGAN, although with LJ they are pretty similar. On other speakers WaveRNN is definitely better, and my results are better than those on the official MelGAN demo page.

I think the difference in the PWGAN paper is just because they used the espnet Gaussian WaveNet. I tried all their models and they are definitely not as good as r9y9's WaveNet.
No wonder considering how much effort went into that over the years...and of course, it's ultra-slow.

Also interesting how more or less nobody uses the original wavernn formulation. Even the amazon papers use a simple GRU followed by FCs predicting quantized output via softmax.

Well, in the end they're all annoying for some different reason ;)

EDIT: just realized the main author of PWGAN is r9y9. Even stranger he didn't use his own Wavenet implementation for comparison

@vcjob PWGAN is easier to adapt to TTS and the model is smaller. I am now also training a MelGAN-type generator, as the official repo suggested. But it would be nice to try the original MelGAN with TTS if you are interested.

A paper is a paper :).

I'm trying your implementation of Graves attention with my fork of Nvidia's tacotron2.
But sooner or later I get a gradient explosion. Could you advise how to deal with it?

@hadaev8 is it the latest implementation?

@erogol
Latest from master.

@hadaev8 try the one in dev branch

@erogol
This one works fine, but why is the max attention value 0.5?

Because you are normalizing it. Actually, I guess this reduces the quality at inference time. If you have a solution for this, I'd like to know.

@erogol
Could you point to the exact line with the normalisation? I'm a bit lost in the math.

@hadaev8 it is not an explicit normalization.

Since the values are bounded in [0, 1] even without discretization, they are bounded in the same range with discretization. And because we subtract between time steps, the effective range shrinks toward zero. In our case it is [0, ~0.4]. So we could look for a trick to expand this range.
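
The shrinking range can be checked numerically. The following is a minimal sketch in plain Python, not the Mozilla TTS code: it assumes a single mixture component with weight 1 and a logistic CDF evaluated at integer boundaries. The names `discretized_attention`, `mu`, and `sigma`, and the numeric values, are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def discretized_attention(mu, sigma, num_steps):
    """Attention over `num_steps` encoder positions from one logistic component.

    Each weight is the CDF difference between the right and left edge of a
    position, i.e. the subtraction between neighbouring steps.
    """
    cdf = [sigmoid((j - mu) / sigma) for j in range(num_steps + 1)]
    return [cdf[j + 1] - cdf[j] for j in range(num_steps)]

# Hypothetical values chosen only for illustration.
alpha = discretized_attention(mu=5.5, sigma=1.0, num_steps=12)
peak = max(alpha)   # largest single attention weight
total = sum(alpha)  # telescopes to cdf[-1] - cdf[0]
```

With these values the peak equals tanh(0.25), roughly 0.245, even though every CDF value lies in [0, 1]. A smaller `sigma` pushes the peak toward 1, and a larger one flattens it further.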

I finally released the model, with a couple of changes. This model uses a BatchNorm prenet from the beginning.

One interesting problem with Graves attention is that, after the model converges, only one of the attention heads is actively used and the other heads are suppressed. This indicates that using only one head would also work fine, with a faster run-time.

Alternatively, dropout might be used to randomize the behavior of the heads during training, on the assumption that the model would then learn to use the other heads.
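
One way to read the dropout idea is to apply it to the mixture logits before the softmax, which is what the TensorFlow port posted later in this thread does. The sketch below is plain Python under that assumption; `mixture_weights`, the drop rate, and the example logits are hypothetical and are not taken from the Mozilla TTS code.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mixture_weights(logits, drop_rate, training, rng):
    """Inverted dropout on the mixture logits, followed by softmax."""
    if training and drop_rate > 0.0:
        keep = 1.0 - drop_rate
        # Kept logits are rescaled by 1/keep; dropped logits are set to 0.
        logits = [v / keep if rng.random() < keep else 0.0 for v in logits]
    return softmax(logits)

rng = random.Random(0)
logits = [4.0, 0.5, 0.2, 0.1, 0.0]  # one dominant head (illustrative)
eval_w = mixture_weights(logits, 0.5, training=False, rng=rng)
train_w = mixture_weights(logits, 0.5, training=True, rng=rng)
```

Note that a dropped logit becomes 0 rather than minus infinity, so the corresponding head keeps a nonzero weight after the softmax. Dropping the softmax outputs instead would be a different design and would need renormalization.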

Awesome work!
I was curious about one thing though. In your implementation of Graves GMM attention, is there a typo at line 179? https://github.com/mozilla/TTS/blob/dev/layers/common_layers.py#L179

Following https://arxiv.org/pdf/1906.01083.pdf, shouldn't it be
phi_t = g_t.unsqueeze(-1) * torch.sigmoid((mu_t.unsqueeze(-1) - j) / sig_t.unsqueeze(-1))

That is actually true. Yet it worked? Thanks for the catch. I'll fix it and try again.

@erogol An unexpected but welcome surprise!
I've been trying to port your implementation to TensorFlow for my code, and for some reason the attention values very quickly die to values close to 0. Any suggestions on where I should look for the issue?

```python
import numpy as np
import tensorflow as tf  # TF 1.x API (tf.layers, variable scopes)
from tensorflow.python.ops import variable_scope

# Excerpt from an attention-mechanism class. `num_units`, `dtype`,
# `self.seq_len`, `self._alignments_size`, `self._batch_size`, and
# `_maybe_mask_score` are defined elsewhere in the poster's code.

def __init__(self, memory_sequence_length=None, training=True, name="GravesAttention"):
    self.training = training
    with tf.name_scope(name, 'GmmAttentionMechanismInit'):
        self._mask_value = 1e-8
        self.maybe_mask_score = lambda x: _maybe_mask_score(
            x, memory_sequence_length, self._mask_value)
    # Number of Gaussians in the mixture
    self.K = 5
    self.eps = 1e-5

    # Bias init for the concatenated [g, b, k] outputs: zeros, tens, ones
    bias_init = tf.constant_initializer(
        np.hstack([np.zeros(self.K), np.full(self.K, 10), np.ones(self.K)]))
    layer1 = tf.layers.Dense(units=num_units, activation="relu",
                             name="graves_attention_denselayer1",
                             trainable=True, dtype=dtype)
    layer2 = tf.layers.Dense(units=3 * self.K, bias_initializer=bias_init,
                             name="graves_attention_denselayer2",
                             trainable=True, dtype=dtype)
    self.dense_layer = lambda x: layer2(layer1(x))

    # Position grid, offset by 0.5
    self.J = tf.cast(tf.range(self.seq_len + 2), dtype=tf.float32) + 0.5

def __call__(self, query, state):
    seq_length = self._alignments_size
    mu_prev = state
    with variable_scope.variable_scope(None, "graves_attention", [query]):
        j = tf.slice(self.J, [0], [seq_length + 1])

        # Predict mixture weights (g), scales (b), and mean increments (k)
        gbk_t = self.dense_layer(query)
        g_t, b_t, k_t = tf.split(gbk_t, num_or_size_splits=3, axis=1)

        # Means can only move forward; scales stay positive
        mu_t = mu_prev + tf.math.softplus(k_t)
        sig_t = tf.math.softplus(b_t) + self.eps

        g_t = tf.layers.dropout(g_t, rate=0.5, training=self.training)
        g_t = tf.nn.softmax(g_t, axis=1) + self.eps

        x = (j - tf.expand_dims(mu_t, -1)) / tf.expand_dims(sig_t, -1)
        phi_t = tf.expand_dims(g_t, -1) * tf.nn.sigmoid(x)

        # Sum over mixture components
        alpha_t = tf.reduce_sum(phi_t, 1)

        # Discretize: difference between neighbouring positions
        a = tf.slice(alpha_t, [0, 1], [self._batch_size, seq_length])
        b = tf.slice(alpha_t, [0, 0], [self._batch_size, seq_length])
        alpha_t = a - b

        alpha_t = self.maybe_mask_score(alpha_t)

    next_state = mu_t
    return alpha_t, next_state
```

Not sure. Maybe you can try the broken version, as in my code.

If I use your version, the attention weights come out negative. It is weird.

I think I know what's happening. Your earlier implementation used a distribution that was monotonically decreasing, but your (mu_t - j) was flipped (possibly because you thought you were using exp instead of sigmoid), so it worked out just fine.
So just change mu_t - j to j - mu_t, and your values should be positive again.
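
The sign argument can be illustrated with a small sketch in plain Python. It assumes a single component with weight 1, a plain sigmoid, and discretization by differencing neighbouring positions; `cdf_differences` and the numeric values are illustrative and not taken from either implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cdf_differences(sign, mu, sigma, num_steps):
    """Discretized weights; sign=+1 uses (j - mu), sign=-1 uses (mu - j)."""
    curve = [sigmoid(sign * (j - mu) / sigma) for j in range(num_steps + 1)]
    return [curve[j + 1] - curve[j] for j in range(num_steps)]

# Hypothetical values chosen only for illustration.
flipped = cdf_differences(-1, mu=4.0, sigma=1.0, num_steps=8)  # sigmoid((mu - j) / sigma)
fixed = cdf_differences(+1, mu=4.0, sigma=1.0, num_steps=8)    # sigmoid((j - mu) / sigma)
```

Because sigmoid((mu - j) / sigma) decreases as j grows, every difference in `flipped` is negative, while `fixed` contains the same magnitudes with a positive sign.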

Yeah, that's a great catch. I totally missed that.

@Shikherneo2 I changed the implementation as you said, and I had the same problem. After 10K iterations all the alignments turn out to be zero.

@erogol That is very weird. I have tried a bunch of small tweaks, and the values always quickly go to zero. In my case they even go to zero with your earlier implementation.

In my case, the network goes to zero sometimes after 10K iterations and sometimes after 60K. I checked the layer statistics throughout training, but I could not see anything that explains it.

It is interesting. The function I used previously is a reverse sigmoid with a squashed range around 2/3. So mathematically it makes no sense but it worked.
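
A short sketch of the "reverse sigmoid" behaviour, in plain Python. It assumes the earlier function had the form `1 / (1 + sigmoid((mu - j) / sigma))`, which is the expression quoted later in this thread; whether that matches the earlier code exactly is an assumption, and the numeric values are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reverse_sigmoid(x):
    """1 / (1 + sigmoid(x)): decreasing in x, bounded in (1/2, 1)."""
    return 1.0 / (1.0 + sigmoid(x))

# Hypothetical values chosen only for illustration.
mu, sigma, num_steps = 4.0, 1.0, 8
curve = [reverse_sigmoid((mu - j) / sigma) for j in range(num_steps + 1)]
alpha = [curve[j + 1] - curve[j] for j in range(num_steps)]

midpoint = reverse_sigmoid(0.0)  # value at j == mu
total = sum(alpha)               # telescopes to curve[-1] - curve[0]
```

The value at j = mu is 2/3, and since the curve stays inside (1/2, 1), the discretized weights are positive but sum to less than 0.5. That may be related to the 0.5 ceiling on the attention values reported earlier in the thread.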

What's the benefit of discretizing the attention weights? Why not use the original version directly?

It mathematically makes more sense to me and it works better.

> @erogol That is very weird. I have tried a bunch of small tweaks, and the values always quickly go to zero. In my case they even go to zero with your earlier implementation.

Did you solve this? I ran into the same problem.

@WhiteFu No. I wasn't able to. When I looked at the statistics, I realized that the encoder gradients were going to zero after a few thousand iterations. So I added a highway network (like in Tacotron-1), which stabilized the training. But the weights still all go to zero.

@Shikherneo2 this is weird, I will follow up and let you know if there is any progress!

Should I reopen the issue if anyone is working on it?

@erogol
I am confused about the Graves attention.
The Graves attention code just uses sigmoid instead of exp as in the paper, right?
[screenshot]
phi_t = g_t.unsqueeze(-1) * (1 / (1 + torch.sigmoid((mu_t.unsqueeze(-1) - j) / sig_t.unsqueeze(-1))))

Is this version the correct one? I tried Graves attention in my own TTS work (I only added a while loop to process all time steps), but the alignment failed. I am trying to figure out the problem.
I used K=8; is that too large?
The mask in "alpha_t.data.masked_fill_" should look like [False, False, False, ..., True, True] to mask the padding, right?
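
For reference, PyTorch's `masked_fill_` overwrites the elements where the mask is True, so a mask meant to hide padding is True at the padded positions. The sketch below is a plain-Python stand-in for that behaviour, not PyTorch itself; `padding_mask` and `masked_fill` are hypothetical helpers, and the fill value 1e-8 is borrowed from `_mask_value` in the TensorFlow port above.

```python
def padding_mask(lengths, max_len):
    """True marks padded positions, i.e. the positions to overwrite."""
    return [[t >= n for t in range(max_len)] for n in lengths]

def masked_fill(rows, mask, value):
    """Return a copy of `rows` with `value` wherever `mask` is True."""
    return [[value if m else a for a, m in zip(row, mask_row)]
            for row, mask_row in zip(rows, mask)]

# Two sequences of length 3 and 5, padded to 5 encoder steps (illustrative).
lengths = [3, 5]
alpha = [[0.2] * 5, [0.2] * 5]
mask = padding_mask(lengths, max_len=5)
filled = masked_fill(alpha, mask, 1e-8)
```

Here `mask[0]` is `[False, False, False, True, True]`, so only the last two positions of the first sequence are overwritten.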

In Mozilla/TTS, Graves attention is discrete. You can now use the code in this repo to implement DCA or GMM attention.