syang1993 / gst-tacotron

A TensorFlow implementation of the paper "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

Sample Alignment Graph

fazlekarim opened this issue

Hi,

Can you share the alignment graphs that you are obtaining for your audio samples? For most of my alignments, the y-axis is about half of the x-axis. Is there a reason why this is happening? In Keithito's repo, the shared alignment graphs have a 1-1 scale. In other words, the range of the x-axis and the y-axis is the same.

@fazlekarim You can find them in the demo page dir:
https://github.com/syang1993/syang1993.github.io/tree/master/gst-tacotron/style-samples

In keithito's Tacotron, reduce_factor is 5, in which case the number of characters and the number of frames are similar. In this repo the reduce factor is 2, so the mel-spectrogram is about twice as long as the text.

@fazlekarim I have the same problem as you: the y-axis is about half the length of the x-axis, or even more. How did you solve it?

@syang1993, in my case, all the alignment graphs generated at each checkpoint (every 1000 steps) look the way @zyj008 describes. I attach a sample PNG:

[Image: gst-step-1147000-align]

If I use the regular Tacotron from keithito, the ranges of both axes turn out to be about the same.

Do you have an explanation?

@abuvaneswari Hi, as I described above, the x-axis is the length of the mel-spectrogram and the y-axis is the number of characters. The alignment path (attention matrix) only shows the weights between each character and each frame. In your attached image there are about 70 characters, and the corresponding audio has about 250 frames. I use reduce_factor=2, so the x-axis length is about 125. If you use reduce_factor=5, as in Keithito's repo, it is about 50.
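
For anyone who wants to check the arithmetic, here is a minimal sketch of the relationship described above. The function name `decoder_steps` is hypothetical (it is not taken from either repo), and rounding up for frame counts that do not divide evenly is an assumption made for illustration.

```python
import math


def decoder_steps(num_frames: int, reduce_factor: int) -> int:
    """Approximate x-axis length of the alignment plot.

    Assumes each decoder step emits `reduce_factor` mel frames, and that
    a partial final group still costs one step (hence the ceiling).
    """
    if num_frames < 0 or reduce_factor < 1:
        raise ValueError("num_frames must be >= 0 and reduce_factor >= 1")
    return math.ceil(num_frames / reduce_factor)


# Numbers quoted in the reply above: about 70 characters, about 250 frames.
num_chars = 70
num_frames = 250

for r in (2, 5):
    steps = decoder_steps(num_frames, r)
    print(f"reduce_factor={r}: x-axis ~ {steps} steps, y-axis ~ {num_chars} characters")
```

With reduce_factor=2 this gives 125 steps against 70 characters, so the x-axis is nearly twice the y-axis. With reduce_factor=5 it gives 50 steps, which is close to the character count and explains the roughly square plots from keithito's repo.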