Sample Alignment Graph

Question

fazlekarim opened this issue 6 years ago

Hi,

Can you share the alignment graphs that you are obtaining for your audio samples? For most of my alignments, the y-axis is about half of the x-axis. Is there a reason why this is happening? In Keithito's repo, the shared alignment graphs have a 1-1 scale. In other words, the range of the x-axis and the y-axis is the same.

Shan Yang · Answer 1 · Sat Jul 07 2018 10:20:22 GMT+0800 (China Standard Time)

@fazlekarim You can find them in the demo page dir:
https://github.com/syang1993/syang1993.github.io/tree/master/gst-tacotron/style-samples

In keithito's tacotron, reduce_factor is 5, in which case the number of characters and the number of frames on the plot are similar. In this repo, reduce_factor is 2, so the mel-spec is about 2 times longer than the text.

zyj008 · Answer 2 · Thu Aug 16 2018 14:37:50 GMT+0800 (China Standard Time)

@fazlekarim I have the same problem as you: the y-axis is about half of the x-axis, or even more. How did you solve the problem?

abuvaneswari · Answer 3 · Fri Nov 16 2018 11:44:34 GMT+0800 (China Standard Time)

@syang1993, in my case, all the alignment graphs generated at each checkpoint (every 1000 steps) turn out the way @zyj008 describes. I attach a sample png:

If I use regular Tacotron from keithito, the range of both axes turns out to be right about the same.

Do you have an explanation?

Shan Yang · Answer 4 · Fri Nov 16 2018 12:07:40 GMT+0800 (China Standard Time)

@abuvaneswari Hi, as I described above, the x-axis is the length of the mel-spectrum and the y-axis is the number of characters. The alignment path (attention matrix) only shows the weights between each character and each frame. In your attached image, there are about 70 characters, and the corresponding audio has about 250 frames. I use reduce_factor=2, so the x-axis length is about 250 / 2 = 125. If you use reduce_factor=5, as in Keithito's repo, it is about 250 / 5 = 50.
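
To make the arithmetic concrete, here is a minimal sketch of how the x-axis length follows from the frame count and the reduction factor. The helper name `decoder_steps` is hypothetical and is not taken from either repo. Rounding up is an assumption, on the basis that a partial final group of frames still needs one decoder step.

```python
import math


def decoder_steps(num_frames: int, reduce_factor: int) -> int:
    """Approximate x-axis length of the alignment plot.

    Hypothetical helper, not from either repo. Assumes the decoder emits
    `reduce_factor` mel frames per step, so the number of steps is the
    frame count divided by the reduction factor, rounded up.
    """
    if num_frames < 0 or reduce_factor < 1:
        raise ValueError("num_frames must be >= 0 and reduce_factor >= 1")
    return math.ceil(num_frames / reduce_factor)


# Numbers from the example in this thread: about 70 characters, about 250 frames.
num_chars = 70
num_frames = 250

for r in (2, 5):
    steps = decoder_steps(num_frames, r)
    print(f"reduce_factor={r}: x-axis ~ {steps} steps, y-axis ~ {num_chars} characters")
```

With these numbers the x-axis comes out at 125 steps for reduce_factor=2 and 50 steps for reduce_factor=5, against roughly 70 characters on the y-axis. That is why the plot looks much wider than tall in this repo and closer to square with keithito's settings.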