deterministic-algorithms-lab / Cross-Lingual-Voice-Cloning

Tacotron 2 - PyTorch implementation with faster-than-realtime inference, modified to enable cross-lingual voice cloning.

Attention Alignment Not Working

jinny1208 opened this issue · comments

commented

I am currently training the provided model with Korean and English datasets, with a total of 27 speakers.
As stated in the README.md, I added Korean to "symbols" as follows:

[Image: Korean characters added to the "symbols" list]
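For reference, the change described above might look something like this. This is a sketch assuming the NVIDIA Tacotron 2 `text/symbols.py` layout (`_pad`, `_punctuation`, `_letters`, `symbols`); the exact variable names and the chosen Korean character set in this repo may differ:

```python
# Hypothetical sketch of extending text/symbols.py with Korean characters.
# Variable names follow the NVIDIA Tacotron 2 layout (an assumption).
_pad = "_"
_punctuation = "!'(),.:;? "
_letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

# One common choice for Korean is the Hangul Compatibility Jamo block
# (U+3131..U+3163), used when the text front-end decomposes syllables
# into individual jamo before lookup.
_korean = "".join(chr(c) for c in range(0x3131, 0x3164))

symbols = [_pad] + list(_punctuation) + list(_letters) + list(_korean)

# Symbols must be unique: the symbol-to-id table is built from this list.
assert len(symbols) == len(set(symbols))
```

Whatever character set is chosen, it must match what the text cleaner emits, or out-of-vocabulary characters will be silently dropped or raise lookup errors.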

The problem is that even after training the model for over 45,000 steps, the attention alignment is not forming.

[Image: attention alignment plot after 45,000+ steps, no alignment visible]

The target and predicted mel-spectrograms seem similar enough.

[Image: target and predicted mel-spectrograms]

To anyone who has used this repo and to @Jeevesh8 , how long does it normally take for the attention to start aligning properly? Should I continue training?

Any help and advice would be greatly appreciated.

@jinny1208 This is a commonly observed phenomenon [see here, for example]. The main reason, I think, is that the mel-spectrogram is predicted frame by frame (auto-regressively), so even if the model merely learns to copy the previous frame of the sequence (without learning anything about alignment), it can already lower the loss quite a bit. You need to train longer.
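The copy-the-previous-frame shortcut is easy to demonstrate on synthetic data: because adjacent frames of a spectrogram are highly correlated, predicting each frame as a copy of its predecessor achieves a much lower MSE than a no-information baseline, so the loss can drop well before attention aligns. A minimal sketch on toy data (not real mels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "mel spectrogram": 80 mel bins x 200 frames, smoothed along time so
# that adjacent frames are strongly correlated, as in real speech.
noise = rng.standard_normal((80, 220))
kernel = np.ones(21) / 21
mel = np.apply_along_axis(lambda row: np.convolve(row, kernel, mode="valid"), 1, noise)

# Degenerate model: predict each frame as a copy of the previous frame.
copy_prev_mse = np.mean((mel[:, 1:] - mel[:, :-1]) ** 2)

# No-information baseline: predict the global mean everywhere.
mean_mse = np.mean((mel - mel.mean()) ** 2)

# Copying the previous frame beats the mean baseline by a wide margin,
# even though it uses no alignment information at all.
print(copy_prev_mse, mean_mse)
```

This is why a plausible-looking predicted mel-spectrogram is not, on its own, evidence that attention has formed.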

I would also suggest initialising from Tacotron 2 English pre-trained weights for faster alignment.
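Warm-starting from English weights requires skipping any tensors whose shapes changed, most notably the symbol embedding, whose vocabulary dimension grew when the Korean symbols were added. A hedged sketch of the filtering step (the function and layer names here are illustrative, not this repo's API); in PyTorch you would then do `state = model.state_dict(); state.update(kept); model.load_state_dict(state)`:

```python
def filter_warm_start(pretrained, own, ignore_layers=("embedding.weight",)):
    """Return the subset of pre-trained tensors that is safe to load into
    the new model's state dict: same key, same shape, and not explicitly
    ignored (e.g. the symbol embedding, whose first dimension grew when
    Korean symbols were added). Works on any mapping of name -> array-like
    with a .shape attribute, so it applies unchanged to PyTorch tensors.
    """
    return {
        name: tensor
        for name, tensor in pretrained.items()
        if name not in ignore_layers
        and name in own
        and tensor.shape == own[name].shape
    }
```

Layers that are skipped (the embedding, plus any layers new to the multilingual model) simply keep their fresh random initialisation, which is usually fine since the bulk of the encoder/decoder transfers.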

commented

Hi, thanks for your answer! I will train longer and update the results here.

commented

[Image: attention alignment plot showing partial alignment]

The alignment is working to some degree. I probably need to train longer and do something else to get clearer and more robust alignments.