syang1993 / gst-tacotron

A TensorFlow implementation of "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"


Poor alignment on out-of-collection test data

butterl opened this issue

Hi @syang1993, I ran a training test for 170K+ steps, and the output on training-set text is the best among the many Tacotron implementations I've tried.
But I notice that with out-of-collection data the alignment is always a mess. I tried Rayhane-mamah's repo (good rhythm but bad voice), and there out-of-collection data always gets good rhythm. Do you have any advice on that? The reference audio is from the training data.

The alignment map on training data: [image]

The alignment map on out-of-collection data: [image]

Hi, what do you mean by out-of-collection? Is the reference from the test set of the same speaker, or from a different speaker? In my experiments with Tacotron 1, I tried unseen text and unseen reference audio (same speaker), and the results were good. But sometimes (especially on long sentences) it will lose words. Using a better attention mechanism will improve that.

I guess you mean you added the style attention part to Tacotron 2, and the generated audio has good rhythm but bad quality? Can you share the results? Another person in our lab is also trying the style part with Tacotron 2 these days. I may share it after I get back to school.
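For context, the style part is essentially a bank of learned token embeddings that the reference encoder output attends over; the resulting style embedding then conditions the decoder. A rough single-head sketch in TensorFlow (the paper and this repo use multi-head attention, and the names and sizes below are illustrative, not the exact code):

```python
import tensorflow as tf

# Hypothetical single-head sketch of a global style token (GST) layer.
# `reference_encoding` is the reference encoder output, shape [N, ref_dim].
def gst_layer(reference_encoding, num_tokens=10, token_dim=256):
    # Bank of randomly initialized style token embeddings, trained jointly
    # with the rest of the model (no style labels required).
    tokens = tf.get_variable(
        'style_tokens', shape=[num_tokens, token_dim],
        initializer=tf.truncated_normal_initializer(stddev=0.5))
    tokens = tf.tanh(tokens)

    # Content-based attention: the reference encoding is the query,
    # the token bank supplies the keys/values.
    query = tf.layers.dense(reference_encoding, token_dim, use_bias=False)  # [N, token_dim]
    scores = tf.matmul(query, tokens, transpose_b=True)                     # [N, num_tokens]
    weights = tf.nn.softmax(scores / (token_dim ** 0.5))

    # Weighted sum of tokens = the style embedding; it is broadcast and
    # concatenated to the text encoder outputs to condition the decoder.
    style_embedding = tf.matmul(weights, tokens)                            # [N, token_dim]
    return style_embedding
```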

@syang1993 thanks for the reply!

The training dataset is more like news or poetry (THCHS-30), and by "out-of-collection" I mean unseen text (more like colloquial statements). The GST Tacotron produces very good output on training-set text, but turns into a mess on unseen text. This doesn't happen with Rayhane-mamah's repo: its samples get good rhythm on unseen text with the same training set, but the voice quality is bad. So I'm looking for a way to combine good voice and good rhythm (WaveNet is one solution, but it's too slow, and I could not reproduce good results after a disk failure :( ).

I'm comparing the audio quality between the Tacotron implementations (yours generates the best Tacotron output of all my tests, with no background echo compared to the others), and I haven't started porting code yet (different authors have very different code styles, and I'm not sure which one to use as a base version).

From my perspective, Tacotron 2 + GST (+ maybe GAN, speaker embeddings, etc.) would be good. I would be very glad if you could share the Tacotron 2 + style implementation; I'd like to help test it on my machine and give feedback to move it along faster.

BRs,
Butter

@butterl Hi, on the demo page of this repo, I believe the sentences with ids 6-10 are unseen text. Did you try other attention methods, such as the location-sensitive one? It should improve the alignments, since your alignments on out-of-collection data seem to 'repeat'.
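If you want to try it, location-sensitive attention (Chorowski et al., 2015) is just Bahdanau content-based attention plus convolutional features computed from the previous alignments, which discourages re-attending to the same encoder steps. A rough TF 1.x sketch (the names and hyperparameters are illustrative, not from this repo):

```python
import tensorflow as tf
from tensorflow.contrib.seq2seq import BahdanauAttention

class LocationSensitiveAttention(BahdanauAttention):
    """Sketch of location-sensitive attention: Bahdanau attention extended
    with convolutional features over the previous alignments."""

    def __init__(self, num_units, memory, filters=32, kernel_size=31,
                 name='LocationSensitiveAttention'):
        super(LocationSensitiveAttention, self).__init__(
            num_units=num_units, memory=memory, name=name)
        self.location_conv = tf.layers.Conv1D(
            filters, kernel_size, padding='same', use_bias=False,
            name='location_features_conv')
        self.location_layer = tf.layers.Dense(
            num_units, use_bias=False, name='location_features_layer')

    def __call__(self, query, state):
        # `state` holds the previous alignments, shape [batch, max_time].
        # (Accumulating alignments across steps also works and can further
        # reduce repetition.)
        with tf.variable_scope(None, 'location_sensitive_attention', [query]):
            processed_query = self.query_layer(query) if self.query_layer else query
            processed_query = tf.expand_dims(processed_query, 1)         # [N, 1, units]
            # Convolve the previous alignments to get location features.
            location_features = self.location_conv(
                tf.expand_dims(state, axis=2))                            # [N, T, filters]
            location_features = self.location_layer(location_features)   # [N, T, units]
            v = tf.get_variable('attention_v', [self._num_units], dtype=tf.float32)
            # Energy mixes the content term (keys + query) with the location term.
            energy = tf.reduce_sum(
                v * tf.tanh(self.keys + processed_query + location_features), axis=2)
        alignments = self._probability_fn(energy, state)
        return alignments, alignments
```

It should drop in wherever the decoder's attention mechanism is constructed, in place of the plain BahdanauAttention.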

For the Tacotron 1 implementation, you can use keithito's repo and fix some issues as in keithito/tacotron#182. It will give good results.

I will be back at school next Monday; then I will test both the Tacotron 2 and FFTNet code. When I finish, I can share it with you.