deterministic-algorithms-lab / Cross-Lingual-Voice-Cloning

Tacotron 2 - PyTorch implementation with faster-than-realtime inference, modified to enable cross-lingual voice cloning.

Input text type

leijue222 opened this issue

The website example is awesome.

I have a question about the input text type.
For Mandarin, I know the following types:

  1. kǎěrpǔ péi wàisūn wán huátī。
  2. ka3er3pu3 pei2 wai4sun1 wan2 hua2ti1。
  3. k a3 er3 p u3 p ei2 w ai4 s un1 w an2 h ua2 t i1 。

Could you tell me which of the above types you used?
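For reference, here is a minimal sketch of how forms 2 and 3 can be generated with the pypinyin package (not used in this repo, just for illustration; the hanzi below is my reconstruction of the romanized example):

```python
# Sketch: generating tone-numbered pinyin (form 2) and initial/final
# splits (form 3) with pypinyin. Not part of this repo.
from pypinyin import lazy_pinyin, Style

text = "卡尔普陪外孙玩滑梯"  # hanzi reconstructed from the example above

# Form 2: whole syllables with tone digits, e.g. "ka3 er3 pu3 ..."
form2 = lazy_pinyin(text, style=Style.TONE3)
print(" ".join(form2))

# Form 3: initial and final (with tone digit) as separate tokens,
# e.g. "k a3 er3 p u3 ..."
initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)
finals = lazy_pinyin(text, style=Style.FINALS_TONE3, strict=False)
form3 = []
for ini, fin in zip(initials, finals):
    if ini:  # zero-initial syllables (e.g. "er3") have no initial token
        form3.append(ini)
    form3.append(fin)
print(" ".join(form3))
```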

For English:
I just use plain English words, such as: "in being comparatively modern".

How about you?


I just read your paper: you tried three input types (Characters, UTF-8 Encoded Bytes, and Phonemes), and Phonemes worked best.
In fact, I have also trained with phonemes before. I used the Python package phonemizer==2.1 to generate phonemes for Mandarin, and some of the generated phonemes had no tone. So the result was just like your paper said:

CN raters commented that it sounded like “a foreigner speaking Chinese”

The reason is that the model judges tone poorly, so it cannot distinguish the four tones of Mandarin.
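For completeness, this is roughly how I generated the Mandarin phonemes (a sketch; the espeak backend and the language code are my own choices):

```python
# Sketch: how I called phonemizer==2.1 for Mandarin.
# The espeak backend and the 'cmn' language code are my choices here
# (older espeak versions may use 'zh'); in my runs some of the output
# tokens carried no tone information.
from phonemizer import phonemize
from phonemizer.separator import Separator

text = "卡尔普陪外孙玩滑梯。"
phones = phonemize(
    text,
    language="cmn",
    backend="espeak",
    separator=Separator(word="| ", phone=" "),
    strip=True,
)
print(phones)
```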

Therefore, what tool do you use to generate phonemes for each language?
And finally, could you give me an example of train.txt? Like this:

<path-to-wav-file>|b ao3 an1 y ong4 sh ou3 q ia1 zh u4 j i4 zh e3 b o2 z i q iang3 x iang4 j i1 。|0|Mandarin
<path-to-wav-file>|it had arrangements to be notified about release from confinement in roughly one thousand cases;|1|English
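For reference, I imagine a line in that format would be parsed with something like this (the field names are my guesses, not from the repo):

```python
# Sketch: parsing one pipe-separated filelist line of the form
#   <path-to-wav-file>|<text>|<speaker id>|<language>
# Field names are guesses; adapt to whatever the repo's loader expects.
def parse_filelist_line(line: str) -> dict:
    wav_path, text, speaker_id, language = line.rstrip("\n").split("|")
    return {
        "wav_path": wav_path,
        "text": text,
        "speaker_id": int(speaker_id),
        "language": language,
    }

with open("train.txt", encoding="utf-8") as f:
    examples = [parse_filelist_line(line) for line in f if line.strip()]
```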

The code seems incomplete; for example, the preprocessing of Mandarin and English data is missing.

The README is not very detailed, so I don't know how to train code-switching between Mandarin and English.

@leijue222 I am sorry to say this, but I am not the original author of the paper. I couldn't find any open-source implementation of the paper, so I made one. I work with Indic languages and English, so I don't know about Mandarin, and the repo is currently just a basic implementation of the model.

1.) I think the standard way to generate phonemes is to use MFA (Montreal Forced Aligner), as shown here. You can also try WikiPron. Using these two tools, you can train grapheme-to-phoneme (G2P) models that capture word-specific tone too, provided you give them G2P data that has tone in it.
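For illustration, this is the kind of tone-annotated G2P data I mean; the entries and format below are just a sketch (following the initial/final-with-tone convention from your examples), not from any released lexicon:

```python
# Sketch: loading a tone-annotated pronunciation lexicon of the form
#   <word><TAB><phone> <phone> ...
# The entries are illustrative only.
SAMPLE_LEXICON = "外孙\tw ai4 s un1\n滑梯\th ua2 t i1\n"

def load_lexicon(text: str) -> dict:
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        word, phones = line.split("\t")
        lexicon[word] = phones.split()
    return lexicon

lexicon = load_lexicon(SAMPLE_LEXICON)
print(lexicon["滑梯"])  # ['h', 'ua2', 't', 'i1']
```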

2.) If you can't find G2P data with tones for training a G2P model, then introducing a latent variable corresponding to tone in the residual encoder and using more Mandarin data would be a natural way to improve performance on Mandarin.
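To make that concrete, here is a rough PyTorch sketch (my own, not from the paper or this repo) of a residual encoder with an extra tone latent; all sizes and names are assumptions:

```python
# Sketch: a variational residual encoder with an extra latent head
# intended to absorb tone variation. Sizes and names are illustrative.
import torch
import torch.nn as nn


class ResidualEncoderWithTone(nn.Module):
    def __init__(self, mel_dim=80, hidden_dim=256, latent_dim=16, tone_dim=4):
        super().__init__()
        self.rnn = nn.GRU(mel_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.residual_head = nn.Linear(2 * hidden_dim, 2 * latent_dim)  # mu, logvar
        self.tone_head = nn.Linear(2 * hidden_dim, 2 * tone_dim)        # mu, logvar

    @staticmethod
    def _reparameterize(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

    def forward(self, mels):
        # mels: (batch, frames, mel_dim) reference mel-spectrogram.
        _, h = self.rnn(mels)                      # h: (2, batch, hidden_dim)
        summary = torch.cat([h[0], h[1]], dim=-1)  # (batch, 2 * hidden_dim)
        z_res, mu_r, logvar_r = self._reparameterize(self.residual_head(summary))
        z_tone, mu_t, logvar_t = self._reparameterize(self.tone_head(summary))
        # The concatenated latent would be broadcast to the decoder steps;
        # the (mu, logvar) pairs feed the KL terms of the training loss.
        return torch.cat([z_res, z_tone], dim=-1), (mu_r, logvar_r), (mu_t, logvar_t)


# Example: a batch of 2 reference mels, 120 frames each.
z, kl_res, kl_tone = ResidualEncoderWithTone()(torch.randn(2, 120, 80))
print(z.shape)  # torch.Size([2, 20])
```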

You could consider emailing the original authors if you want implementation-specific details that aren't mentioned in the paper.

Also, I would be grateful if you would consider going through the codebase and improving it for the Mandarin-specific case.