seastar105 / pflow-encodec

Implementation of a TTS model based on NVIDIA's P-Flow TTS paper


performance

yiwei0730 opened this issue

I would like to ask how much data you used to achieve such good results, and how long the training took.

The LibriTTS model was trained for about 2 days (265k steps), with a batch duration of 100 and 4-step gradient accumulation, on a single RTX 4090.

I've added a bit more detail.

@seastar105 Pure discussion:
Judging from the speech test results I have at hand, speaker similarity when using your EJK pre-trained model to synthesize speech outside the training data seems a bit off. I don't know if it is because the number of speakers included in the training set is too small, which leads to poor synthesis results when a new reference is used.

@yiwei0730
I haven't checked this properly yet, but the currently released multi-language model does seem to have much poorer zero-shot capability than the mono-lingual models. The English model trained on LibriTTS-R shows quite good zero-shot performance.

I'm going to check it this weekend using speaker-embedding cosine similarity, t-SNE, and a language classifier on the speaker embeddings, to see whether the speaker embedding carries too much language information, which may be what's hurting zero-shot.
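As a rough sketch of the kind of analysis I mean (the `extract_speaker_embedding` helper and the `wav_paths` / `lang_labels` lists are placeholders, not code from this repo):

    # Placeholder helper: any function returning one speaker embedding
    # (1-D numpy array) per utterance, e.g. this repo's speaker encoder
    # or an external verification model.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.manifold import TSNE
    from sklearn.model_selection import cross_val_score

    embeddings = np.stack([extract_speaker_embedding(p) for p in wav_paths])  # (N, D)
    languages = np.array(lang_labels)  # "en" / "ja" / "ko" per utterance

    # cosine similarity between any two utterances
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # t-SNE projection to 2-D to eyeball whether embeddings cluster by language
    proj = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)

    # language classifier on speaker embeddings: high accuracy suggests the
    # embeddings leak a lot of language information
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          embeddings, languages, cv=5).mean()
    print(f"language-classification accuracy from speaker embeddings: {acc:.3f}")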

Of course, performance will improve if you scale the dataset and model properly; many works already show that (e.g. VALL-E, HAM-TTS, MEGA-TTS 2, Audiobox).

What I'm thinking about is how much speaker information the prompt you take from the tokenizer latent actually contains. NS2, it seems, extracts the prompt from the original audio file (I'm not saying this is better). Most of the problems in my current testing seem to lie in speaker similarity.

I am currently retraining the Chinese and English models from scratch (without the EJK checkpoint you provided). I want to see what the results of retraining look like first, and then see how fine-tuning behaves.

I think a more sophisticated training process or architecture would be necessary for performant cross-lingual zero-shot.

First, I found the Korean and Japanese datasets have quite a small number of speakers (50-60) compared to LibriTTS-R (>1000). So even though the Korean and Japanese corpora are larger, it seems the model's speaker-encoder capability mainly comes from LibriTTS. Also, LibriTTS and the AIHub dataset have really different acoustic environments (studio vs. LibriVox).

Second, because of these dataset differences, cross-lingual zero-shot becomes much harder for the model.
I measured speaker similarity with WavLM-TDNN, as VALL-E and P-Flow did, and the results are shown below.

The format is {text_language}_{prompt_language}; each pair was evaluated on 100 utterances and the scores averaged.
The prompts used for evaluation share the same recording environment: English prompts come from LibriTTS test-clean, while Korean and Japanese prompts are studio recordings (from AIHub and JVS).
Note that prompts were not reconstructed with EnCodec and MBD, so scores might be higher with reconstructed prompts.

multi-lingual model

{
	"en_en": 0.47815948724746704,
	"en_ja": 0.2348366528749466,
	"en_ko": 0.22839537262916565,
	"ja_en": 0.2402830570936203,
	"ja_ja": 0.498717337846756,
	"ja_ko": 0.4511962234973905,
	"ko_en": 0.23162104189395905,
	"ko_ja": 0.4218471646308899,
	"ko_ko": 0.4363690912723541
}

LibriTTS model

{
	"en_en": 0.47299426794052124,
	"en_ja": 0.23013409972190857,
	"en_ko": 0.21702246367931366
}

Intra-lingual generation has reasonable performance, but cross-lingual generation, especially for en <-> {ko, ja}, is considerably worse.
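For reference, a minimal sketch of how this {text_language}_{prompt_language} averaging might be implemented, assuming a hypothetical `speaker_embedding()` wrapper around a verification model such as WavLM-TDNN and a `pairs` dict holding 100 (generated, prompt) path tuples per language pair:

    import itertools
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    results = {}
    for text_lang, prompt_lang in itertools.product(["en", "ja", "ko"], repeat=2):
        sims = [cosine(speaker_embedding(gen), speaker_embedding(prompt))
                for gen, prompt in pairs[(text_lang, prompt_lang)]]
        results[f"{text_lang}_{prompt_lang}"] = float(np.mean(sims))
    print(results)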

So, my answer to

@seastar105 Pure discussion: Judging from the speech test results I have at hand, speaker similarity when using your EJK pre-trained model to synthesize speech outside the training data seems a bit off. I don't know if it is because the number of speakers included in the training set is too small, which leads to poor synthesis results when a new reference is used.

The EJK model's poor performance mainly comes from the large dataset difference between English and {Korean, Japanese}, and from the small number of speakers in the Korean and Japanese datasets. We need LibriTTS-level open TTS datasets for each language.

Thank you for your test. I suppose the questions become how many speakers we need, whether the data should be balanced across languages, and whether more data would actually help training.

Report on the latest status:
I used 100,000 utterances each in Chinese (300 speakers) and English (1,000 speakers), 200,000 in total, for training.
I found the sound quality is good (although some sounds are occasionally mispronounced), but the main remaining problem is speaker similarity, even though it is better than the other models I have trained.
I test the similarity using:

    from transformers import AutoFeatureExtractor, WavLMForXVector

    feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
    model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")
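Continuing from the snippet above, a minimal sketch of how the pairwise score can be computed with this checkpoint (the file paths are placeholders):

    import librosa
    import torch

    # "prompt.wav" / "generated.wav" are placeholder paths; both are loaded
    # at 16 kHz here since wavlm-base-plus-sv expects 16 kHz input
    prompt, _ = librosa.load("prompt.wav", sr=16000)
    generated, _ = librosa.load("generated.wav", sr=16000)

    inputs = feature_extractor([prompt, generated], sampling_rate=16000,
                               padding=True, return_tensors="pt")
    with torch.no_grad():
        embeddings = model(**inputs).embeddings
    embeddings = torch.nn.functional.normalize(embeddings, dim=-1)

    similarity = torch.nn.functional.cosine_similarity(
        embeddings[0], embeddings[1], dim=-1)
    print(float(similarity))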
I tested the audio at 24 kHz and 16 kHz. Although the scores were mostly very high, the generated speech only sounds somewhat similar to the reference, with noticeable differences.
24 kHz  16 kHz
0.87    0.58
0.97    0.94
0.97    0.92
0.53    0.89
0.75    0.88
0.96    0.55
0.86    0.75
0.97    0.91
Do I need to add a feature-extraction function for the reference audio files, or do I simply need more speakers (currently 1,300) plus the per-language balancing mentioned last time? (This time the zero-shot output doesn't sound like it has a foreign accent.)

@yiwei0730 Thanks for sharing your experience.
Glad to hear the trained model produced a better result, though sad to hear it's still not good enough. I have some questions and a suggestion based on your reply:

  1. First, I suggest you try stopes, since WavLM-TDNN (built on WavLM-large) is the model used by VALL-E, P-Flow, Voicebox, AudioLM, and several other works from big tech companies, so you can compare model performance directly. Much less published work uses wavlm-base-plus-sv, so I would compute similarity with WavLM-TDNN from WavLM-large instead of wavlm-base-plus-sv.
  2. What is the difference between 24K and 16K? WavLM requires its input speech to be at a 16 kHz sample rate. Do you mean the sample rate of the prompt speech used for generation?
  3. Regarding "Do I need to add a feature extraction function for ref audio files, or simply do I need to use more speakers (currently 1300 speakers)": what do you mean by a feature-extraction function? The model in this repo has a speaker encoder, which extracts a speaker embedding from the raw waveform. Also, is your speaker-similarity evaluation cross-lingual?

It would be great if you could answer the three questions above.

By the way, I'm training a 1M-step run on LibriTTS-R to verify the model's capability for intra-lingual zero-shot TTS, at least matching the P-Flow paper.

For cross-lingual cloning, I think architecture changes and a different training method are necessary.

  1. Yes, the x-vector similarity calculated by wavlm-base-plus-sv may not be accurate, and I did not check cross-language, because in some cases the timbre is similar yet sounds off; a similarity of 0.9 is indeed a bit suspicious. I'd like to ask how to use WavLM-TDNN; I didn't see it in WavLM's config.

  2. 24K and 16K refer to the sample rate of the audio file. Since the WavLM model only takes a 1-D waveform, I tried both 24 kHz and 16 kHz audio files for the similarity test.

  3. Yes, this model has a speaker encoder. My understanding is that it takes the latent sequence produced by the tokenizer, cuts out part of that latent as the prompt, and then feeds the prompt into the model (rough sketch of that slicing below).
    Since the sound quality in my current training is good (there are a few mispronunciations, but probably not a big issue), the main problem is similarity. Currently I test Mandarin against Mandarin; if you need cross-lingual numbers I can provide them, though I should probably switch to the WavLM-TDNN you recommended, which may be more accurate.
    I'm thinking about how to increase speaker similarity, perhaps borrowing from HierSpeech's architecture or NaturalSpeech 2.
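A rough sketch of that prompt slicing over EnCodec latents (the function name, 3-second prompt length, and 75 Hz frame rate are my assumptions based on the P-Flow paper and 24 kHz EnCodec, not necessarily how this repo implements it):

    import torch

    def slice_prompt(latent: torch.Tensor,
                     frame_rate: float = 75.0,  # assumed 24 kHz EnCodec latent frame rate
                     prompt_sec: float = 3.0) -> torch.Tensor:
        """Cut a random contiguous chunk of the latent sequence to use as the
        speaker prompt. latent: (T, D) latent frames of one utterance."""
        prompt_len = int(frame_rate * prompt_sec)
        if latent.size(0) <= prompt_len:
            return latent
        start = torch.randint(0, latent.size(0) - prompt_len, (1,)).item()
        return latent[start:start + prompt_len]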

But you mentioned something interesting. I may have to test pure single-language models first, and then the mixed-language model. Next week I should start trying to remove the English data and train a Chinese-only model. (Currently my Chinese-English model is at 315,000 steps, with batch size 300 and 4-step gradient accumulation.)

Finally, I am also thinking about whether to add some feature parameters. Recently it seems the model still needs to be scaled up to achieve better results. If you have any ideas for adding modules or suggestions for training methods, I would be happy to hear them.

  1. To use WavLM-TDNN, check the link I mentioned above as stopes; that repository offers an easier interface than the original UniSpeech one.
  2. Thanks. That seems like a reasonable result; 24K is better than 16K since the model is trained on 24 kHz audio.
  3. I'm not sure how to increase speaker similarity. Injecting prompt information on both the encoder and decoder side through a separate prompt encoder may improve it; that's what I'm considering now (rough sketch below).
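A very rough sketch of what I mean by a separate prompt encoder with injection on both sides (module names, dimensions, and the residual cross-attention wiring are purely illustrative, not code from this repo):

    import torch
    import torch.nn as nn

    class PromptConditioner(nn.Module):
        """Illustrative sketch: a separate prompt encoder whose output is injected
        into both the text encoder and the flow decoder via cross-attention."""

        def __init__(self, latent_dim: int = 128, model_dim: int = 256, n_heads: int = 4):
            super().__init__()
            self.prompt_proj = nn.Linear(latent_dim, model_dim)
            self.prompt_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(model_dim, n_heads, batch_first=True),
                num_layers=2,
            )
            # one cross-attention block per injection point (encoder side, decoder side)
            self.enc_cross_attn = nn.MultiheadAttention(model_dim, n_heads, batch_first=True)
            self.dec_cross_attn = nn.MultiheadAttention(model_dim, n_heads, batch_first=True)

        def encode_prompt(self, prompt_latent: torch.Tensor) -> torch.Tensor:
            # prompt_latent: (B, T_prompt, latent_dim) EnCodec latent frames
            return self.prompt_encoder(self.prompt_proj(prompt_latent))

        def inject(self, hidden: torch.Tensor, prompt_hidden: torch.Tensor, side: str) -> torch.Tensor:
            # hidden: (B, T, model_dim) features from the text encoder or flow decoder
            attn = self.enc_cross_attn if side == "encoder" else self.dec_cross_attn
            out, _ = attn(hidden, prompt_hidden, prompt_hidden)
            return hidden + out  # residual injection of prompt information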
  1. OK, got it.
  2. Yes, I also trained at 24 kHz.
  3. Hmm, I'm not sure which of these would help speaker similarity: a YourTTS-style speaker cosine loss, a HierSpeech voice prompt, a NaturalSpeech 2 speaker prompt as conditioning, or more audio feature-extraction functions?
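For the first of those ideas, a rough sketch of a YourTTS-style speaker-consistency term (`speaker_model` stands in for any frozen speaker-verification encoder; nothing here is from this repo):

    import torch
    import torch.nn.functional as F

    def speaker_consistency_loss(ref_wav: torch.Tensor,
                                 gen_wav: torch.Tensor,
                                 speaker_model) -> torch.Tensor:
        """Penalize low cosine similarity between speaker embeddings of the
        reference audio and the generated audio."""
        with torch.no_grad():
            ref_emb = speaker_model(ref_wav)    # embedding of the reference
        gen_emb = speaker_model(gen_wav)        # gradients flow through generation
        return 1.0 - F.cosine_similarity(ref_emb, gen_emb, dim=-1).mean()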