huawei-noah / Speech-Backbones

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

Fine-tuning / Transfer Learning

williamluer opened this issue · comments

Is it possible in this repo to fine-tune a single or multi speaker model from a base pre-trained model?

Yes, and it is working well.

Is there any functionality within this repo to support that?

I added code that trains a new multi-speaker (or single-speaker) model using a checkpoint from a pretrained multi-speaker model. The pretrained model has a different number of speakers and speaker identities, so I drop the layers associated with the speaker embeddings.
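For reference, the checkpoint surgery I mean looks roughly like the sketch below. The checkpoint path is illustrative, and `spk_emb` is my assumption about the key prefix of the speaker-embedding parameters, so check it against the keys actually present in your checkpoint:

```python
import torch

# Load the pretrained multi-speaker checkpoint (path is illustrative).
state_dict = torch.load('checkpts/grad-tts-libri-tts.pt', map_location='cpu')

# Print anything speaker-related to confirm the real key prefix first.
print([k for k in state_dict if 'spk' in k])

# Drop every parameter tied to the old speaker-embedding table, since the
# fine-tuning data has a different number of speakers and identities.
# 'spk_emb' is an assumed prefix; adjust it to match the printed keys.
filtered = {k: v for k, v in state_dict.items() if not k.startswith('spk_emb')}

# The filtered weights can then be loaded into a freshly constructed GradTTS
# (built as in train.py, with n_spks set for the new dataset) via
# model.load_state_dict(filtered, strict=False), leaving the new speaker
# embedding randomly initialised.
```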

Unfortunately, I haven't been able to get results comparable to the audio clips shared on the GradTTS website: https://grad-tts.github.io/. I only have 5-15 minutes of data for ~10 speakers. I'm not sure if my implementation is incorrect in some way or if diffusion models require more data than that to get good results.

We didn't drop the embedding layer; instead, for any new speaker we set a speaker ID in the range (0, 256), and for each new speaker we had ~6-7 minutes of audio.

Does setting a speaker ID mean that you overwrite the identity of a speaker from the dataset that the model was originally trained on or do you concatenate a new speaker to the embedding layer?

The first one, overwriting.
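So each new speaker is simply assigned one of the existing embedding slots, and that row of the table gets overwritten during fine-tuning. A rough sketch of how the filelist for that could be built is below; the names, IDs, and paths are made-up examples, and the pipe-separated `path|text|speaker_id` line format is what I believe the multi-speaker recipe in this repo expects, so double-check it against the files in `resources/filelists`:

```python
# Give each fine-tuning speaker an ID inside the pretrained embedding range
# [0, 256); the corresponding rows of the embedding table simply get
# overwritten during fine-tuning. Names, IDs, and paths are made-up examples.
new_speakers = {'alice': 17, 'bob': 42}

samples = [
    ('wavs/alice_0001.wav', 'Hello there.', 'alice'),
    ('wavs/bob_0001.wav', 'Good morning.', 'bob'),
]

# Write a filelist in the <wav path>|<transcript>|<speaker id> format.
with open('resources/filelists/finetune_train.txt', 'w', encoding='utf-8') as f:
    for wav_path, text, speaker in samples:
        f.write(f'{wav_path}|{text}|{new_speakers[speaker]}\n')
```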

Perfect, thank you for the help!

Sorry to reopen this issue... but how exactly can one load a checkpoint into the training process and fine-tune it using audio samples from a different single-speaker dataset? I wanted to try fine-tuning on one of my own single-speaker datasets starting from the provided pretrained LJSpeech checkpoint, but I'm not an expert in modifying the base code and got a little confused with the process. Do I make an LJSpeech folder but replace its contents with the audio/transcription files from the dataset I want to fine-tune over the LJSpeech checkpoint?
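In case it helps: for the single-speaker case, fine-tuning essentially means pointing the filelists in `params.py` at your own transcripts and warm-starting from the LJSpeech checkpoint before the training loop in `Grad-TTS/train.py`. A rough sketch of the warm-start part is below; the constructor call mirrors what I remember of `train.py`, and `checkpts/grad-tts.pt` is an assumed download location, so verify both against your copy of the repo:

```python
import torch
import params
from model import GradTTS
from text.symbols import symbols

# Build the single-speaker model the same way Grad-TTS/train.py does
# (double-check this argument list against your copy of train.py).
nsymbols = len(symbols) + 1 if params.add_blank else len(symbols)
model = GradTTS(nsymbols, 1, None, params.n_enc_channels, params.filter_channels,
                params.filter_channels_dp, params.n_heads, params.n_enc_layers,
                params.enc_kernel, params.enc_dropout, params.window_size,
                params.n_feats, params.dec_dim, params.beta_min, params.beta_max,
                pe_scale=1000).cuda()

# Warm-start from the released LJSpeech checkpoint before the training loop,
# then train as usual with the filelist paths in params.py pointing at the
# new single-speaker transcripts.
model.load_state_dict(torch.load('checkpts/grad-tts.pt', map_location='cpu'))
```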

@saeed5959 I am able to fine-tune with about 20 minutes of unseen-speaker data by overwriting the speaker encoder, but I'm not able to achieve the quality so far. What's the recommended number of epochs for fine-tuning, or any other hyperparameter recommendations you have here? Thanks

You should use a multi-speaker pretrained model that has been trained on different speakers and has learned how to generate different kinds of voices; fine-tuning on the LJSpeech pretrained model is not going to give you a good result.
500 epochs would be enough.
The length of your clips should be kept under 10 seconds.
The batch size should be kept small: 4 or 6 (see the sketch after this list).
One of the best approaches is to use mixed training.
As a last point, instead of this model you can use the VITS model.
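Putting those recommendations into numbers, a fine-tuning configuration might look like the sketch below. These are assumed overrides of the defaults in `Grad-TTS/params.py`, not values taken from the repo, and `soundfile` is an extra dependency used here only to check clip lengths:

```python
import soundfile as sf  # not a Grad-TTS dependency; used only for clip lengths

# Assumed fine-tuning overrides reflecting the recommendations above.
n_epochs = 500         # ~500 epochs reported to be enough for fine-tuning
batch_size = 4         # small batches (4 or 6) suggested for small datasets
max_clip_seconds = 10  # keep individual utterances under ~10 seconds

def short_enough(wav_path, limit_s=max_clip_seconds):
    """Return True if a clip is short enough to keep for fine-tuning."""
    return sf.info(wav_path).duration <= limit_s
```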

@saeed5959 Are you suggesting VITS from Coqui for fine-tuning, which is an end-to-end model, instead of GradTTS? I tried fine-tuning the VCTK model, but the resulting audio has bad pronunciation and some noise compared to GradTTS. Hope to hear from you soon.

-Sagar

https://github.com/jaywalnut310/vits
You also need to train it yourself to get a pretrained discriminator; they haven't provided one.