huawei-noah / Speech-Backbones

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

Fine-tuning / Transfer Learning

williamluer opened this issue · comments

Is it possible in this repo to fine-tune a single or multi speaker model from a base pre-trained model?

Yes, and it is working well.

Is there any functionality within this repo to support that?

I added code that trains a new multi-speaker (or single-speaker) model using a checkpoint from a pretrained multi-speaker model. The pretrained model has a different number of speakers and speaker identities, so I drop the layers associated with the speaker embeddings.
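For reference, the checkpoint surgery I mean looks roughly like the sketch below. The checkpoint path is illustrative, and `spk_emb` is my assumption about the key prefix of the speaker-embedding parameters, so check it against the keys actually present in your checkpoint:

```python
import torch

# Load the pretrained multi-speaker checkpoint (path is illustrative).
state_dict = torch.load('checkpts/grad-tts-libri-tts.pt', map_location='cpu')

# Print anything speaker-related to confirm the real key prefix first.
print([k for k in state_dict if 'spk' in k])

# Drop every parameter tied to the old speaker-embedding table, since the
# fine-tuning data has a different number of speakers and identities.
# 'spk_emb' is an assumed prefix; adjust it to match the printed keys.
filtered = {k: v for k, v in state_dict.items() if not k.startswith('spk_emb')}

# The filtered weights can then be loaded into a freshly constructed GradTTS
# (built as in train.py, with n_spks set for the new dataset) via
# model.load_state_dict(filtered, strict=False), leaving the new speaker
# embedding randomly initialised.
```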

Unfortunately, I haven't been able to get results comparable to the audio clips shared on the GradTTS website: https://grad-tts.github.io/. I only have 5-15 minutes of data for ~10 speakers. I'm not sure if my implementation is incorrect in some way or if diffusion models require more data than that to get good results.

We didn't drop the embedding layer; instead, for any new speaker we set a speaker ID in the range (0, 256), and for each new speaker we had ~6-7 minutes of audio.

Does setting a speaker ID mean that you overwrite the identity of a speaker from the dataset that the model was originally trained on or do you concatenate a new speaker to the embedding layer?

The first one, overwriting.
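So each new speaker is simply assigned one of the existing embedding slots, and that row of the table gets overwritten during fine-tuning. A rough sketch of how the filelist for that could be built is below; the names, IDs, and paths are made-up examples, and the pipe-separated `path|text|speaker_id` line format is what I believe the multi-speaker recipe in this repo expects, so double-check it against the files in `resources/filelists`:

```python
# Give each fine-tuning speaker an ID inside the pretrained embedding range
# [0, 256); the corresponding rows of the embedding table simply get
# overwritten during fine-tuning. Names, IDs, and paths are made-up examples.
new_speakers = {'alice': 17, 'bob': 42}

samples = [
    ('wavs/alice_0001.wav', 'Hello there.', 'alice'),
    ('wavs/bob_0001.wav', 'Good morning.', 'bob'),
]

# Write a filelist in the <wav path>|<transcript>|<speaker id> format.
with open('resources/filelists/finetune_train.txt', 'w', encoding='utf-8') as f:
    for wav_path, text, speaker in samples:
        f.write(f'{wav_path}|{text}|{new_speakers[speaker]}\n')
```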

Perfect, thank you for the help!

Sorry to reopen this issue... but how exactly can one load a checkpoint into the training process and fine-tune it using audio samples from a different single-speaker dataset? I wanted to try fine-tuning on one of my own single-speaker datasets starting from the provided pretrained LJSpeech checkpoint, but I'm not an expert in modifying the base code and got a little confused with the process. Do I make an LJSpeech folder but replace its contents with the audio/transcription files from the dataset I want to fine-tune over the LJSpeech checkpoint?
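In case it helps: for the single-speaker case, fine-tuning essentially means pointing the filelists in `params.py` at your own transcripts and warm-starting from the LJSpeech checkpoint before the training loop in `Grad-TTS/train.py`. A rough sketch of the warm-start part is below; the constructor call mirrors what I remember of `train.py`, and `checkpts/grad-tts.pt` is an assumed download location, so verify both against your copy of the repo:

```python
import torch
import params
from model import GradTTS
from text.symbols import symbols

# Build the single-speaker model the same way Grad-TTS/train.py does
# (double-check this argument list against your copy of train.py).
nsymbols = len(symbols) + 1 if params.add_blank else len(symbols)
model = GradTTS(nsymbols, 1, None, params.n_enc_channels, params.filter_channels,
                params.filter_channels_dp, params.n_heads, params.n_enc_layers,
                params.enc_kernel, params.enc_dropout, params.window_size,
                params.n_feats, params.dec_dim, params.beta_min, params.beta_max,
                pe_scale=1000).cuda()

# Warm-start from the released LJSpeech checkpoint before the training loop,
# then train as usual with the filelist paths in params.py pointing at the
# new single-speaker transcripts.
model.load_state_dict(torch.load('checkpts/grad-tts.pt', map_location='cpu'))
```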

@saeed5959 I am able to fine-tune with about 20 minutes of unseen-speaker data by overwriting the speaker encoder, but I'm not able to achieve the quality so far. What's the recommended number of epochs for fine-tuning, or any other hyperparameter recommendations you have here? Thanks

You should use a multi-speaker pretrained model that has been trained on different speakers and has learned how to generate different kinds of voices; fine-tuning on the LJSpeech pretrained model is not going to give you a good result.
500 epochs would be enough.
The length of your clips should be kept under 10 seconds.
The batch size should be kept small: 4 or 6 (see the sketch after this list).
One of the best approaches is to use mixed training.
As a last point, instead of this model you can use the VITS model.
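Putting those recommendations into numbers, a fine-tuning configuration might look like the sketch below. These are assumed overrides of the defaults in `Grad-TTS/params.py`, not values taken from the repo, and `soundfile` is an extra dependency used here only to check clip lengths:

```python
import soundfile as sf  # not a Grad-TTS dependency; used only for clip lengths

# Assumed fine-tuning overrides reflecting the recommendations above.
n_epochs = 500         # ~500 epochs reported to be enough for fine-tuning
batch_size = 4         # small batches (4 or 6) suggested for small datasets
max_clip_seconds = 10  # keep individual utterances under ~10 seconds

def short_enough(wav_path, limit_s=max_clip_seconds):
    """Return True if a clip is short enough to keep for fine-tuning."""
    return sf.info(wav_path).duration <= limit_s
```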

@saeed5959 Are you suggesting VITS from Coqui for fine-tuning, which is an end-to-end model, instead of GradTTS? I tried fine-tuning the VCTK model, but the resulting audio has bad pronunciation and some noise compared to GradTTS. Hope to hear from you soon.

-Sagar

https://github.com/jaywalnut310/vits
You also need to train it yourself to get a pretrained discriminator; they haven't provided one.