netease-youdao / EmotiVoice

EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine

How many data samples would I need to fine-tune a new voice with a stable prompt?

JacopoMangiavacchi opened this issue

Thank you very much for sharing the recipe for fine-tuning on the LJSpeech dataset. I'm wondering whether I can train a new voice with a smaller dataset. With other model architectures I was able to clone a voice using roughly 1 hour of training data. Would that be enough for EmotiVoice?

Thanks!

Yes, I believe that one hour of training data should be sufficient for EmotiVoice's Voice Cloning.
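As a rough sanity check of how much audio a dataset actually contains, a short Python sketch like the one below can sum the durations of the recordings. The directory name `my_voice_dataset/wavs` is just a placeholder, and the standard `wave` module only reads plain PCM WAV files.

```python
# Rough sanity check: how many hours of audio are in a folder of WAV files?
# Hypothetical helper -- adjust the path to wherever your recordings live.
import wave
from pathlib import Path

def total_duration_hours(wav_dir: str) -> float:
    """Sum the duration of all .wav files under wav_dir, in hours."""
    total_seconds = 0.0
    for wav_path in Path(wav_dir).rglob("*.wav"):
        with wave.open(str(wav_path), "rb") as wf:
            total_seconds += wf.getnframes() / wf.getframerate()
    return total_seconds / 3600.0

print(f"Total audio: {total_duration_hours('my_voice_dataset/wavs'):.2f} hours")
```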

Thank you! I've been fine-tuning a new voice, but I'm having trouble running inference with it. In step 5 of the LJSpeech fine-tuning recipe, when calling python inference_am_vocoder_exp.py, the --logdir parameter is not shown, yet it appears to be a mandatory argument for the script. I'm confused about what value to pass here.

It looks like I can pass '.' to --logdir so that the right path gets concatenated, but then the script complains about a missing config.json file in the exp/LJspeech/tmp/ folder. I can't find this config.json file. What should it contain?

'logdir' is a required argument for 'inference_am_vocoder_joint.py', but it is not utilized in 'inference_am_vocoder_exp.py'.
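For what it's worth, that behavior is consistent with an argparse setup along these lines (a hypothetical sketch, not the actual EmotiVoice code), which is why a placeholder value such as '.' satisfies the parser:

```python
# Hypothetical sketch of a script that declares --logdir as required
# even though the value is never read afterwards (not EmotiVoice's code).
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--logdir", required=True,
                    help="declared required, but not used by this script")
args = parser.parse_args()

# Inference would proceed here without ever touching args.logdir,
# so any placeholder value (e.g. '.') is accepted.
print(f"--logdir received (ignored): {args.logdir}")
```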

Thank you again @syq163. I was able to run inference using the WangZeJun/simbert-base-chinese BERT features; I see the script downloads these directly from the Hugging Face repo. I had only found the content and style subfolders in the exp/LJspeech/tmp/ folder.
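If anyone wants to cache those BERT weights ahead of time instead of letting the script fetch them at run time, a sketch along these lines should work, assuming the `transformers` package is installed (the local directory name is just an example):

```python
# Sketch: pre-download the simbert-base-chinese model from the
# Hugging Face Hub so later runs can reuse the local cache/copy.
from transformers import AutoModel, AutoTokenizer

model_id = "WangZeJun/simbert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Optional: save a local copy (example directory name).
tokenizer.save_pretrained("./simbert-base-chinese")
model.save_pretrained("./simbert-base-chinese")
```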