Hello. Are pretrained models released to generate text to speech?
FurkanGozukara opened this issue · comments
Could you inform me about this? Thank you.
Do you have any information?
Thank you for having an interest in our work! And, sorry for my late reply.
We're thinking of releasing a pretrained checkpoint, but we're now conducting additional experiments. For example, we're investigating whether we can get better results by training a model for more than 1M steps. We will make a decision after we finish them (maybe in one or two months). Sorry for keeping you waiting.
Thank you. Without a pretrained model this repo is unfortunately not usable :/ If you decide to release the model, please also make a simple Gradio interface.
We've just released pretrained models. We'd be glad if they're useful to you. Thanks!
Thank you so much, amazing!
Can we fine-tune it on a new voice to generate that voice? How many steps would it take, and how hard would it be?
@FurkanGozukara
We've never fine-tuned a pretrained BigVSAN model, so I have no idea how many steps are necessary. The following is just my guess.
BigVGAN, which our BigVSAN is built on, was proposed as a universal vocoder. This means it can generate voices of unseen speakers quite well without fine-tuning. Actually, our evaluation shows BigVSAN can also generate voices of unseen speakers very well without fine-tuning. You can check that on our demo page, where every voice provided is from an unseen speaker. But, of course, fine-tuning will enable a model to generate higher-fidelity speech of a specific speaker. How many steps are required depends on the size of your dataset, but 0.1M may be sufficient because a pretrained model can already generate voices of unseen speakers quite well.
Best regards.
OK, 100k steps sounds great.
Is there any tutorial or documentation on how to fine-tune and prepare training data?
My aim is to obtain ElevenLabs quality with fine-tuning.
- Put the pretrained model files `g_10000000` and `do_10000000` at `exp/bigvsan`.
- Create a directory for your data, e.g. `YourData`.
- Make filename lists imitating `LibriTTS/train-full.txt` and `LibriTTS/val-full.txt`. If you don't have a validation split, making only a training file list and copying it is OK.
- Place the filename lists in your created directory `YourData`.
- Modify the following lines in `train.py` (lines 399 to 400 at commit ea179e8) as follows:

  ```
  parser.add_argument('--list_input_unseen_wavs_dir', default=[])
  parser.add_argument('--list_input_unseen_validation_file', default=[])
  ```

- Run the following script (almost the same as https://github.com/sony/bigvsan#training):

  ```
  python train.py \
  --config configs/bigvsan_24khz_100band.json \
  --input_wavs_dir YourData \
  --input_training_file YourData/train-full.txt \
  --input_validation_file YourData/val-full.txt \
  --checkpoint_path exp/bigvsan
  ```

  If you don't have a validation split, adding the `--debug True` option makes training faster.

- You will get a fine-tuned generator `g_10100000` (and `do_10100000`) after 100k steps.
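The filename-list step above can be scripted. The following is a minimal sketch (this helper is not part of the BigVSAN repo), assuming each line of the list holds one wav filename relative to `--input_wavs_dir`, imitating `LibriTTS/train-full.txt`:

```python
import os

def write_filelists(data_dir, out_dir, val_fraction=0.05):
    """Hypothetical helper: build train-full.txt / val-full.txt for YourData."""
    wavs = sorted(f for f in os.listdir(data_dir) if f.endswith(".wav"))
    n_val = max(1, int(len(wavs) * val_fraction))
    val, train = wavs[:n_val], wavs[n_val:]
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "train-full.txt"), "w") as f:
        f.writelines(w + "\n" for w in train)
    # No held-out data? Copying the training list is OK, per the steps above.
    with open(os.path.join(out_dir, "val-full.txt"), "w") as f:
        f.writelines(w + "\n" for w in (val or train))
    return train, val
```

Check the generated lists against the actual `LibriTTS/train-full.txt` in the repo before training, since the exact line format (e.g. with or without the `.wav` extension) is an assumption here.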
Please note that this implementation is provided primarily for research purposes. There should be room for improvement in the implementation from a practitioner's perspective. Anyway, thank you for your interest in our work!
Thank you so much! A few questions:
The `YourData` folder will contain wav files. Is there any required format for them? For example, should they be between 5 and 15 seconds long? Anything else?
I assume `LibriTTS/train-full.txt` will contain only the names of the files inside the `YourData` folder, like this:
a (1).wav
a (2).wav
a (3).wav
a (4).wav
Do we need transcriptions of the wav files, or just the audio files with no text files? Thank you so much for your answers.
By the way, I just noticed something: your model can't generate voice directly from text, right?
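That observation matches what a vocoder does: it turns mel spectrograms into waveforms, so a separate acoustic model (text to mel) is needed for end-to-end TTS. The sketch below illustrates where a vocoder sits in a typical pipeline; both functions are illustrative stand-ins, not the BigVSAN API:

```python
import numpy as np

def acoustic_model(text: str) -> np.ndarray:
    """Stage 1 (NOT provided by this repo): text -> mel spectrogram."""
    n_frames = max(1, len(text))       # dummy: one frame per character
    return np.zeros((100, n_frames))   # 100 mel bands, as in the 100band config

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Stage 2 (the role a vocoder like BigVSAN plays): mel -> waveform."""
    hop_length = 256                   # dummy upsampling factor
    return np.zeros(mel.shape[1] * hop_length)

mel = acoustic_model("hello world")
audio = vocoder(mel)                   # a neural vocoder replaces this stage only
```

In other words, the pretrained checkpoints here cover only the second stage; pairing them with any text-to-mel model gives full text-to-speech.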
Sorry, but where are the pretrained weights?
@ex3ndr Thank you for your interest!
The details are here: https://github.com/sony/bigvsan?tab=readme-ov-file#pretrained-models. You can download a pretrained model here: https://zenodo.org/records/10037439.