sony / bigvsan

PyTorch implementation of BigVSAN


Hello. Have pretrained models been released for generating text-to-speech?

FurkanGozukara opened this issue

Could you inform me about this? Thank you.

@FurkanGozukara

Do you have any information?

Thank you for having an interest in our work! And, sorry for my late reply.
We're thinking of releasing a pretrained checkpoint, but we're now conducting additional experiments. For example, we're investigating whether we can get better results by training a model for more than 1M steps. We will make a decision after we finish them (maybe in one or two months). Sorry for keeping you waiting.


Thank you. Without a pretrained model this repo is unfortunately useless :/ If you do decide to release a model, please also make a simple Gradio interface.

We've just released pretrained models. We'd be glad if this would be useful. Thanks!


Thank you so much, amazing!

Can we fine-tune a voice on it to generate that new voice? How many steps would it take, and how hard would it be?

@FurkanGozukara
We've never fine-tuned a pretrained BigVSAN model, so I have no idea how many steps are necessary. The following is just my guess.

BigVGAN, which our BigVSAN is built on, was proposed as a universal vocoder. This means it can generate voices of unseen speakers quite well without fine-tuning. Indeed, our evaluation shows that BigVSAN can also generate voices of unseen speakers very well without fine-tuning. You can check that on our demo page, where every voice provided is from an unseen speaker. But, of course, fine-tuning will enable a model to generate higher-fidelity speech for a specific speaker. How many steps are required depends on the size of your dataset, but 0.1M may be sufficient because a pretrained model can already generate voices of unseen speakers quite well.

Best regards.


OK, 100k steps sounds great.

Is there any tutorial or documentation about how to fine-tune and prepare training data?

My aim is to obtain ElevenLabs quality with fine-tuning.

  1. Put the pretrained model files g_10000000 and do_10000000 at exp/bigvsan.

  2. Create a directory for your data, e.g. YourData

  3. Make filename lists imitating LibriTTS/train-full.txt and LibriTTS/val-full.txt. If you don't have a validation split, making only a training file list and copying it is OK. (A small sketch for generating these lists appears after these steps.)

  4. Place the filename lists in your created directory YourData.

  5. Modify the following lines in bigvsan/train.py (lines 399 to 400 at ea179e8)

    parser.add_argument('--list_input_unseen_wavs_dir', nargs='+', default=['LibriTTS', 'LibriTTS'])
    parser.add_argument('--list_input_unseen_validation_file', nargs='+', default=['LibriTTS/dev-clean.txt', 'LibriTTS/dev-other.txt'])

    as follows:

    parser.add_argument('--list_input_unseen_wavs_dir', default=[])
    parser.add_argument('--list_input_unseen_validation_file', default=[])
  6. Run the following script (almost the same as https://github.com/sony/bigvsan#training):
python train.py \
--config configs/bigvsan_24khz_100band.json \
--input_wavs_dir YourData \
--input_training_file YourData/train-full.txt \
--input_validation_file YourData/val-full.txt \
--checkpoint_path exp/bigvsan

If you don't have a validation split, adding the --debug True option makes training faster.

  7. You will get a fine-tuned generator g_10100000 (and do_10100000) after 100k steps.

Please note that this implementation is provided primarily for research purposes; there is likely room for improvement from a practitioner's perspective. Anyway, thank you for your interest in our work!
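
As a rough illustration of steps 3 and 4 (this helper is not part of the repo; it assumes each list line is just a wav filename relative to --input_wavs_dir, so compare with LibriTTS/train-full.txt to confirm the exact format):

# Illustrative only: build the filename lists for YourData (step 3),
# assuming one wav filename per line, relative to --input_wavs_dir.
cd YourData
ls *.wav > train-full.txt
# No validation split? Copying the training list is OK (as noted in step 3).
cp train-full.txt val-full.txt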

Regarding step 3 (LibriTTS/train-full.txt):

Thank you so much. Here are a few questions:

In the YourData folder there will be wav files. Is there any required format for them? For example, should they be between 5 and 15 seconds? Anything else?


I assume LibriTTS/train-full.txt will contain the names of the files inside the YourData folder and nothing else, just like:

a (1).wav
a (2).wav
a (3).wav
a (4).wav

Do we need transcriptions of the wav files, or just the speech files with no text? Thank you so much for the answers.

By the way, I just noticed something. Your model can't generate voice from text, right?

Sorry, but where are the pretrained weights?

@ex3ndr Thank you for your interest!

The details are here: https://github.com/sony/bigvsan?tab=readme-ov-file#pretrained-models. You can download a pretrained model here: https://zenodo.org/records/10037439.
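
If it helps, here is a rough sketch of putting a downloaded checkpoint in place and trying it out. The checkpoint file names are examples (see the Zenodo record for the actual ones), and the inference.py flags assume BigVSAN keeps the BigVGAN-style interface it is built on; check the README linked above for the exact usage:

mkdir -p exp/bigvsan
# Download the checkpoint files (e.g. g_10000000) from
# https://zenodo.org/records/10037439 into exp/bigvsan/, then:
python inference.py \
--checkpoint_file exp/bigvsan/g_10000000 \
--input_wavs_dir YourData \
--output_dir generated_audio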