collabora / WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper.

Home Page: https://collabora.github.io/WhisperSpeech/

fine-tuning.

HobisPL opened this issue · comments

Can you write more about training, e.g. what the dataset should look like? I see that you are from Poland; do you plan to add more Polish voices? The current model struggles with accents and style.

I don't have more Polish data that is permissively licensed. One thing I am looking forward to is adding more languages – hopefully this would improve performance on all languages, like it did for Whisper.

Sure, I understand. Will you provide any instructions on how to do fine-tuning and what the TXT/CSV file should look like? Is this a standard format?
`audio_file_name|text|speaker_name`
Alternatively, should I create a Google Colab notebook for this?
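For reference, the pipe-separated layout suggested above (the LJSpeech-style convention) could be parsed and sanity-checked with a few lines of Python. This is only a sketch of that hypothetical format; WhisperSpeech's actual training pipeline may expect something different:

```python
import csv

def parse_metadata(lines):
    """Parse LJSpeech-style metadata lines: audio_file_name|text|speaker_name.

    Raises ValueError on rows that don't have exactly three fields, so
    malformed entries are caught before training starts.
    """
    rows = []
    for fields in csv.reader(lines, delimiter="|"):
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {fields}")
        audio, text, speaker = (f.strip() for f in fields)
        rows.append({"audio": audio, "text": text, "speaker": speaker})
    return rows

sample = [
    "clips/0001.wav|Dzień dobry.|speaker_a",
    "clips/0002.wav|Hej, hur mår du?|speaker_b",
]
print(parse_metadata(sample)[0]["speaker"])  # speaker_a
```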

I'm interested in doing this for Swedish; I found some audiobooks I could use.
But I would be interested in what kind of hardware it requires, the expected training time, and so on.
Are there any resources on this?

I am working on writing down the full process for data preprocessing. It's a bit involved because we need to scale it to thousands of hours, but for smaller fine-tuning datasets someone should be able to put all of it into a single notebook with a reasonable runtime.

If I want to add a new language to WhisperSpeech, will fine-tuning achieve that? Also, is the audio in the dataset limited to a single speaker? It's difficult to find a 1000-hour dataset with only one speaker... If different speakers all speak the same language, will it work?

@jpc

Any update on this? How can I fine-tune if I have a Chinese audio dataset?

@jpc
Please also tell me the dataset requirements, as mentioned above. Thank you.