krylm / whisper-event-tuning

Final training script from the Hugging Face Whisper Fine-Tuning Event, used to get the best results from the fine-tuned model.

Whisper Fine-Tuning Event 2022 - script modifications

This is the final setup used to train the best fine-tuned Whisper model during the Hugging Face Fine-Tuning Event 2022.

DeepSpeed

The first modification was to use DeepSpeed, which makes a larger batch_size possible without relying on gradient_accumulation_steps.

To make it run inside Docker, I used the guide from Zihao's blog post.
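As a rough illustration (not the exact configuration used in this repo), DeepSpeed can be enabled through the Hugging Face Trainer by pointing the training arguments at a DeepSpeed JSON config; the file name ds_config.json and the numbers below are placeholders:

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical values - the real batch size and config depend on the GPU setup.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",
    per_device_train_batch_size=64,   # larger batch made possible by DeepSpeed ZeRO
    gradient_accumulation_steps=1,    # no gradient accumulation needed
    fp16=True,
    deepspeed="ds_config.json",       # path to the DeepSpeed config file
)
```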

Concatenation of the input dataset

The idea came from Bayar. Whisper works on 30-second input segments, but each Common Voice sample contains only around 3-5 seconds of audio. We can concatenate the audio and text of several samples into fewer, longer samples, so the model learns from denser data: training runs faster and the model gets more signal from each sample. A sketch of the idea follows below.
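A minimal sketch of the concatenation step, assuming 16 kHz Common Voice audio loaded with the datasets library; the 30-second budget matches Whisper's input window, and the function and column names here are illustrative, not the exact code in this repo:

```python
import numpy as np

MAX_SECONDS = 30
SAMPLE_RATE = 16_000
MAX_SAMPLES = MAX_SECONDS * SAMPLE_RATE

def concatenate_examples(batch):
    """Greedily merge short audio/text pairs until the 30-second window is full."""
    out_audio, out_text = [], []
    cur_audio, cur_text, cur_len = [], [], 0
    for audio, text in zip(batch["audio"], batch["sentence"]):
        arr = audio["array"]
        # Flush the current group if adding this clip would exceed 30 seconds.
        if cur_audio and cur_len + len(arr) > MAX_SAMPLES:
            out_audio.append(np.concatenate(cur_audio))
            out_text.append(" ".join(cur_text))
            cur_audio, cur_text, cur_len = [], [], 0
        cur_audio.append(arr)
        cur_text.append(text)
        cur_len += len(arr)
    if cur_audio:
        out_audio.append(np.concatenate(cur_audio))
        out_text.append(" ".join(cur_text))
    return {"audio_array": out_audio, "sentence": out_text}

# Applied with a batched map so several short clips can be merged per call:
# dataset = dataset.map(concatenate_examples, batched=True, batch_size=32,
#                       remove_columns=dataset.column_names)
```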

Other ideas

Based on details about how the Large-v2 model was trained in the Whisper paper, I have a few ideas to try as next steps.

Thanks for the Whisper Fine-Tuning Event 2022.
