Gadersd / whisper-burn

A Rust implementation of OpenAI's Whisper model using the burn framework

Muffled Text - Response for Custom Audio

coderaidershaun opened this issue

Hi there, first of all, awesome library and thank you for making it.

When I record a short, clear .wav file saying "this is a test, this is a test" (link below), the waveform_to_text function does not decode it successfully. It usually returns something like "muffled" as the decoded text. I have tried the medium-sized model too. However, it does work for the audio.wav file provided in the repo example.

I have spent a lot of time trying to work out why this fails, including comparing the metadata of both files and recording new audio files from different sources (in case the problem was specific to my own machine).

Do you know the exact requirements a .wav file must meet for the library to transcribe it successfully?

Here is the audio file which is failing: https://drive.google.com/file/d/1aaWL-mBrRaGtFvL_Re8r4WVtsMP3BmAI/view?usp=sharing

Again, this works great with the example audio file provided in the docs, but just not with any new custom file I record.

Found the issue thanks to some of the other threads here.

It is important that the audio file has a sample rate of 16 kHz and is mono (single channel). All the software I used to record audio used a sample rate of 44.1 kHz or higher (for example, DaVinci Resolve does not let you record at a lower sample rate). So I cheated and asked ChatGPT's code interpreter to convert a mono file to 16 kHz for me. With the converted file, the model picked up some of the words.
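For anyone who would rather do the conversion locally, here is a minimal sketch of the two steps described above (downmix to mono, then change the sample rate). It works on already-decoded `f32` samples and uses only the standard library. The function names `downmix_to_mono` and `resample_linear` are hypothetical helpers written for this example and are not part of whisper-burn. Linear interpolation is used purely for brevity: it applies no low-pass filter, so it can introduce aliasing when downsampling, and a dedicated resampler will likely give cleaner results.

```rust
/// Average interleaved channels into a single mono channel.
/// (Hypothetical helper, not part of whisper-burn.)
fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Resample mono audio from `from_hz` to `to_hz` by linear interpolation.
/// No anti-aliasing filter is applied, so treat this as a rough sketch.
fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    let step = from_hz as f64 / to_hz as f64; // input samples per output sample
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = pos as usize;
            let frac = (pos - idx as f64) as f32;
            // Clamp indices so the final output sample never reads past the end.
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}

fn main() {
    // One second of placeholder stereo audio at 44.1 kHz (interleaved L, R).
    let stereo: Vec<f32> = (0..44_100).flat_map(|_| [0.5_f32, -0.5]).collect();
    let mono = downmix_to_mono(&stereo, 2);
    // The target rate is a parameter; 16_000 matches the rate named above.
    let resampled = resample_linear(&mono, 44_100, 16_000);
    println!("{} -> {} samples", mono.len(), resampled.len());
}
```

The target rate is left as an argument so that other rates can be tried without changing the code.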

UPDATE: Using mono and 22050Hz sample rate (to match the example audio file) worked best.
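To check what a recording actually contains before passing it to the model, the channel count and sample rate can be read straight from the WAV header. The sketch below scans the RIFF chunks for the `fmt ` chunk using only the standard library. `WavFormat` and `read_wav_format` are hypothetical names for this example, not whisper-burn APIs, and the parser assumes an ordinary little-endian RIFF/WAVE file.

```rust
use std::fs;

/// The header fields relevant to this thread. (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq)]
struct WavFormat {
    channels: u16,
    sample_rate: u32,
    bits_per_sample: u16,
}

fn le_u16(b: &[u8], at: usize) -> u16 {
    u16::from_le_bytes([b[at], b[at + 1]])
}

fn le_u32(b: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([b[at], b[at + 1], b[at + 2], b[at + 3]])
}

/// Find the "fmt " chunk of a RIFF/WAVE file and return its key fields.
/// Returns None if the data is not a WAVE file or the chunk is missing/truncated.
fn read_wav_format(bytes: &[u8]) -> Option<WavFormat> {
    if bytes.len() < 12 || &bytes[0..4] != b"RIFF" || &bytes[8..12] != b"WAVE" {
        return None;
    }
    let mut pos = 12; // first chunk starts right after "RIFF<size>WAVE"
    while pos + 8 <= bytes.len() {
        let size = le_u32(bytes, pos + 4) as usize;
        let body = pos + 8;
        if &bytes[pos..pos + 4] == b"fmt " {
            if size < 16 || body + 16 > bytes.len() {
                return None;
            }
            return Some(WavFormat {
                channels: le_u16(bytes, body + 2),
                sample_rate: le_u32(bytes, body + 4),
                bits_per_sample: le_u16(bytes, body + 14),
            });
        }
        // Chunks are word-aligned: an odd-sized chunk is followed by a pad byte.
        pos = body + size + (size & 1);
    }
    None
}

fn main() {
    // Pass the path to a .wav file as the first argument.
    if let Some(path) = std::env::args().nth(1) {
        match fs::read(&path) {
            Ok(bytes) => match read_wav_format(&bytes) {
                Some(fmt) => println!("{path}: {fmt:?}"),
                None => eprintln!("{path}: no WAV fmt chunk found"),
            },
            Err(err) => eprintln!("{path}: {err}"),
        }
    }
}
```

Running it on both the repo's example file and a custom recording makes any mismatch in channels or sample rate visible at a glance.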