Gadersd / whisper-burn

A Rust implementation of OpenAI's Whisper model using the burn framework

Garbage output on multi-channel audio and audio above 24 kHz

Quackdoc opened this issue

It seems the audio decoder is picky about what gets fed to it.

Audio mediainfo

General
Complete name                            : C:\Users\Quack\code\whisper-burn\slap.wav
Format                                   : Wave
File size                                : 788 KiB
Duration                                 : 4 s 203 ms
Overall bit rate mode                    : Constant
Overall bit rate                         : 1 536 kb/s
Writing application                      : Lavf58.29.100

Audio
Format                                   : PCM
Format settings                          : Little / Signed
Codec ID                                 : 1
Duration                                 : 4 s 203 ms
Bit rate mode                            : Constant
Bit rate                                 : 1 536 kb/s
Channel(s)                               : 2 channels
Sampling rate                            : 48.0 kHz
Bit depth                                : 16 bits
Stream size                              : 788 KiB (100%)

Audio file: https://cdn.discordapp.com/attachments/615105639567589376/1141946730485665893/slap.wav

target\release\whisper.exe .\slap.wav small_en        08/18/2023 12:07:51 AM
Loading waveform...
Loading model...
Chunk 0:  (screaming)

Chunk 1:  (screeching)

Transcribed text:  (screeching)

whisper-ctranslate2:

whisper-ctranslate2.exe slap.wav --model tiny.en      08/18/2023 12:10:23 AM
Detected language 'English' with probability 1.000000
[00:00.000 --> 00:04.000]  Also, it's not always useful.
Transcription results written to 'C:\Users\Quack\code\whisper-burn' directory

EDIT: Transcoding the audio file with ffmpeg -i .\slap.wav -ar SAMPLE_RATE -ac 1 slap-edit.wav seems to make it work. The input needs to be both single-channel and 41 kHz or less.

At 41 kHz the output was:

Chunk 0:  Oh, son, it's not all you are.

Transcribed text:  Oh, son, it's not all you are.

At 24 kHz and below it is:

Chunk 0:  also it's not always useful.

Transcribed text:  also it's not always useful

The Whisper model itself expects 16 kHz mono.

Ah, that makes sense. I would assume burn doesn't do any downsampling of the sample rate or any channel downmixing.
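For anyone following along, the channel downmixing step can be sketched in plain Rust as below. This is only an illustration, not what whisper-burn does: the function name `downmix_to_mono` is made up here, and it assumes the decoded samples are interleaved `f32` frames (L0, R0, L1, R1, ... for stereo). It simply averages the channels of each frame.

```rust
/// Downmix interleaved PCM samples to mono by averaging the channels of each frame.
/// `downmix_to_mono` is a hypothetical helper, not part of whisper-burn.
/// Assumes `samples` is interleaved, e.g. [L0, R0, L1, R1, ...] for stereo.
/// A trailing partial frame, if any, is dropped.
pub fn downmix_to_mono(samples: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    samples
        .chunks_exact(channels) // one chunk per frame
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}
```

Averaging keeps the result within the original amplitude range, at the cost of cancelling content that is out of phase between the channels.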

This is partially addressed by 4080a33, but if I get the time I plan on looking into resampling and channel downmixing. I do have some work done; however, I was using dasp, which has proven itself to be rather unusable, so I'm looking into different crates.

I looked into fon and it seems like it may work, but I don't like that it hasn't been active since Feb '22.

Currently looking into other crates.
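To make the resampling half concrete, here is a minimal sketch using naive linear interpolation on mono samples. It is not taken from any of the crates mentioned in this thread, and the name `resample_linear` is made up for illustration. It applies no low-pass filter, so going from 48 kHz down to 16 kHz can alias. A proper resampler would filter first, so treat this only as a picture of the index arithmetic.

```rust
/// Resample mono audio from `from_hz` to `to_hz` using linear interpolation.
/// `resample_linear` is a hypothetical helper, not an API of any crate.
/// NOTE: no anti-aliasing filter is applied, so downsampling may alias.
pub fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    assert!(from_hz > 0 && to_hz > 0, "sample rates must be non-zero");
    if input.is_empty() {
        return Vec::new();
    }
    // Number of output samples that covers the same duration as the input.
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    // Distance between consecutive output samples, measured in input samples.
    let step = from_hz as f64 / to_hz as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            // Clamp so the final samples interpolate against the last input value.
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}
```

With a 48 kHz input and a 16 kHz target, `step` is 3.0, so this picks every third input sample.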

@Quackdoc have a look at https://github.com/HEnquist/rubato

It does what you need. I've had no success with the sync FFT methods yet, but SincFixedIn, which is in their main example, works well.

Here's how I'm using it. I get a pop at the end, but the main downsampling is very good:

https://github.com/wavey-ai/soundkit/blob/75bf99c0e220bcfa380c6ae72e626257fb4790e0/src/audio_pipeline.rs#L67

(I had a feeling the synchronous FFT resampling method might be better for wasm, but I haven't tested that and may have misunderstood what it's designed for, as the output is terribly distorted. Still investigating. Hopefully SincInterpolationType::Linear is good enough for real-time use cases.)