argmaxinc / WhisperKit

On-device Inference of Whisper Speech Recognition Models for Apple Silicon

Home Page: https://takeargmax.com/blog/whisperkit

Incorrect timestamps (0.5sec off)

finnvoor opened this issue · comments

The timings of segments/words are sometimes inaccurate. When the attached audio is transcribed (we’re using base.en, but it seems to happen with larger models too), a lot of the segments have start times ~0.5sec after their actual start times. In the example, the word “Like” in “Like before…” should begin at 13.6s but WhisperKit is giving us 14.12s. This is happening for 5 out of the 10 segments in this audio.

I noticed that the segment contains a timing token with an accurate time of 13.6, but it uses 14.12 instead.

WhisperKit.TranscriptionSegment(….start: 14.12, end: 21.7, text: "<|13.04|><|13.60|> Like before, this balloon is still filled with mostly hydrogen. However, this time, about one third of it is oxygen.<|21.28|>", ..., words: Optional([WhisperKit.WordTiming(word: " Like", tokens: [4525], start: 14.12, end: 14.44, probability: 0.8)

When word timestamps are disabled, the segment gets a start time of 13.04, which doesn't account for all the silence.
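One quick way to see the discrepancy is to pull the special timestamp tokens (e.g. `<|13.60|>`) out of the segment text and compare them against the segment's reported `start`/`end` fields. A minimal sketch, assuming the token format printed in the dump above (the helper name is ours, not a WhisperKit API):

```python
import re

def timestamp_tokens(text: str) -> list[float]:
    """Extract timestamp-token values (in seconds) from a segment's text."""
    return [float(m) for m in re.findall(r"<\|(\d+\.\d+)\|>", text)]

segment_text = ("<|13.04|><|13.60|> Like before, this balloon is still filled "
                "with mostly hydrogen. However, this time, about one third of "
                "it is oxygen.<|21.28|>")
tokens = timestamp_tokens(segment_text)
# tokens contains [13.04, 13.6, 21.28]: 13.60 is the accurate start the
# report refers to, while the segment's `start` field was 14.12.
```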

out.m4a.zip

Thanks for the report @finnvoor! We started relying on the accuracy of word timestamps in streaming mode too. This is important, so we will triage and address it.

Low-hanging fruit:

  • Leverage the redundancy in segment and word-level timestamps for consistency checks
  • Implement median filtering in DTW as in the original implementation, even though it didn't have a major impact in our early tests
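For the second item, a simplified NumPy stand-in for the median filtering openai/whisper applies to cross-attention weights before running DTW (illustrative only; the real implementation operates on per-head attention tensors):

```python
import numpy as np

def median_filter(x: np.ndarray, width: int) -> np.ndarray:
    """1-D median filter with edge padding, applied along the last axis."""
    pad = width // 2
    padded = np.pad(x, [(0, 0)] * (x.ndim - 1) + [(pad, pad)], mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, width, axis=-1)
    return np.median(windows, axis=-1)

# A lone spike in an otherwise smooth attention row gets suppressed, which
# makes the DTW alignment path less likely to jump to a spurious frame.
row = np.array([0.1, 0.1, 0.9, 0.1, 0.1])
smoothed = median_filter(row, width=3)
```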

Quick update, I've identified the issue and am putting together a patch for this now.

@finnvoor Please confirm that this fixes your issue 🙏

@ZachNagengast @atiorh gave it a quick test and the start times seem much more precise, thanks for the quick improvement.

It does seem like this has made the end times of words/segments slightly worse, though. Previously, the end times would sometimes include some trailing silence (i.e. be too late), but they never cut into the last word, so they were good for splitting after a word or sentence. Now the trailing silence is handled better, but the timestamp goes a bit too far the other way and cuts off the end of the word. In the same example, at ~4s the word "gas" used to end at 4.06 and now ends at 3.62, but it should end at ~3.8.

[Screenshot: Logic Pro - Untitled - Tracks@2x]

We'll continue to test it a bit more today.

I see, good to know. We might be able to improve this with some VAD (shift the end time to the last point where the sound level was above a threshold), but this is also the same endpoint that openai/whisper gives for its word timestamps, so it might be a model issue, or need a bit more massaging to get perfect. There are many such so-called "hacks" in the main repo that could be improved.
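The energy-threshold adjustment described above could look roughly like this. A hedged sketch with hypothetical names and thresholds, not a WhisperKit implementation: walk backwards from the model's predicted end time and stop at the last frame whose RMS energy exceeds a threshold.

```python
import numpy as np

def refine_end_time(samples: np.ndarray, sample_rate: int, end_time: float,
                    threshold: float = 0.02, frame_ms: float = 10.0) -> float:
    """Shift `end_time` back to the end of the last frame above `threshold`.
    Returns the original estimate if no voiced frame is found."""
    frame_len = int(sample_rate * frame_ms / 1000)
    end_sample = min(int(end_time * sample_rate), len(samples))
    for start in range(end_sample - frame_len, -1, -frame_len):
        frame = samples[start:start + frame_len]
        if np.sqrt(np.mean(frame ** 2)) > threshold:
            return (start + frame_len) / sample_rate
    return end_time  # all silence before end_time; keep the model's estimate

# Synthetic check: a 440 Hz tone for 0.5 s followed by 0.5 s of silence.
sr = 16000
t = np.arange(sr) / sr
sig = np.where(t < 0.5, 0.5 * np.sin(2 * np.pi * 440 * t), 0.0)
refined = refine_end_time(sig, sr, end_time=1.0)
```

The tradeoff is choosing a threshold that ignores room noise but still catches quiet word endings, which is part of why a proper VAD model is "a bit tricky".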

For context: the reason it previously ended that far past the audio is that we were including the punctuation token ".", which has non-zero length, as part of the word's end time. The fix removes that time entirely, so the word now ends exactly where the model thinks "gas" ends, before the punctuation. A next step may be some middle ground where the punctuation counts for part of its token's duration, but not all of it, since it's not a spoken word. Open to ideas here too!
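That middle ground could be as simple as crediting the word with a fraction of the punctuation token's duration. A sketch with hypothetical names and an illustrative 0.5 fraction (not a WhisperKit API), using the numbers from this thread:

```python
def word_end_with_punctuation(word_end: float, punct_end: float,
                              fraction: float = 0.5) -> float:
    """Blend between the pre-fix end (full punctuation duration, punct_end)
    and the post-fix end (none of it, word_end)."""
    return word_end + fraction * (punct_end - word_end)

# "gas" example: the fix gives 3.62, the old behavior gave 4.06, and the
# reporter hears ~3.8; a 0.5 fraction lands close to that.
print(round(word_end_with_punctuation(3.62, 4.06), 2))  # 3.84
```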

Got it, figured we'd eventually run into model limits. In our case I think I'll just try adding a small offset to the end, since it seems pretty consistent, and in general adding silence is better than cutting off words. VAD would be really nice but sounds a bit tricky to implement.
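The fixed-offset workaround can be sketched in a few lines. One detail worth handling (our assumption, with hypothetical names and dict shapes): clamp the padded end to the next word's start so the added silence never overlaps the following word.

```python
def pad_word_ends(words: list[dict], offset: float = 0.2) -> list[dict]:
    """Extend each word's end time by `offset` seconds, clamped so it
    never reaches past the start of the following word."""
    padded = []
    for i, w in enumerate(words):
        end = w["end"] + offset
        if i + 1 < len(words):
            end = min(end, words[i + 1]["start"])
        padded.append({**w, "end": end})
    return padded

# "gas" ends at 3.62 post-fix; a 0.2 s pad moves it to 3.82,
# close to the ~3.8 the reporter expects.
words = [{"word": " gas", "start": 3.2, "end": 3.62},
         {"word": " is", "start": 3.9, "end": 4.1}]
result = pad_word_ends(words)
```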