argmaxinc / WhisperKit

On-device Inference of Whisper Speech Recognition Models for Apple Silicon

Home Page: https://takeargmax.com/blog/whisperkit

Incorrect timestamps (0.5sec off)

finnvoor opened this issue · comments

The timings of segments/words are sometimes inaccurate. When the attached audio is transcribed (we’re using base.en, but it seems to happen with larger models too), a lot of the segments have start times ~0.5sec after their actual start times. In the example, the word “Like” in “Like before…” should begin at 13.6s but WhisperKit is giving us 14.12s. This is happening for 5 out of the 10 segments in this audio.

I noticed that the segment contains a timing token with an accurate time of 13.6, but it uses 14.12 instead.

WhisperKit.TranscriptionSegment(….start: 14.12, end: 21.7, text: "<|13.04|><|13.60|> Like before, this balloon is still filled with mostly hydrogen. However, this time, about one third of it is oxygen.<|21.28|>", ..., words: Optional([WhisperKit.WordTiming(word: " Like", tokens: [4525], start: 14.12, end: 14.44, probability: 0.8)

When word timestamps are disabled, the segment gets a start time of 13.04, which doesn't account for all the silence.
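One quick way to see the discrepancy is to pull the special timestamp tokens (e.g. `<|13.60|>`) out of the segment text and compare them against the segment's reported `start`/`end` fields. A minimal sketch, assuming the token format printed in the dump above (the helper name is ours, not a WhisperKit API):

```python
import re

def timestamp_tokens(text: str) -> list[float]:
    """Extract timestamp-token values (in seconds) from a segment's text."""
    return [float(m) for m in re.findall(r"<\|(\d+\.\d+)\|>", text)]

segment_text = ("<|13.04|><|13.60|> Like before, this balloon is still filled "
                "with mostly hydrogen. However, this time, about one third of "
                "it is oxygen.<|21.28|>")
tokens = timestamp_tokens(segment_text)
# tokens contains [13.04, 13.6, 21.28]: 13.60 is the accurate start the
# report refers to, while the segment's `start` field was 14.12.
```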

out.m4a.zip

Thanks for the report @finnvoor! We started relying on the accuracy of word timestamps in streaming mode too. This is important, so we will triage and address it.

Low-hanging fruit:

  • Leverage the redundancy in segment and word-level timestamps for consistency checks
  • Implement median filtering in DTW as in the original implementation, even though it didn't have a major impact in our early tests
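For the second item, a simplified NumPy stand-in for the median filtering openai/whisper applies to cross-attention weights before running DTW (illustrative only; the real implementation operates on per-head attention tensors):

```python
import numpy as np

def median_filter(x: np.ndarray, width: int) -> np.ndarray:
    """1-D median filter with edge padding, applied along the last axis."""
    pad = width // 2
    padded = np.pad(x, [(0, 0)] * (x.ndim - 1) + [(pad, pad)], mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, width, axis=-1)
    return np.median(windows, axis=-1)

# A lone spike in an otherwise smooth attention row gets suppressed, which
# makes the DTW alignment path less likely to jump to a spurious frame.
row = np.array([0.1, 0.1, 0.9, 0.1, 0.1])
smoothed = median_filter(row, width=3)
```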

Quick update, I've identified the issue and am putting together a patch for this now.

@finnvoor Please confirm that this fixes your issue 🙏

@ZachNagengast @atiorh gave it a quick test and the start times seem much more precise, thanks for the quick improvement.

It does seem like this has made the end times of words/segments slightly worse, though. Previously, the end times would sometimes include some trailing silence (i.e. be too late), but they never cut into the last word, so they were good for splitting after a word or sentence. Now the trailing silence is handled better, but the timestamp goes a bit too far the other way and cuts off the end of the word. In the same example, at ~4s the word "gas" used to end at 4.06 and now ends at 3.62, but it should end at ~3.8.

[Screenshot: Logic Pro - Untitled - Tracks@2x]

We'll continue to test it a bit more today.

I see, good to know. We might be able to improve this with some VAD (shift the end time to the last point where the sound level was above a threshold), but this is also the same endpoint that openai/whisper gives for its word timestamps, so it might be a model issue, or need a bit more massaging to get perfect. There are many such so-called "hacks" in the main repo that could be improved.
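The energy-threshold adjustment described above could look roughly like this. A hedged sketch with hypothetical names and thresholds, not a WhisperKit implementation: walk backwards from the model's predicted end time and stop at the last frame whose RMS energy exceeds a threshold.

```python
import numpy as np

def refine_end_time(samples: np.ndarray, sample_rate: int, end_time: float,
                    threshold: float = 0.02, frame_ms: float = 10.0) -> float:
    """Shift `end_time` back to the end of the last frame above `threshold`.
    Returns the original estimate if no voiced frame is found."""
    frame_len = int(sample_rate * frame_ms / 1000)
    end_sample = min(int(end_time * sample_rate), len(samples))
    for start in range(end_sample - frame_len, -1, -frame_len):
        frame = samples[start:start + frame_len]
        if np.sqrt(np.mean(frame ** 2)) > threshold:
            return (start + frame_len) / sample_rate
    return end_time  # all silence before end_time; keep the model's estimate

# Synthetic check: a 440 Hz tone for 0.5 s followed by 0.5 s of silence.
sr = 16000
t = np.arange(sr) / sr
sig = np.where(t < 0.5, 0.5 * np.sin(2 * np.pi * 440 * t), 0.0)
refined = refine_end_time(sig, sr, end_time=1.0)
```

The tradeoff is choosing a threshold that ignores room noise but still catches quiet word endings, which is part of why a proper VAD model is "a bit tricky".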

For context: the reason it previously ended that far past the audio is that we were including the punctuation token ".", which has non-zero length, as part of the word's end time. The fix removes that time entirely, so the word now ends exactly where the model thinks "gas" ends, before the punctuation. A next step may be some middle ground where the punctuation counts for part of its token's duration, but not all of it, since it's not a spoken word. Open to ideas here too!
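That middle ground could be as simple as crediting the word with a fraction of the punctuation token's duration. A sketch with hypothetical names and an illustrative 0.5 fraction (not a WhisperKit API), using the numbers from this thread:

```python
def word_end_with_punctuation(word_end: float, punct_end: float,
                              fraction: float = 0.5) -> float:
    """Blend between the pre-fix end (full punctuation duration, punct_end)
    and the post-fix end (none of it, word_end)."""
    return word_end + fraction * (punct_end - word_end)

# "gas" example: the fix gives 3.62, the old behavior gave 4.06, and the
# reporter hears ~3.8; a 0.5 fraction lands close to that.
print(round(word_end_with_punctuation(3.62, 4.06), 2))  # 3.84
```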

Got it, figured we'd eventually run into model limits. In our case I think I'll just try adding a small offset to the end, since it seems pretty consistent, and in general adding silence is better than cutting off words. VAD would be really nice but sounds a bit tricky to implement.
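The fixed-offset workaround can be sketched in a few lines. One detail worth handling (our assumption, with hypothetical names and dict shapes): clamp the padded end to the next word's start so the added silence never overlaps the following word.

```python
def pad_word_ends(words: list[dict], offset: float = 0.2) -> list[dict]:
    """Extend each word's end time by `offset` seconds, clamped so it
    never reaches past the start of the following word."""
    padded = []
    for i, w in enumerate(words):
        end = w["end"] + offset
        if i + 1 < len(words):
            end = min(end, words[i + 1]["start"])
        padded.append({**w, "end": end})
    return padded

# "gas" ends at 3.62 post-fix; a 0.2 s pad moves it to 3.82,
# close to the ~3.8 the reporter expects.
words = [{"word": " gas", "start": 3.2, "end": 3.62},
         {"word": " is", "start": 3.9, "end": 4.1}]
result = pad_word_ends(words)
```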