argmaxinc / WhisperKit

On-device Inference of Whisper Speech Recognition Models for Apple Silicon

Home Page: https://takeargmax.com/blog/whisperkit

English text normalization utilization for Eager Streaming Mode

atiorh opened this issue

  • Eager Streaming Mode confirms the currently predicted text tokens only when they match at least one redundant historical prediction.
  • Whisper can output trivially different tokens (e.g. "gonna" vs "going to", "amortisation" vs "amortization") for almost identical audio input. This happens occasionally and causes unnecessary slowdown, because opportunities to confirm predicted text tokens earlier are missed.
  • #99 implements English Text Normalization, which can be integrated into the token confirmation logic in Eager Streaming Mode to avoid these unnecessary slowdowns.
  • Note that this would not alter the actually predicted tokens or the associated KV cache. It only changes the confirmation criterion, so that near matches with a trivial string variation count as matches.
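
The proposal above could be sketched roughly as follows. This is a language-agnostic illustration in Python, not WhisperKit's Swift API and not the normalizer from #99: `_VARIANTS`, `normalize_word`, `normalized_with_origin`, and `confirmed_prefix_length` are hypothetical names, the variant table is a two-entry stand-in for a real English text normalizer, and the comparison is done on whitespace-separated words rather than on Whisper tokens to keep the example short.

```python
import re

# Hypothetical, tiny stand-in for a real English text normalizer.
_VARIANTS = {"gonna": "going to", "amortisation": "amortization"}


def normalize_word(word):
    """Lowercase, strip punctuation, and expand known variants into a list of words."""
    cleaned = re.sub(r"[^\w']", "", word.lower())
    return _VARIANTS.get(cleaned, cleaned).split()


def normalized_with_origin(words):
    """Flatten normalized pieces, remembering which source word each piece came from."""
    pieces = []
    for index, word in enumerate(words):
        for piece in normalize_word(word):
            pieces.append((piece, index))
    return pieces


def confirmed_prefix_length(previous, current):
    """Count the leading words of `current` that `previous` confirms after normalization.

    Only the comparison uses normalized text; the caller keeps `current` unchanged.
    """
    prev_pieces = normalized_with_origin(previous)
    curr_pieces = normalized_with_origin(current)
    matched = 0
    for (prev_piece, _), (curr_piece, _) in zip(prev_pieces, curr_pieces):
        if prev_piece != curr_piece:
            break
        matched += 1
    if matched == len(curr_pieces):
        return len(current)
    # The first unmatched piece belongs to the first unconfirmed source word.
    return curr_pieces[matched][1]


previous = "I'm going to check the amortization".split()
current = "I'm gonna check the amortisation schedule".split()
count = confirmed_prefix_length(previous, current)
confirmed = current[:count]  # original words are returned untouched
```

Because the function returns only a count, the confirmed words are sliced from the original prediction, which mirrors the point that normalization affects the confirmation criterion and not the predicted tokens themselves. A word such as "gonna" that expands to several normalized pieces is confirmed only if all of its pieces match.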

Utilities to help with this will be included in #120.