Use English text normalization in Eager Streaming Mode

Question

atiorh opened this issue 2 months ago · comments

Eager Streaming Mode confirms the currently predicted text tokens only when they match at least one redundant historical prediction.
Whisper sometimes outputs tokens that differ only trivially (e.g. "gonna" vs. "going to", "amortisation" vs. "amortization") for almost identical audio input. When this happens, tokens that could have been confirmed earlier are not, which slows streaming down unnecessarily.
#99 implements English Text Normalization, which can be integrated into the token confirmation logic in Eager Streaming Mode to avoid these unnecessary slowdowns.
Note that this would not change the predicted tokens themselves or the associated KV cache. It would only change the confirmation criterion, so that near matches with a trivial string variation count as matches.
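
To make the proposal concrete, here is a minimal Python sketch of a normalization-aware confirmation check. It is not the implementation from #99 and not the project's API. The `VARIANTS` table and the functions `normalize_word`, `normalize_words`, and `confirmed_word_count` are all hypothetical names introduced for illustration, and whitespace-separated words stand in for text tokens. The sketch only counts how many leading words of the current prediction agree with the previous prediction after normalization. The predicted text itself is left untouched.

```python
import re

# Hypothetical, tiny stand-in for a real English text normalizer.
VARIANTS = {"gonna": "going to", "amortisation": "amortization"}


def normalize_word(word: str) -> list[str]:
    """Lowercase, strip punctuation, and expand known variants into canonical words."""
    cleaned = re.sub(r"[^\w']+", "", word.lower())
    if not cleaned:
        return []
    return VARIANTS.get(cleaned, cleaned).split()


def normalize_words(words: list[str]) -> list[str]:
    """Normalize a whole prediction into one flat list of canonical words."""
    return [n for w in words for n in normalize_word(w)]


def confirmed_word_count(current: list[str], previous: list[str]) -> int:
    """Count leading words of `current` that agree with `previous` after normalization.

    Only the comparison uses normalized text; the caller keeps `current` as predicted.
    """
    prev_norm = normalize_words(previous)
    pos = 0        # position in the normalized previous prediction
    confirmed = 0  # number of original words in `current` confirmed so far
    for word in current:
        norm = normalize_word(word)
        if prev_norm[pos:pos + len(norm)] != norm:
            break
        pos += len(norm)
        confirmed += 1
    return confirmed


current = "I'm gonna check the amortisation schedule".split()
previous = "I'm going to check the amortization table".split()
print(confirmed_word_count(current, previous))
```

With an exact string comparison, only the first word of `current` would be confirmed here, because "gonna" does not equal "going". Comparing normalized forms lets the check continue until the predictions genuinely diverge.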

Zach Nagengast · Answer 1 · Tue May 07 2024 23:53:50 GMT+0800 (China Standard Time)

Utilities to help with this will be included with #120.