English text normalization utilization for Eager Streaming Mode
atiorh opened this issue · comments
Atila Orhon commented
- Eager Streaming Mode relies on confirming the currently predicted text tokens with at least 1 redundant historical prediction.
- Whisper is susceptible to outputting tokens that trivially differ (e.g. "gonna" vs "going to", "amortisation" vs "amortization") for almost identical audio input. This happens occasionally and causes unnecessary slowdown due to missed opportunities to confirm predicted text tokens earlier.
- #99 implements English Text Normalization, which can be integrated into the token confirmation logic in Eager Streaming Mode to avoid these unnecessary slowdowns.
- Note that this would not alter the actually predicted tokens or the associated KV cache. It only changes the confirmation criterion so that near matches with a trivial string variation also count as confirmed.
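
To make the proposal concrete, here is a minimal sketch of a normalization-aware agreement check between the current hypothesis and a historical one. It is written in Python for illustration only; `normalize`, `_REPLACEMENTS`, and `common_prefix_words` are hypothetical names, and the two-entry replacement table is a stand-in for the real English text normalizer from #99, not its actual behavior.

```python
import re

# Hypothetical, tiny stand-in for a real English text normalizer.
# A real normalizer covers far more spelling and contraction variants.
_REPLACEMENTS = {
    "gonna": "going to",
    "amortisation": "amortization",
}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation (keeping apostrophes), map known variants."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    words = [_REPLACEMENTS.get(word, word) for word in text.split()]
    return " ".join(words)


def common_prefix_words(current: str, previous: str,
                        use_normalization: bool = True) -> int:
    """Count the leading words on which two hypotheses agree.

    With normalization enabled, the count is measured in normalized words.
    The input strings themselves are never modified.
    """
    if use_normalization:
        a, b = normalize(current).split(), normalize(previous).split()
    else:
        a, b = current.split(), previous.split()
    agreed = 0
    for x, y in zip(a, b):
        if x != y:
            break
        agreed += 1
    return agreed


current = "I'm gonna compute the amortisation"
previous = "I'm going to compute the amortization schedule"

# Exact comparison stops at the first trivial variation.
print(common_prefix_words(current, previous, use_normalization=False))
# Normalized comparison agrees on the whole current hypothesis.
print(common_prefix_words(current, previous))
```

The normalized text is used only to decide agreement; the predicted tokens are left untouched, in line with the note above. Mapping the normalized agreement length back to token positions in the current prediction is omitted here and would depend on the actual confirmation logic.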
Zach Nagengast commented
Utilities to help with this will be included with #120