huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.


What are the settings used for WER calculation in the paper?

hidoba opened this issue · comments

Did you compare Whisper-large-v2 and Distil-Whisper with the Transformers default settings (beam size = 1, temperature = 1, do_sample = False)?

What would the difference be if you had used the OpenAI settings (beam size = 5)?

Yes, we evaluated using greedy search with no sampling. For beam size = 5, we see the following (the absolute WER reduction vs. greedy is shown in parentheses):

Whisper-Large-v2 with num_beams=5

  • CHIME-4: 11.8 (-0.0 WER abs)
  • Earnings-22: 16.0 (-0.6 WER abs)
  • FLEURS: 3.9 (-0.3 WER abs)
  • SPGISpeech: 3.3 (-0.5 WER abs)

Distil-Whisper with num_beams=5

  • CHIME-4: 13.4 (-0.6 WER abs)
  • Earnings-22: 16.4 (-0.5 WER abs)
  • FLEURS: 6.1 (-0.2 WER abs)
  • SPGISpeech: 3.2 (-0.1 WER abs)
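
If you want to reproduce the comparison yourself, here is a minimal sketch of running both decoding strategies with Transformers. The model ID and the dummy audio clip are just placeholders for illustration, not the exact evaluation script used for the paper:

```python
# Minimal sketch: greedy search vs. beam search (num_beams=5) with Transformers.
# The model ID and the dummy audio sample below are placeholders, not the
# paper's evaluation setup.
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "distil-whisper/distil-large-v2"  # or "openai/whisper-large-v2"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)

# Load a short validation clip as example audio
sample = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)[0]["audio"]
inputs = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
)
input_features = inputs.input_features.to(device)

# Greedy search, no sampling (the Transformers defaults used for the numbers above)
greedy_ids = model.generate(input_features, do_sample=False, num_beams=1)

# Beam search with 5 beams (the OpenAI-style setting discussed above)
beam_ids = model.generate(input_features, do_sample=False, num_beams=5)

print("greedy:", processor.batch_decode(greedy_ids, skip_special_tokens=True)[0])
print("beam-5:", processor.batch_decode(beam_ids, skip_special_tokens=True)[0])
```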

Relative speed-up of Distil-Whisper over Whisper-Large-v2 for increasing batch size (bsz):

  • bsz=1: 5.5x
  • bsz=4: 5.21x
  • bsz=16: 3.01x

=> the speed-ups are very similar to what we achieved without beam search.
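
On the WER question in the title: the thread does not spell out the exact text normalisation, but a common setup is to apply the Whisper English text normaliser to both references and predictions before scoring. A minimal sketch of that recipe (an assumption about a typical pipeline, not the paper's exact script):

```python
# Minimal sketch of WER computation with Whisper-style text normalisation.
# This illustrates one common setup, not necessarily the paper's exact script.
import evaluate
from transformers.models.whisper.english_normalizer import EnglishTextNormalizer

wer_metric = evaluate.load("wer")
normalizer = EnglishTextNormalizer(english_spelling_mapping={})

references = ["The quick brown fox jumps over the lazy dog."]
predictions = ["the quick brown fox jumped over the lazy dog"]

# Normalise both sides (casing, punctuation, number formatting, etc.)
norm_refs = [normalizer(r) for r in references]
norm_preds = [normalizer(p) for p in predictions]

wer = 100 * wer_metric.compute(references=norm_refs, predictions=norm_preds)
print(f"WER: {wer:.2f}%")
```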