huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

Short-form evaluation WER % for LibriSpeech clean test

guynich opened this issue

Hi, I'm enjoying working with this fascinating repo.

Looking at the Stage 4 short-form evaluation, I modified the short-form evaluation bash script for the LibriSpeech clean dataset (test split) for the OpenAI Large-v2 model here and the Small model here.

The generated WER % results are higher than the HuggingFace model card evaluation WER results, which is unexpected.

E.g.:

| Model           | script eval/wer | HF model card WER |
|-----------------|-----------------|-------------------|
| OpenAI Large-v2 | 3.1683          | 3.0004            |
| OpenAI Small    | 4.0682          | 3.4322            |

Any suggestions as to what might be causing these WER differences (perhaps my short-form eval bash scripts)?
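For context, here is a minimal sketch of what such a short-form evaluation boils down to, using the Hugging Face `transformers`, `datasets`, and `evaluate` libraries. This is illustrative, not the repo's actual eval script; in particular, the crude lower-case/strip-punctuation normalization below stands in for the Whisper text normalizer used for the model-card numbers, which by itself can shift WER.

```python
# Minimal short-form WER evaluation sketch (illustrative, not the repo's script).
import re

import evaluate
import torch
from datasets import load_dataset
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=device,
)

# LibriSpeech "clean" config, test split.
dataset = load_dataset("librispeech_asr", "clean", split="test")
wer_metric = evaluate.load("wer")


def normalize(text: str) -> str:
    # Crude stand-in for the Whisper English text normalizer: lower-case and
    # strip punctuation so cased, punctuated hypotheses can be compared with
    # the upper-case, punctuation-free LibriSpeech references.
    return re.sub(r"[^\w\s']", "", text.lower()).strip()


predictions, references = [], []
for sample in dataset:
    output = asr(sample["audio"])  # audio dict with "array" and "sampling_rate"
    predictions.append(normalize(output["text"]))
    references.append(normalize(sample["text"]))

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.4f}%")
```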

The above table is with --language "en" in the short-form bash scripts. Removing this flag and rerunning the evaluation gives lower eval/wer values (a sketch of the two decoding set-ups follows the bullet list below).

E.g.:

| Model           | eval/wer with --language "en" | eval/wer without --language | HF model card WER |
|-----------------|-------------------------------|-----------------------------|-------------------|
| OpenAI Large-v2 | 3.1683                        | 2.5685                      | 3.0004            |
| OpenAI Small    | 4.0682                        | 3.44541                     | 3.4322            |

Without the --language flag:

  • Large-v2 model eval/wer is lower than the HuggingFace model card WER value, and lower than the original OpenAI paper result of 2.7% in Table 2.
  • Small model eval/wer is similar to the HuggingFace model card WER value.
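To show what the flag changes, here is a minimal sketch of the two decoding set-ups with the `transformers` Whisper API, assuming the script's --language "en" flag maps to forcing the <|en|> language token (function and variable names here are illustrative, not the script's):

```python
# Sketch of forced-language vs. auto-detected decoding for Whisper
# (illustrative; the eval script exposes this through its own flag).
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")


def transcribe(audio_array, sampling_rate, force_english: bool) -> str:
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    if force_english:
        # Roughly what --language "en" does: pin the <|en|> token and the
        # transcribe task, skipping Whisper's language-detection step.
        forced_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")
    else:
        # No language forced: Whisper detects the language from the audio.
        forced_ids = None
    with torch.no_grad():
        generated = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```

Under that assumption, the run without --language lets Whisper's language detection pick the token instead of having it pinned, which would be the only difference between the two eval/wer columns above.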

I'm closing this issue: the Small and Tiny model results for the HF model card and for eval/wer without the --language option are sufficiently well aligned for me.

(I don't understand the discrepancy in the Large-v2 values, but I can leave that for now.)