Discrepancy in WER benchmark result on the TED-LIUM dataset
MLMonkATGY opened this issue
Hi.
I am unable to reproduce the paper's benchmark result for the test split of distil-whisper/tedlium with the distil-whisper/distil-large-v2 model when using run_eval.py. However, I am able to achieve reasonable results on all the other dataset benchmarks reported in the paper (< 1% difference). Any idea what could have caused this discrepancy?
I followed the suggestion in issue 131 to use EnglishTextNormalizer instead of BasicTextNormalizer.
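For context, here is a minimal sketch of what that normalizer swap looks like, assuming the openai-whisper package (run_eval.py may import equivalent classes from transformers; the sample sentence is made up):

```python
# Minimal sketch of the normalizer swap, assuming the `openai-whisper`
# package is installed. The sample sentence is illustrative only.
from whisper.normalizers import BasicTextNormalizer, EnglishTextNormalizer

basic = BasicTextNormalizer()
english = EnglishTextNormalizer()

text = "It's the third time we've won!"
print(basic(text))    # lowercases and strips punctuation/symbols
print(english(text))  # also expands contractions and standardizes numbers/spellings
```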
Reported WER from paper: 9.6%
Achieved WER: 12.69%
Difference: 3.09%
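For reference, a hedged sketch of how a WER figure like the ones above is typically computed, using the Hugging Face evaluate library together with the English normalizer; the toy strings below are mine and not taken from the benchmark:

```python
# Hedged sketch of the WER computation, assuming the `evaluate` and
# `openai-whisper` packages. The reference/prediction strings are
# made-up examples, not benchmark data.
import evaluate
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()
wer_metric = evaluate.load("wer")

raw_references = ["It's the third time we've tried this."]
raw_predictions = ["it is the 3rd time we have tried this"]

# Normalizing both sides before scoring is what makes the choice of
# normalizer matter so much for the final number.
references = [normalizer(r) for r in raw_references]
predictions = [normalizer(p) for p in raw_predictions]

wer = 100 * wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2f}%")
```

With EnglishTextNormalizer both sides should normalize to the same string, so the toy WER comes out to 0%; under BasicTextNormalizer the contractions and the ordinal would count as errors.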
Command:

```sh
python run_eval.py \
  --model_name_or_path "distil-whisper/distil-large-v2" \
  --dataset_name "distil-whisper/tedlium" \
  --dataset_config_name "release3" \
  --dataset_split_name "test" \
  --text_column_name "text" \
  --batch_size 64 \
  --dtype "bfloat16" \
  --generation_max_length 256 \
  --language "en" \
  --attn_implementation "flash_attention_2"
```
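One quick sanity check (my suggestion, not from the paper) is to peek at the raw reference text for this split, since TED-LIUM transcripts can contain markers such as `<unk>` that interact badly with normalization:

```python
# Sanity check: inspect a few raw reference transcripts, assuming
# streaming access to the distil-whisper/tedlium dataset on the Hub.
from datasets import load_dataset

ds = load_dataset("distil-whisper/tedlium", "release3", split="test", streaming=True)
for sample in ds.take(3):
    print(repr(sample["text"]))
```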
Modification: Used EnglishTextNormalizer as the text normalizer.
Thanks in advance.
I'm facing the same issue; only TED-LIUM has this discrepancy.