Discrepancy in WER benchmark result on the TED-LIUM dataset
MLMonkATGY opened this issue
Hi.
I am unable to reproduce the paper's benchmark result for the test split of distil-whisper/tedlium with the distil-whisper/distil-large-v2 model when using run_eval.py. However, I am able to achieve reasonable results on all the other dataset benchmarks reported in the paper (< 1% difference). Any idea what could have caused this discrepancy?
I followed the suggestion in issue 131 to use EnglishTextNormalizer instead of BasicTextNormalizer.
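For context, here is a minimal sketch of what that normalizer swap looks like, assuming the openai-whisper package (run_eval.py may import equivalent classes from transformers; the sample sentence is made up):

```python
# Minimal sketch of the normalizer swap, assuming the `openai-whisper`
# package is installed. The sample sentence is illustrative only.
from whisper.normalizers import BasicTextNormalizer, EnglishTextNormalizer

basic = BasicTextNormalizer()
english = EnglishTextNormalizer()

text = "It's the third time we've won!"
print(basic(text))    # lowercases and strips punctuation/symbols
print(english(text))  # also expands contractions and standardizes numbers/spellings
```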
Reported WER from paper: 9.6%
Achieved WER: 12.69%
Difference: 3.09%
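For reference, a hedged sketch of how a WER figure like the ones above is typically computed, using the Hugging Face evaluate library together with the English normalizer; the toy strings below are mine and not taken from the benchmark:

```python
# Hedged sketch of the WER computation, assuming the `evaluate` and
# `openai-whisper` packages. The reference/prediction strings are
# made-up examples, not benchmark data.
import evaluate
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()
wer_metric = evaluate.load("wer")

raw_references = ["It's the third time we've tried this."]
raw_predictions = ["it is the 3rd time we have tried this"]

# Normalizing both sides before scoring is what makes the choice of
# normalizer matter so much for the final number.
references = [normalizer(r) for r in raw_references]
predictions = [normalizer(p) for p in raw_predictions]

wer = 100 * wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2f}%")
```

With EnglishTextNormalizer both sides should normalize to the same string, so the toy WER comes out to 0%; under BasicTextNormalizer the contractions and the ordinal would count as errors.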
Command:

```sh
python run_eval.py \
  --model_name_or_path "distil-whisper/distil-large-v2" \
  --dataset_name "distil-whisper/tedlium" \
  --dataset_config_name "release3" \
  --dataset_split_name "test" \
  --text_column_name "text" \
  --batch_size 64 \
  --dtype "bfloat16" \
  --generation_max_length 256 \
  --language "en" \
  --attn_implementation "flash_attention_2"
```
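One quick sanity check (my suggestion, not from the paper) is to peek at the raw reference text for this split, since TED-LIUM transcripts can contain markers such as `<unk>` that interact badly with normalization:

```python
# Sanity check: inspect a few raw reference transcripts, assuming
# streaming access to the distil-whisper/tedlium dataset on the Hub.
from datasets import load_dataset

ds = load_dataset("distil-whisper/tedlium", "release3", split="test", streaming=True)
for sample in ds.take(3):
    print(repr(sample["text"]))
```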
Modification: Used EnglishTextNormalizer as the text normalizer.
Thanks in advance.
I'm facing the same issue; only TED-LIUM has this discrepancy.