Very high WER % in extensive benchmark on Fleurs
asusdisciple opened this issue
On which dataset exactly did you evaluate the model? I benchmarked this model on the original Fleurs dataset, along with all the other Whisper implementations. It performed far worse, with a WER of 1.5 compared to ~0.46 for the original Whisper. Did I make an implementation error?
Here is how I initialize the model with temp=0, beams=1, do_sample=True:
import logging

from transformers import pipeline

# `parameters` and `device` are defined elsewhere in my script.
model = "distil-whisper/distil-large-v2"
obj = pipeline(model=model,
               torch_dtype=parameters["torch_dtype"],
               device=device,  # or "mps" for Mac devices
               chunk_length_s=15,
               batch_size=parameters["batch_size"],
               return_timestamps=False,
               model_kwargs={"use_flash_attention_2": parameters["flash"]},
               generate_kwargs={"task": "transcribe",
                                "num_beams": parameters["beam_size"],
                                "temperature": parameters["temperature"],
                                "do_sample": parameters["do_sample"]})
if not parameters["flash"]:
    # Fall back to BetterTransformer (via optimum) when Flash Attention 2
    # is not available.
    logging.debug("Using Better Transformers without Flash Attention")
    obj.model = obj.model.to_bettertransformer()
else:
    logging.debug("Using Flash Attention 2")
This is how I call the transcription:
tmp = obj(audiopath,  # call the pipeline object, not the checkpoint id string
          generate_kwargs={"language": lang}
          )
res = [i["text"] for i in tmp]
Hey @asusdisciple - what language were you using? It would be really helpful to have a reproducible end-to-end script I can use to get the same results that you're reporting.
We use the script run_eval.py with the following launch command: https://github.com/huggingface/distil-whisper/blob/main/training/flax/evaluation_scripts/test/run_distilled.sh
If you execute this, you'll get the results quoted in the paper.
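For a quick sanity check without the full Flax setup, here is a minimal sketch of the same kind of evaluation using the datasets and evaluate libraries. This is not run_eval.py itself; the dataset id google/fleurs, the en_us config, the transcription field, and the cuda:0 device are assumptions about the Hub version of FLEURS and a typical GPU setup:

import torch
from datasets import load_dataset
from evaluate import load
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0",  # assumption: a single CUDA GPU is available
)

# FLEURS English test split from the Hub.
dataset = load_dataset("google/fleurs", "en_us", split="test")
wer_metric = load("wer")

predictions, references = [], []
for sample in dataset:
    out = pipe(sample["audio"])
    predictions.append(out["text"])
    references.append(sample["transcription"])

# WER is returned as a fraction; multiply by 100 for a percentage.
print(100 * wer_metric.compute(predictions=predictions, references=references))

Note that, as far as I can tell, the paper's numbers are computed on normalized text (Whisper's English normalizer), so an unnormalized loop like this will typically report a somewhat higher WER.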
I am sorry, I just overlooked that distil-whisper is English-only (even though it performs very well on a few other languages as well).
Hey @asusdisciple, no worries! If you're interested in training Whisper on a different language, you can leverage the training code under distil-whisper/training. I recommend first setting up a baseline using these instructions: https://huggingface.co/sanchit-gandhi/distil-whisper-large-v3-de-kd#training-procedure
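As a rough sketch, a checkpoint distilled this way can be used exactly like the English ones; the model id below is the German baseline from the link above, and the audio path is a hypothetical placeholder:

from transformers import pipeline

# German distilled baseline from the model card linked above.
pipe = pipeline(
    "automatic-speech-recognition",
    model="sanchit-gandhi/distil-whisper-large-v3-de-kd",
)

# "sample_de.wav" is a placeholder path for illustration.
print(pipe("sample_de.wav", generate_kwargs={"language": "german"})["text"])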