huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.


Quantize distil-whisper?

sujitvasanth opened this issue

Hi, I was wondering if there would be any speed gain and size reduction from quantizing distil-whisper, e.g. with bitsandbytes, ONNX, or GPTQ?
There is a gain from quantizing the Whisper model itself without much quality loss; see here:
https://medium.com/@daniel-klitzke/quantizing-openais-whisper-with-the-huggingface-optimum-library-30-faster-inference-64-36d9815190e0
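
For bitsandbytes, something like the sketch below is what I had in mind (untested; the checkpoint name `distil-whisper/distil-large-v2` and the audio file are just placeholders for whichever variant and input you use):

```python
from transformers import (
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    BitsAndBytesConfig,
    pipeline,
)

# Assumed checkpoint name; swap in whichever distil-whisper variant you want.
model_id = "distil-whisper/distil-large-v2"

# Load the weights in 8-bit via bitsandbytes (load_in_4bit=True is the other option).
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Wrap in the usual ASR pipeline and transcribe a placeholder audio file.
asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
print(asr("sample.wav")["text"])
```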

You may wonder why quantize at all: I am running several models simultaneously in an AI assistant that uses an LLM (quantized OpenChat), a multimodal vision model (LLaVA or moondream), and a wake-word model (openWakeWord). It runs on my device with 24 GB of VRAM, but I want to share it with as many users as possible, so I'd like to keep the VRAM usage low.

I was looking to quantize the large-v3 model, since it has the lowest word error rate and is the second fastest, or perhaps the medium.en model.

Can anyone point me in the direction of a quantized version of distil-whisper, or explain how I can generate one and use it for inference?
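
The article linked above uses the Hugging Face optimum library with ONNX Runtime; my best guess at adapting that approach to distil-whisper is the sketch below (untested; the checkpoint name and output directories are placeholders, and the quantized file names assume optimum's default `*_quantized.onnx` suffix):

```python
from pathlib import Path
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "distil-whisper/distil-large-v2"   # assumed checkpoint name
onnx_dir = Path("distil-whisper-onnx")        # export target (placeholder)
quant_dir = Path("distil-whisper-onnx-int8")  # quantized output (placeholder)

# Export the PyTorch checkpoint to ONNX (produces separate encoder/decoder files).
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)
model.save_pretrained(onnx_dir)

# Apply dynamic int8 quantization to each exported ONNX file.
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
for onnx_file in onnx_dir.glob("*.onnx"):
    quantizer = ORTQuantizer.from_pretrained(onnx_dir, file_name=onnx_file.name)
    quantizer.quantize(save_dir=quant_dir, quantization_config=qconfig)

# Reload the quantized files for inference (file names assume optimum's
# default "<name>_quantized.onnx" naming).
quantized = ORTModelForSpeechSeq2Seq.from_pretrained(
    quant_dir,
    encoder_file_name="encoder_model_quantized.onnx",
    decoder_file_name="decoder_model_quantized.onnx",
    decoder_with_past_file_name="decoder_with_past_model_quantized.onnx",
)
```

If that works, the quantized model should drop into the same `pipeline("automatic-speech-recognition", ...)` call as the bitsandbytes sketch above, but I haven't verified it end to end.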