MiscellaneousStuff / openai-whisper-cpu

Improving transcription performance of OpenAI Whisper for CPU-based deployment

OpenAI Whisper - CPU

About

Experiments applying quantization methods to the OpenAI Whisper ASR model to improve inference speed and throughput for CPU-based deployments. This is motivated by the fact that, although Whisper greatly improves the accessibility of SOTA ASR and does not depend on the cloud for high-quality transcription, many end users cannot run the model out of the box because most consumer computers only have CPUs rather than high-performance GPUs.

Quantization could allow the larger Whisper models to run faster on laptops without a GPU.

Hardware for experiments:
CPU - AMD Ryzen 5 5600X
RAM - 32GB DDR4
GPU - Nvidia GeForce RTX 3060 Ti
Storage - M.2 SSD

Usage

First, fetch the fork of the OpenAI Whisper repo that contains the modifications needed for CPU dynamic quantization:

git submodule init
git submodule update

Then install the module using:

pip install -e ./whisper
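
Once installed, the fork is used like the upstream package. Below is a minimal sketch (not code from this repo) of loading a model on CPU, quantizing it, and transcribing a file; the model size and audio path are placeholders:

import torch
import whisper

# Load a model on CPU; larger models benefit more from quantization.
model = whisper.load_model("tiny.en", device="cpu")

# Swap every nn.Linear for a dynamically quantized int8 version.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Transcribe an audio file (placeholder path); fp16=False avoids the CPU fp16 warning.
result = quantized_model.transcribe("audio/sample.wav", fp16=False)
print(result["text"])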

Explanation

Quantization of the Whisper model requires changing the custom Linear() layers within the model to nn.Linear(). This is because you need to specify which layer types to dynamically quantize, for example:

import torch

# Dynamically quantize every nn.Linear layer in the fp32 model to int8 weights.
quantized_model = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

The Whisper model, however, is designed to be adaptable, i.e. it can run at different precisions, so its Linear() layer contains custom code to account for this. That custom behavior is not needed for the quantized model. You can either change the Linear() layers in "/whisper/whisper/model.py" yourself, or simply use the installation instructions above.
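
For context, the upstream model.py defines its Linear layer roughly as the subclass sketched below (illustrative, not a verbatim copy of the upstream file). Because quantize_dynamic keys its mapping on the concrete module class, instances of the subclass are not matched by {torch.nn.Linear}, which is why the fork replaces them with plain nn.Linear:

import torch.nn.functional as F
from torch import Tensor, nn

# Roughly what upstream Whisper's custom layer looks like: weights are cast
# to the input's dtype so the same model can run in fp16 or fp32.
class Linear(nn.Linear):
    def forward(self, x: Tensor) -> Tensor:
        return F.linear(
            x,
            self.weight.to(x.dtype),
            None if self.bias is None else self.bias.to(x.dtype),
        )

# quantize_dynamic({torch.nn.Linear}) matches modules by their exact class,
# so instances of the subclass above are left untouched; swapping them for
# plain nn.Linear lets the quantization actually take effect.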

Results

Test audio is the first 30 seconds of:
https://www.youtube.com/watch?v=oKOtzIo-uYw

Device  Whisper Model  Data Type      Linear Layer  Inference Time
GPU     tiny           fp32           Linear        0.5s
CPU     tiny           fp32           nn.Linear     2.3s
CPU     tiny           qint8 (quant)  nn.Linear     3.1s (0.74x slowdown)

The quantized tiny model runs 9.67x faster than real time, but only at 0.74x the speed of the original fp32 model (quantization actually slows the tiny model down).

Device  Whisper Model  Data Type      Linear Layer  Inference Time
GPU     base           fp32           Linear        0.6s
CPU     base           fp32           nn.Linear     5.2s
CPU     base           qint8 (quant)  nn.Linear     3.2s (1.62x speedup)

The quantized base model runs 9.37x faster than real time and 1.62x faster than the original fp32 model.

Device  Whisper Model  Data Type      Linear Layer  Inference Time
GPU     small          fp32           Linear        0.7s
CPU     small          fp32           nn.Linear     19.1s
CPU     small          qint8 (quant)  nn.Linear     6.9s (2.76x speedup)

The quantized small model runs 4.34x faster than real time and 2.76x faster than the original fp32 model.

Device  Whisper Model  Data Type      Linear Layer  Inference Time
GPU     medium         fp32           Linear        1.7s
CPU     medium         fp32           nn.Linear     60.7s
CPU     medium         qint8 (quant)  nn.Linear     23.1s (2.62x speedup)

The quantized medium model runs 1.29x faster than real time and 2.62x faster than the original fp32 model.
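
A rough sketch of how these timings can be reproduced (this is not the repo's exact benchmarking code; the audio path is a placeholder for the 30-second test clip):

import time

import torch
import whisper

AUDIO = "audio/test_30s.wav"  # placeholder for the first 30s of the test video
CLIP_SECONDS = 30.0

for size in ["tiny", "base", "small", "medium"]:
    model = whisper.load_model(size, device="cpu")
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    start = time.perf_counter()
    quantized.transcribe(AUDIO, fp16=False)
    elapsed = time.perf_counter() - start

    # "x faster than real time" = clip length divided by inference time.
    print(f"{size}: {elapsed:.1f}s ({CLIP_SECONDS / elapsed:.2f}x real time)")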

Docker

Build the Docker image.

docker build -t whisper-cpu . 

Run the quantized model.

docker run --rm -v "$(pwd)/audio":/usr/src/app/audio -v "$(pwd)/script":/usr/src/app/script whisper-cpu python3 ./script/custom_whisper.py audio/path_to_dir_or_audio_file --language English --model medium.en 
  • -v "$(pwd)/audio":/usr/src/app/audio mounts your local audio directory into the container so it can read your audio files.

  • -v "$(pwd)/script":/usr/src/app/script mounts the directory containing the custom start script; transcription results are also written here.

  • Note: you might want to adjust ./script/custom_whisper.py for your own needs; a rough sketch of such a script follows below.
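
The real script ships with the repo. As a rough idea of what a driver like custom_whisper.py does, here is an illustrative sketch: only the CLI shape is taken from the docker run example above, everything else is an assumption and the actual script may differ.

# Illustrative sketch only; the actual script/custom_whisper.py may differ.
import argparse
from pathlib import Path

import torch
import whisper


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("audio", help="audio file or directory of audio files")
    parser.add_argument("--language", default="English")
    parser.add_argument("--model", default="medium.en")
    args = parser.parse_args()

    # Load on CPU and dynamically quantize the Linear layers, as described above.
    model = whisper.load_model(args.model, device="cpu")
    model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    path = Path(args.audio)
    files = sorted(p for p in path.iterdir() if p.is_file()) if path.is_dir() else [path]
    for f in files:
        result = model.transcribe(str(f), language=args.language, fp16=False)
        # Write each transcript into the mounted script volume, mirroring the
        # note above that results are stored there (assumed output location).
        Path("script", f.stem + ".txt").write_text(result["text"])


if __name__ == "__main__":
    main()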


License: MIT

