Whisper Edge

Porting OpenAI Whisper speech recognition to edge devices with hardware ML accelerators, enabling always-on live voice transcription. Current work includes Jetson Nano and Coral Edge TPU.

Jetson Nano

Shopping cart

Part	Price (2023)
NVIDIA Jetson Nano Developer Kit (4G)	$149.00
ChanGeek CGS-M1 USB Microphone	$16.99
Noctua NF-A4x10 5V Fan (or similar, recommended)	$13.95
D-Link DWA-181 Wi-Fi Adapter (or similar, optional)	$21.94

Model

The base.en version of Whisper seems to work best for the Jetson Nano:

base is the largest model size that fits into the 4GB of memory without modification.
Inference performance with base is ~10x real-time in isolation and ~1x real-time while recording concurrently.
Using the English-only .en version further lowers the word error rate (WER), to under 5% on LibriSpeech test-clean.
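To make the real-time figures concrete, here is a minimal sketch of the arithmetic. It assumes "Nx real-time" means seconds of audio processed per second of wall-clock time; the helper names `real_time_factor` and `keeps_up` are hypothetical and not part of this repository.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return seconds of audio processed per second of wall-clock time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def keeps_up(audio_seconds: float, processing_seconds: float) -> bool:
    """A live stream keeps up only if a chunk is processed at least as fast as it is recorded."""
    return real_time_factor(audio_seconds, processing_seconds) >= 1.0


# Illustrative numbers only: a 10-second chunk transcribed in 1 second is
# 10x real-time; the same chunk transcribed in 10 seconds is 1x real-time.
print(real_time_factor(10, 1))   # 10.0
print(real_time_factor(10, 10))  # 1.0
```

At roughly 1x real-time there is little headroom: any chunk that takes longer to transcribe than to record makes the transcription fall behind the microphone.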

Hack

Dilemma:

Whisper and some of its dependencies require Python 3.8.
The latest supported version of JetPack for Jetson Nano is 4.6.3, which ships with Python 3.6.
There is no easy way to update Python to 3.8 without losing CUDA support for PyTorch.

Workaround:

Fork whisper and tiktoken and backport both to Python 3.6.
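A change like this mostly comes down to rewriting syntax that only newer Python versions accept. The sketch below is not taken from the forks; it only illustrates the general kind of rewrite, using a hypothetical function `first_number`. The Python 3.8 form is shown in the comment, and the code underneath is an equivalent that also runs on Python 3.6.

```python
import re

# Python 3.8+ only (assignment expression):
#
#     if (match := re.search(r"\d+", text)) is not None:
#         return int(match.group())
#
# Equivalent that also runs on Python 3.6:
def first_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)  # plain assignment instead of ":="
    if match is None:
        return None
    return int(match.group())


print(first_number("JetPack 4.6.3"))
```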

Setup

First, follow the developer kit setup instructions, connect the Wi-Fi adapter and the microphone via USB, and ideally install a fan. (Plugging in an Ethernet cable also speeds up the downloads.) Then, get a shell on the Jetson Nano:

ssh user@jetson-nano.local

We will use NVIDIA Docker containers to run inference. Get the source code and build the custom container:

git clone https://github.com/maxbbraun/whisper-edge.git
bash whisper-edge/build.sh

Run

Launch inference:

bash whisper-edge/run.sh

You should see console output similar to this:

I0317 00:42:23.979984 547488051216 stream.py:75] Loading model "base.en"...
100%|#######################################| 139M/139M [00:30<00:00, 4.71MiB/s]
I0317 00:43:14.232425 547488051216 stream.py:79] Warming model up...
I0317 00:43:55.164070 547488051216 stream.py:86] Starting stream...
I0317 00:44:19.775566 547488051216 stream.py:51]
I0317 00:44:22.046195 547488051216 stream.py:51] Open AI's mission is to ensure that artificial general intelligence
I0317 00:44:31.353919 547488051216 stream.py:51] benefits all of humanity.
I0317 00:44:49.219501 547488051216 stream.py:51]
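If you want only the transcribed text, the console output can be filtered. The following is a rough sketch: the line pattern is inferred from the sample output above, and the default source location `stream.py:51` is simply what that sample shows, so it may differ in other versions. `LOG_LINE` and `transcript_lines` are hypothetical names.

```python
import re

# Pattern inferred from the sample output:
# "<level><MMDD> <HH:MM:SS.micros> <thread id> <file>:<line>] <message>"
LOG_LINE = re.compile(
    r"^[A-Z]\d{4} \d{2}:\d{2}:\d{2}\.\d{6}\s+\d+ (?P<source>\S+:\d+)\] ?(?P<message>.*)$"
)


def transcript_lines(lines, source="stream.py:51"):
    """Yield non-empty messages that were logged from the given source location."""
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None or match.group("source") != source:
            continue  # not a log line, or logged from somewhere else
        message = match.group("message").strip()
        if message:  # skip chunks that produced no text
            yield message


sample = [
    'I0317 00:43:55.164070 547488051216 stream.py:86] Starting stream...',
    'I0317 00:44:19.775566 547488051216 stream.py:51]',
    "I0317 00:44:22.046195 547488051216 stream.py:51] Open AI's mission is to ensure that artificial general intelligence",
    'I0317 00:44:31.353919 547488051216 stream.py:51] benefits all of humanity.',
]
print(" ".join(transcript_lines(sample)))
```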

The stream.py script run in the container accepts flags for different configurations:

bash whisper-edge/run.sh --help

       USAGE: stream.py [flags]
flags:

stream.py:
  --channel_index: The index of the channel to use for transcription.
    (default: '0')
    (an integer)
  --chunk_seconds: The length in seconds of each recorded chunk of audio.
    (default: '10')
    (an integer)
  --input_device: The input device used to record audio.
    (default: 'plughw:2,0')
  --language: The language to use or empty to auto-detect.
    (default: 'en')
  --latency: The latency of the recording stream.
    (default: 'low')
  --model_name: The version of the OpenAI Whisper model to use.
    (default: 'base.en')
  --num_channels: The number of channels of the recorded audio.
    (default: '1')
    (an integer)
  --sample_rate: The sample rate of the recorded audio.
    (default: '16000')
    (an integer)

Try --helpfull to get a list of all flags.
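The --chunk_seconds, --sample_rate, and --num_channels flags together determine how much audio each chunk holds, and --channel_index picks which channel is transcribed. The sketch below works through that arithmetic with the default values. It assumes 16-bit samples (as in the S16_LE format used for the test recording under Troubleshooting) and an interleaved channel layout; the helper names are hypothetical and this is not how stream.py is necessarily implemented.

```python
def chunk_frames(chunk_seconds: int = 10, sample_rate: int = 16000) -> int:
    """Number of audio frames (one sample per channel) in one recorded chunk."""
    return chunk_seconds * sample_rate


def chunk_bytes(chunk_seconds: int = 10, sample_rate: int = 16000,
                num_channels: int = 1, bytes_per_sample: int = 2) -> int:
    """Size of one chunk in bytes; bytes_per_sample=2 assumes 16-bit audio."""
    return chunk_frames(chunk_seconds, sample_rate) * num_channels * bytes_per_sample


def select_channel(interleaved, num_channels: int = 1, channel_index: int = 0):
    """Pick one channel out of interleaved samples (assumed layout: ch0 ch1 ch0 ch1 ...)."""
    if not 0 <= channel_index < num_channels:
        raise ValueError("channel_index out of range")
    return interleaved[channel_index::num_channels]


print(chunk_frames())   # 160000
print(chunk_bytes())    # 320000
print(select_channel([1, 2, 3, 4, 5, 6], num_channels=2, channel_index=1))  # [2, 4, 6]
```

A shorter --chunk_seconds means text appears sooner but gives the model less context per chunk, so the default of 10 seconds is a trade-off rather than a hard requirement.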

Troubleshooting

To see if the microphone is working properly, use alsa-utils:

sudo apt-get -y install alsa-utils

# Is the USB device connected?
lsusb

# Is the correct recording device selected?
arecord -l

# Is the gain set properly?
alsamixer

# Does a test recording work?
arecord --format=S16_LE --duration=5 --rate=16000 --channels=1 --device=plughw:2,0 test.wav
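To check the test recording without listening to it, you can inspect test.wav with Python's standard library. This is a minimal sketch, assuming a 16-bit PCM file as produced by the arecord command above; `inspect_wav` is a hypothetical helper, not part of this repository.

```python
import array
import sys
import wave


def inspect_wav(path):
    """Return (sample_rate, channels, duration_seconds, peak) for a 16-bit PCM WAV file.

    peak is the largest absolute sample value, scaled to the range 0.0 to 1.0.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples (S16_LE)")
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
        samples = array.array("h", wav.readframes(frames))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    peak = max((abs(s) for s in samples), default=0) / 32768.0
    duration = frames / sample_rate
    return sample_rate, channels, duration, peak


# Example, after running the arecord command above:
#     print(inspect_wav("test.wav"))
```

A peak close to 0.0 suggests the gain is too low or the wrong device is selected, while a peak at or near 1.0 suggests clipping. The sample rate and channel count should match the values passed to --sample_rate and --num_channels.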

Coral Edge TPU

See the corresponding issue for a discussion of what support for the Google Coral Edge TPU might look like.
