
Speech-to-Text Benchmark

Made in Vancouver, Canada by Picovoice

This repo provides a minimalist, extensible framework for benchmarking different speech-to-text engines.

Table of Contents

  • Data
  • Metrics
  • Engines
  • Usage
  • Results

Data

The benchmark is run on the following publicly available speech datasets:

  • LibriSpeech (test-clean and test-other)
  • TED-LIUM
  • CommonVoice

Metrics

Word Error Rate

Word error rate (WER) is the ratio of the word-level edit distance between the reference transcript and the output of the speech-to-text engine to the number of words in the reference transcript.
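
For example, if the reference is "the cat sat on the mat" and the engine outputs "the cat sat on mat", the edit distance is 1 (one deleted word) and the WER is 1/6 ≈ 16.7%. Below is a minimal sketch of the computation using a word-level Levenshtein distance; it is illustrative only, not the benchmark's own implementation.

# Toy WER computation: word-level Levenshtein edit distance between the
# reference and the hypothesis, divided by the number of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the
    # first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,                 # deletion
                           dp[i][j - 1] + 1,                 # insertion
                           dp[i - 1][j - 1] + substitution)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167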

Real Time Factor

Real-time factor (RTF) is the ratio of CPU (processing) time to the duration of the input speech file. A speech-to-text engine with a lower RTF is more computationally efficient. We omit this metric for cloud-based engines.
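
For example, an engine that needs 6 seconds of CPU time to transcribe a 60-second recording has an RTF of 0.1. A minimal sketch of the measurement is shown below, where transcribe(path) is a hypothetical callable standing in for an engine:

import time
import wave

def real_time_factor(transcribe, wav_path: str) -> float:
    # Duration of the audio, in seconds, read from the WAV header.
    with wave.open(wav_path, "rb") as wav:
        audio_seconds = wav.getnframes() / wav.getframerate()
    # CPU (processing) time spent inside the engine.
    start = time.process_time()
    transcribe(wav_path)
    processing_seconds = time.process_time() - start
    return processing_seconds / audio_seconds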

Model Size

The aggregate size of models (acoustic and language), in MB. We omit this metric for cloud-based engines.
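
A minimal sketch of how such a figure can be obtained, assuming the model is stored as one or more files on disk (the file names below are examples, not part of this repo):

import os

def model_size_mb(model_paths) -> float:
    # Total on-disk size of the model files (e.g., acoustic + language model), in MB.
    return sum(os.path.getsize(path) for path in model_paths) / (1024 * 1024)

print(model_size_mb(["deepspeech-0.9.3-models.pbmm", "deepspeech-0.9.3-models.scorer"]))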

Engines

The following engines are benchmarked:

  • Amazon Transcribe
  • Azure Speech-to-Text
  • Google Speech-to-Text
  • IBM Watson Speech-to-Text
  • Mozilla DeepSpeech
  • Picovoice Cheetah
  • Picovoice Leopard

Usage

This benchmark has been developed and tested on Ubuntu 20.04.

  • Install FFmpeg.
  • Download the datasets.
  • Install the requirements:
pip3 install -r requirements.txt

Amazon Transcribe Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset folder, and ${AWS_PROFILE} with the name of the AWS profile you wish to use.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AMAZON_TRANSCRIBE \
--aws-profile ${AWS_PROFILE}

Azure Speech-to-Text Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset folder, and ${AZURE_SPEECH_KEY} and ${AZURE_SPEECH_LOCATION} with the credentials from your Azure account.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine AZURE_SPEECH_TO_TEXT \
--azure-speech-key ${AZURE_SPEECH_KEY} \
--azure-speech-location ${AZURE_SPEECH_LOCATION}

Google Speech-to-Text Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset folder, and ${GOOGLE_APPLICATION_CREDENTIALS} with the path to the credentials file downloaded from Google Cloud Platform.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine GOOGLE_SPEECH_TO_TEXT \
--google-application-credentials ${GOOGLE_APPLICATION_CREDENTIALS}

IBM Watson Speech-to-Text Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset folder, and ${WATSON_SPEECH_TO_TEXT_API_KEY} and ${WATSON_SPEECH_TO_TEXT_URL} with the credentials from your IBM account.

python3 benchmark.py \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--engine IBM_WATSON_SPEECH_TO_TEXT \
--watson-speech-to-text-api-key ${WATSON_SPEECH_TO_TEXT_API_KEY} \
--watson-speech-to-text-url ${WATSON_SPEECH_TO_TEXT_URL}

Mozilla DeepSpeech Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset folder, ${DEEP_SPEECH_MODEL} with the path to the DeepSpeech model file (.pbmm), and ${DEEP_SPEECH_SCORER} with the path to the DeepSpeech scorer file (.scorer).

python3 benchmark.py \
--engine MOZILLA_DEEP_SPEECH \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--deepspeech-pbmm ${DEEP_SPEECH_MODEL} \
--deepspeech-scorer ${DEEP_SPEECH_SCORER}

Picovoice Cheetah Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset folder, and ${PICOVOICE_ACCESS_KEY} with your AccessKey obtained from the Picovoice Console.

python3 benchmark.py \
--engine PICOVOICE_CHEETAH \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}

Picovoice Leopard Instructions

Replace ${DATASET} with one of the supported datasets, ${DATASET_FOLDER} with the path to the dataset folder, and ${PICOVOICE_ACCESS_KEY} with your AccessKey obtained from the Picovoice Console.

python3 benchmark.py \
--engine PICOVOICE_LEOPARD \
--dataset ${DATASET} \
--dataset-folder ${DATASET_FOLDER} \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}

Results

Word Error Rate (WER)

| Engine | LibriSpeech test-clean | LibriSpeech test-other | TED-LIUM | CommonVoice | Average |
| --- | --- | --- | --- | --- | --- |
| Amazon Transcribe | 5.20% | 9.58% | 4.25% | 15.94% | 8.74% |
| Azure Speech-to-Text | 4.96% | 9.66% | 4.99% | 12.09% | 7.93% |
| Google Speech-to-Text | 11.23% | 24.94% | 15.00% | 30.68% | 20.46% |
| Google Speech-to-Text (Enhanced) | 6.62% | 13.59% | 6.68% | 18.39% | 11.32% |
| IBM Watson Speech-to-Text | 11.08% | 26.38% | 11.89% | 38.81% | 22.04% |
| Mozilla DeepSpeech | 7.27% | 21.45% | 18.90% | 43.82% | 22.86% |
| Picovoice Cheetah | 7.08% | 16.28% | 10.89% | 23.10% | 14.34% |
| Picovoice Leopard | 5.39% | 12.45% | 9.04% | 17.13% | 11.00% |

The Average column is the unweighted mean of the WERs across the four datasets.

RTF

Measurements were carried out on an Ubuntu 20.04 machine with an Intel Core i5-6500 CPU (3.20 GHz), 64 GB of RAM, and NVMe storage.

| Engine | RTF | Model Size |
| --- | --- | --- |
| Mozilla DeepSpeech | 0.46 | 1142 MB |
| Picovoice Cheetah | 0.07 | 19 MB |
| Picovoice Leopard | 0.05 | 19 MB |
