Speech to Intent Benchmark

This is a framework to benchmark the accuracy of Picovoice's speech-to-intent engine (a.k.a rhino) against other natural language understanding engines. For more information regarding the engine refer to its repository directly here. This repository contains all data and code to reproduce the results. In this benchmark we evaluate the accuracy of engine for the context of voice enabled coffee maker. You can listen to one of the sample commands here. In order to simulate the real-life situations we have tested in two noisy conditions (1) Cafe and (2) Kitchen. You can listen to samples of noisy data here and here.

Additionally we compare the accuracy of rhino with Google Dialogflow, Amazon Lex, IBM Watson and Microsoft LUIS.

Data

The speech commands are crowd sourced from more than 50 unique speakers. Each speaker contributed about 10 different commands. Collectively there are 619 commands used in this benchmark. Noise is downloaded from Freesound.

Usage

Clone the directory and its submodules via

git clone --recurse-submodules https://github.com/Picovoice/speech-to-intent-benchmark.git

The repository grabs the latest version of rhino as a Git submodule under rhino. All data needed for this benchmark including speech, noise, and labels are provided under data. Additionally the Dialogflow agents used in this benchmark are exported here. The Amazon Lex bots used in this benchmark are exported here. The benchmark code is located under benchmark.

The first step is to mix the clean speech data under clean with noise. There are two types of noise used for this benchmark (1) cafe and (2) kitchen. In order to create noisy test data enter the following from the root of the repository in shell

python benchmark/mixer.py ${NOISE}

${NOISE} can be either kitchen or cafe.

Create accuracy results for running the noisy spoken commands through a NLU engine by running the following

python benchmark/benchmark.py --engine_type ${AN_ENGINE_TYPE} --noise ${NOISE}

The valid options for the engine_type parameter are: AMAZON_LEX, GOOGLE_DIALOGFLOW, IBM_WATSON, MICROSOFT_LUIS and PICOVOICE_RHINO.

In order to run the noisy spoken commands through Dialogflow API, include your Google Cloud Platform credential path and Google Cloud Platform project ID like the following

python benchmark/benchmark.py --engine_type GOOGLE_DIALOGFLOW --gcp_credential_path ${GOOGLE_CLOUD_PLATFORM_CREDENTIAL_PATH} --gcp_project_id ${GOOGLE_CLOUD_PLATFORM_PROJECT_ID} --noise ${NOISE}

To run the noisy spoken commands through IBM Watson API, include your IBM Cloud credential path and your Natural Language Understanding model ID.
If you already have a custom Speech to Text language model, include its ID using the argument --ibm_custom_id. Otherwise, a new custom language model will be created for you.

python benchmark/benchmark.py --engine_type IBM_WATSON --ibm_credential_path ${IBM_CREDENTIAL_PATH} --ibm_model_id ${IBM_MODEL_ID} --noise ${NOISE}

Before running noisy spoken commands through Microsoft LUIS, add your LUIS credentials and Speech credentials into /data/luis/credentials.env.

Results

Below is the result of benchmark. Command Acceptance Probability (Accuracy) is defined as the probability of the engine to correctly understand the speech command.
The Amazon Lex bot, Google Dialogflow agent, IBM Watson model and LUIS app used to produce these results were all trained on all 432 sample utterances.

tekkkon / speech-to-intent-benchmark

Speech to Intent Benchmark

Data

Usage

Results

About

Languages