Wake word detection modeling for Firefox Voice, supporting open datasets like Google Speech Commands and Mozilla Common Voice.
Citation:
@article{tang2020howl,
title={Howl: A Deployed, Open-Source Wake Word Detection System},
author={Raphael Tang and Jaejun Lee and Afsaneh Razi and Julia Cambre and Ian Bicking and Jofish Kaye and Jimmy Lin},
journal={arXiv:2008.09606},
year={2020}
}
A proper pip package is coming soon. Until then, install from source:
- git clone https://github.com/castorini/howl && cd howl
- Install PyTorch by following your platform-specific instructions.
- Install PyAudio and its dependencies through your distribution's package system.
- pip install -r requirements.txt (some apt packages might need to be installed)
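After the installation steps, a quick check like the following can confirm that the two manually installed dependencies are importable. This is a minimal sketch and not part of Howl: the helper name `missing_modules` is hypothetical, while `torch` and `pyaudio` are the import names of PyTorch and PyAudio.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be located on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # torch and pyaudio are the import names for PyTorch and PyAudio.
    missing = missing_modules(["torch", "pyaudio"])
    if missing:
        print("Not installed yet:", ", ".join(missing))
    else:
        print("PyTorch and PyAudio are importable.")
```

The check only locates the modules without importing them, so it stays fast even when PyTorch is installed.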
In the example that follows, we describe how to train a custom detector for the word "fire".
- Download a supported data source. We recommend Common Voice for its breadth and free license.
- To provide alignment for the data, install Montreal Forced Aligner (MFA) and download an English pronunciation dictionary.
- Create a positive dataset containing the keyword:
VOCAB='["fire"]' INFERENCE_SEQUENCE=[0] DATASET_PATH=data/fire-positive python -m howl.run.create_raw_dataset --negative-pct 0 -i ~/path/to/common-voice --positive-pct 100
- Create a negative dataset without the keyword:
VOCAB='["fire"]' INFERENCE_SEQUENCE=[0] DATASET_PATH=data/fire-negative python -m howl.run.create_raw_dataset --negative-pct 5 -i ~/path/to/common-voice --positive-pct 0
- Generate stub alignment for the negative set, where alignment does not matter:
DATASET_PATH=data/fire-negative python -m howl.run.attach_alignment --align-type stub
- Use MFA to generate alignment for the positive set:
mfa_align data/fire-positive/audio eng.dict pretrained_models/english.zip output-folder
- Attach the MFA alignment to the positive dataset:
DATASET_PATH=data/fire-positive python -m howl.run.attach_alignment --align-type mfa -i output-folder
- Source the relevant environment variables for training the res8 model:
source envs/res8.env
- Train the model:
python -m howl.run.train -i data/fire-negative data/fire-positive --model res8 --workspace workspaces/fire-res8
- For the CLI demo, run:
python -m howl.run.demo --model res8 --workspace workspaces/fire-res8
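The command-line steps above can also be collected into one script, which makes the order and the per-step environment variables easier to see. The following is a rough sketch rather than a Howl utility: `build_steps` and `run_steps` are hypothetical names, the `word` parameter is an illustrative generalization of "fire", and the Common Voice path is the placeholder from the text. It prints the commands by default and only executes them when `dry_run=False`.

```python
import json
import os
import shlex
import subprocess


def build_steps(common_voice="~/path/to/common-voice", word="fire"):
    """Return the quickstart commands as (extra_env, argv) pairs, in order."""
    source = os.path.expanduser(common_voice)
    positive, negative = f"data/{word}-positive", f"data/{word}-negative"
    keyword_env = {"VOCAB": json.dumps([word]), "INFERENCE_SEQUENCE": "[0]"}
    workspace = f"workspaces/{word}-res8"
    py = ["python", "-m"]
    return [
        ({**keyword_env, "DATASET_PATH": positive},
         py + ["howl.run.create_raw_dataset", "--negative-pct", "0",
               "-i", source, "--positive-pct", "100"]),
        ({**keyword_env, "DATASET_PATH": negative},
         py + ["howl.run.create_raw_dataset", "--negative-pct", "5",
               "-i", source, "--positive-pct", "0"]),
        ({"DATASET_PATH": negative},
         py + ["howl.run.attach_alignment", "--align-type", "stub"]),
        ({}, ["mfa_align", f"{positive}/audio", "eng.dict",
              "pretrained_models/english.zip", "output-folder"]),
        ({"DATASET_PATH": positive},
         py + ["howl.run.attach_alignment", "--align-type", "mfa",
               "-i", "output-folder"]),
        ({}, py + ["howl.run.train", "-i", negative, positive,
                   "--model", "res8", "--workspace", workspace]),
        ({}, py + ["howl.run.demo", "--model", "res8",
                   "--workspace", workspace]),
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for extra_env, argv in steps:
        prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in extra_env.items())
        print(f"{prefix} {shlex.join(argv)}".strip())
        if not dry_run:
            # Inherit the caller's environment so that variables exported by
            # sourcing envs/res8.env beforehand reach the training step.
            subprocess.run(argv, env={**os.environ, **extra_env}, check=True)


if __name__ == "__main__":
    run_steps(build_steps())
```

Because a Python process cannot source a shell file for itself, this sketch expects `source envs/res8.env` to be run in the calling shell first.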
First, follow the installation instructions in the quickstart guide.
- Download the Google Speech Commands dataset and extract it.
- Source the appropriate environment variables:
source envs/res8.env
- Set the dataset path to the root folder of the Speech Commands dataset:
export DATASET_PATH=/path/to/dataset
- Train the res8 model:
NUM_EPOCHS=20 MAX_WINDOW_SIZE_SECONDS=1 VOCAB='["yes","no","up","down","left","right","on","off","stop","go"]' BATCH_SIZE=64 LR_DECAY=0.8 LEARNING_RATE=0.01 python -m howl.run.pretrain_gsc --model res8
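To get a feel for how LEARNING_RATE, LR_DECAY, and NUM_EPOCHS interact in the command above, the sketch below tabulates a learning-rate schedule. It rests on an assumption the text does not confirm: that the rate is multiplied by LR_DECAY once per epoch. The function name `lr_schedule` is hypothetical; check the training code for the actual rule before relying on the numbers.

```python
def lr_schedule(learning_rate, lr_decay, num_epochs):
    """Per-epoch learning rates, ASSUMING one multiplicative decay per epoch."""
    return [learning_rate * lr_decay ** epoch for epoch in range(num_epochs)]


if __name__ == "__main__":
    # Values taken from the Speech Commands training command.
    schedule = lr_schedule(learning_rate=0.01, lr_decay=0.8, num_epochs=20)
    for epoch in (0, 1, 19):
        print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.2e}")
```

Under that assumption, the rate falls from 0.01 to roughly 1.4e-4 by the twentieth epoch.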
- Download the Hey Firefox corpus, licensed under CC0, and extract it.
- Download our noise dataset, built from Microsoft SNSD and MUSAN, and extract it.
- Source the appropriate environment variables:
source envs/res8.env
- Set the noise dataset path to the root folder:
export NOISE_DATASET_PATH=/path/to/snsd
- Set the Hey Firefox dataset path to the root folder:
export DATASET_PATH=/path/to/hey_firefox
- Train the model:
LR_DECAY=0.98 VOCAB='[" hey","fire","fox"]' USE_NOISE_DATASET=True BATCH_SIZE=16 INFERENCE_THRESHOLD=0 NUM_EPOCHS=300 NUM_MELS=40 INFERENCE_SEQUENCE=[0,1,2] MAX_WINDOW_SIZE_SECONDS=0.5 python -m howl.run.train --model res8 --workspace workspaces/hey-ff-res8
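VOCAB and INFERENCE_SEQUENCE are both passed as JSON-formatted strings. The sketch below decodes them and pairs each sequence entry with a vocabulary word. It assumes that every INFERENCE_SEQUENCE entry is an index into VOCAB, which the commands suggest but the text does not state outright; `decode_sequence` is a hypothetical helper, not a Howl function.

```python
import json


def decode_sequence(vocab_json, sequence_json):
    """Map INFERENCE_SEQUENCE indices onto VOCAB entries (assumed relationship)."""
    vocab = json.loads(vocab_json)
    sequence = json.loads(sequence_json)
    for index in sequence:
        if not 0 <= index < len(vocab):
            raise ValueError(f"index {index} is outside VOCAB of size {len(vocab)}")
    return [vocab[index] for index in sequence]


if __name__ == "__main__":
    # Values copied from the Hey Firefox training command, including the
    # leading space in " hey".
    print(decode_sequence('[" hey","fire","fox"]', "[0,1,2]"))
```

Running a check like this before a 300-epoch job catches a mismatched index or a malformed JSON string early.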
- Download the Hey Snips dataset.
- Convert the dataset to a format that Howl can load:
VOCAB='["hey","snips"]' INFERENCE_SEQUENCE=[0,1] DATASET_PATH=data/hey-snips python -m howl.run.create_raw_dataset --dataset-type 'hey-snips' -i ~/path/to/hey_snips_dataset
- Generate some mock alignment for the dataset, where we don't care about alignment:
DATASET_PATH=data/hey-snips python -m howl.run.attach_alignment --align-type stub
- Use MFA to generate alignment for the dataset:
mfa_align data/hey-snips/audio eng.dict pretrained_models/english.zip output-folder
- Attach the MFA alignment to the dataset:
DATASET_PATH=data/hey-snips python -m howl.run.attach_alignment --align-type mfa -i output-folder
- Source the appropriate environment variables:
source envs/res8.env
- Set the noise dataset path to the root folder:
export NOISE_DATASET_PATH=/path/to/snsd
- Set the Hey Snips dataset path to the root folder:
export DATASET_PATH=/path/to/hey-snips
- Train the model:
LR_DECAY=0.98 VOCAB='[" hey","snips"]' USE_NOISE_DATASET=True BATCH_SIZE=16 INFERENCE_THRESHOLD=0 NUM_EPOCHS=300 NUM_MELS=40 INFERENCE_SEQUENCE=[0,1] MAX_WINDOW_SIZE_SECONDS=0.5 python -m howl.run.train --model res8 --workspace workspaces/hey-snips-res8
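Training with USE_NOISE_DATASET=True relies on both DATASET_PATH and NOISE_DATASET_PATH being exported and pointing at extracted folders. A small pre-flight check along these lines can catch a missing variable or a mistyped path before the run starts. It is only a sketch; `check_dataset_env` is a hypothetical helper and not part of Howl.

```python
import os


def check_dataset_env(env, required=("DATASET_PATH", "NOISE_DATASET_PATH")):
    """Return a list of problems with the dataset variables; empty means none found."""
    problems = []
    for name in required:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(os.path.expanduser(value)):
            problems.append(f"{name} does not point to a directory: {value}")
    return problems


if __name__ == "__main__":
    for problem in check_dataset_env(os.environ):
        print(problem)
```

Passing the environment as an argument keeps the function easy to try with a plain dictionary.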