This repository uses the wav2vec2 model from Hugging Face Transformers to build an ASR system that takes a speech signal as input and outputs transcriptions asynchronously.
I have also written a post explaining wav2vec2 in some detail, with some further learning directions.
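At its core, the pipeline loads a pretrained wav2vec2 CTC model from Transformers, feeds it 16 kHz audio, and decodes the CTC logits. Below is a minimal sketch of that flow; the `facebook/wav2vec2-base-960h` checkpoint and the sample path are illustrative choices, not necessarily what the scripts in this repo use:

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Illustrative checkpoint; the repo's scripts may load a different model/pipeline.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load audio and resample to the 16 kHz rate wav2vec2 expects (path is assumed).
waveform, sample_rate = torchaudio.load("data/samples/rec.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Max (greedy) decoding over the CTC output.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```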
- Download and install Python
- Create a virtual environment using
python -m venv env_name
- Activate the created environment
env_path\Scripts\activate
- Install PyTorch
pip install torch==1.8.0+cu102 torchaudio===0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
- Install required dependencies
pip install -r requirements.txt
- Download and install miniconda
- Create a new virtual environment using
conda create --name env_name python==3.8
- Activate the created environment
conda activate env_name
- Install PyTorch
conda install pytorch torchaudio cudatoolkit=11.1 -c pytorch
- Install required dependencies
pip install -r requirements.txt
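Whichever setup you used, a quick sanity check like the following confirms the installed versions and whether PyTorch can see a GPU:

```python
import torch
import torchaudio

print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```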
- run `python asr_inference_offline.py` with parameters:
  - `--model` or `-m`: path to a locally saved wav2vec2 CTC model; if not passed, it will be downloaded (defaults to None)
  - `--pipeline` or `-t`: path to a locally saved wav2vec2 pipeline; if not passed, it will be downloaded (defaults to None)
  - `--output` or `-out`: path to an output file in which to save transcriptions (not required)
  - `--device` or `-d`: device to use for inference (choices: ["cpu", "cuda"]; defaults to cpu)
  - `--lm` or `-l`: path to the folder in which a trained language model is saved, with unigram and bigram files. This language model is used by the beam search algorithm to weight the scores of beams (defaults to None)
  - `--beam_width` or `-bw`: beam width to use for the beam search decoder during inference (defaults to 1). If `beam_width <= 1`, max decoding is used to decode the CTC outputs; otherwise beam search decoding is used (see the max-decoding sketch after the examples below).
- example
python asr_inference_offline.py --recording data/samples/rec.wav -out output/transcription.txt
python asr_inference_offline.py --recording data/samples/rec.wav --device cuda
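For intuition, max decoding of the CTC output amounts to taking the argmax token per frame, collapsing repeats, and dropping blanks. A minimal sketch follows; the vocabulary list and blank index are assumptions, and the repo's actual decoder may differ:

```python
import torch

def max_decode(logits, vocab, blank_id=0):
    """Greedy (max) CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    ids = torch.argmax(logits, dim=-1).tolist()  # best token id per frame
    collapsed = [t for n, t in enumerate(ids) if n == 0 or t != ids[n - 1]]
    return "".join(vocab[t] for t in collapsed if t != blank_id)
```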
- run `python asr_inference_recording.py` with parameters:
  - `--recording` or `-rec`: path to the audio recording
  - `--model` or `-m`: path to a locally saved wav2vec2 CTC model; if not passed, it will be downloaded (defaults to None)
  - `--pipeline` or `-t`: path to a locally saved wav2vec2 pipeline; if not passed, it will be downloaded (defaults to None)
  - `--blocksize` or `-bs`: size of each audio block to be passed to the model (defaults to 16000)
  - `--overlap` or `-ov`: overlap between consecutive loaded blocks (defaults to 0); see the block-reading sketch after the examples below
  - `--output` or `-out`: path to an output file in which to save transcriptions (not required)
  - `--device` or `-d`: device to use for inference (choices: ["cpu", "cuda"]; defaults to cpu)
  - `--lm` or `-l`: path to the folder in which a trained language model is saved, with unigram and bigram files. This language model is used by the beam search algorithm to weight the scores of beams (defaults to None)
  - `--beam_width` or `-bw`: beam width to use for the beam search decoder during inference (defaults to 1). If `beam_width <= 1`, max decoding is used to decode the CTC outputs; otherwise beam search decoding is used.
- example
python asr_inference_recording.py --recording data/samples/rec.wav -bs 16000 -out output/transcription.txt
python asr_inference_recording.py --recording data/samples/rec.wav -bs 16000 -ov 1600 -out output/transcription.txt
python asr_inference_recording.py --recording data/samples/rec.wav -bs 16000 -ov 1600 -out output/transcription.txt --device cuda
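A minimal sketch of reading a recording in fixed-size, optionally overlapping blocks, mirroring `--blocksize` and `--overlap`; the `soundfile` library and the file path are assumptions, and the actual script may load audio differently:

```python
import soundfile as sf

# Read the recording in 16000-sample blocks with 1600 samples of overlap,
# matching the example command lines above.
for block in sf.blocks("data/samples/rec.wav", blocksize=16000, overlap=1600, dtype="float32"):
    # Each block would be passed to the wav2vec2 model and its transcription appended.
    print(block.shape)
```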
- run `python asr_inference_live.py` with parameters (a live-capture sketch follows the examples below):
  - `--model` or `-m`: path to a locally saved wav2vec2 CTC model; if not passed, it will be downloaded (defaults to None)
  - `--pipeline` or `-t`: path to a locally saved wav2vec2 pipeline; if not passed, it will be downloaded (defaults to None)
  - `--blocksize` or `-bs`: size of each audio block to be passed to the model (defaults to 16000)
  - `--output` or `-out`: path to an output file in which to save transcriptions (not required)
  - `--device` or `-d`: device to use for inference (choices: ["cpu", "cuda"]; defaults to cpu)
  - `--lm` or `-l`: path to the folder in which a trained language model is saved, with unigram and bigram files. This language model is used by the beam search algorithm to weight the scores of beams (defaults to None)
  - `--beam_width` or `-bw`: beam width to use for the beam search decoder during inference (defaults to 1). If `beam_width <= 1`, max decoding is used to decode the CTC outputs; otherwise beam search decoding is used.
- example
python asr_inference_live.py -bs 16000 -out output/transcription.txt
python asr_inference_live.py
python asr_inference_live.py --device cuda
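A rough sketch of capturing microphone audio in fixed-size blocks with `sounddevice` (assumed here; the queue-based handoff is illustrative, not necessarily how the live script works):

```python
import queue

import sounddevice as sd

audio_blocks = queue.Queue()

def on_audio(indata, frames, time, status):
    # Called on the audio thread for every 16000-sample block from the microphone.
    audio_blocks.put(indata.copy())

# 16 kHz mono capture in 16000-sample blocks; stop with Ctrl+C.
with sd.InputStream(samplerate=16000, channels=1, blocksize=16000, callback=on_audio):
    while True:
        block = audio_blocks.get()
        # block (shape: [16000, 1]) would be fed to the model for transcription.
        print(block.shape)
```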
- run `python asr_inference_live.py` with parameters (a unigram/bigram counting sketch follows below):
  - `--corpus` or `-c`: path to the corpus text file
  - `--save` or `-s`: folder path in which to save the model files
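For intuition, collecting the unigram and bigram statistics from a corpus can look roughly like this; the file names and layout written below are assumptions, since the exact format expected by the beam search decoder is not specified here:

```python
from collections import Counter
from pathlib import Path

def build_counts(corpus_path, save_dir):
    """Count word unigrams and bigrams in a text corpus and save them (assumed format)."""
    unigrams, bigrams = Counter(), Counter()
    for line in Path(corpus_path).read_text(encoding="utf-8").splitlines():
        words = line.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "unigrams.txt").write_text(
        "\n".join(f"{w}\t{c}" for w, c in unigrams.most_common()), encoding="utf-8")
    (out / "bigrams.txt").write_text(
        "\n".join(f"{a} {b}\t{c}" for (a, b), c in bigrams.most_common()), encoding="utf-8")
```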
All notebooks reside in the notebook folder; they are handy when using Google Colab or similar platforms. All of these notebooks have been tested in Google Colab.
- `wav2vec2_asr_pretrained_inference`: basic inference notebook
- `wav2vec2_experiment_language_model`: KenLM language model with beam search
- `wav2vec2large_experiment_language_model`: KenLM language model with beam search for the larger model
- `wav2vec2_finetuning_version_1`: fine-tuning notebook without augmentation
- `wav2vec2_finetuning_version_2_with_data_augmentations`: fine-tuning notebook with augmentation
- `Training_Simple_Lanugage_Model`: language model training notebook using Wikipedia data
For a 4 min 10 sec recorded audio clip, the total time taken was:
- GPU (Nvidia GeForce 940MX) : 18.29sec
- CPU : 116.85sec
- Environment Setup ✔
- Inferencing with CPU ✔
- Inferencing with GPU ✔
- Asyncio Compatible ✔
- Training and Finetuning Notebooks ✔
- Training and Finetuning Scripts
- Converting the model to TensorFlow via ONNX for TensorFlow-based inference
- Native Windows 10 ✔
- Windows 10 WSL2 (CPU) ✔
- Windows 10 WSL2 (GPU) ✔
- Linux ✔