This repository uses the wav2vec2 model from Hugging Face Transformers to build an ASR system that takes a speech signal as input and outputs transcriptions asynchronously.
I have also written a post explaining wav2vec2 in some detail, with some further learning directions.
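At its core, the pipeline loads a pretrained wav2vec2 CTC model from Transformers, feeds it 16 kHz audio, and decodes the CTC logits. Below is a minimal sketch of that flow; the `facebook/wav2vec2-base-960h` checkpoint and the sample path are illustrative choices, not necessarily what the scripts in this repo use:

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Illustrative checkpoint; the repo's scripts may load a different model/pipeline.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load audio and resample to the 16 kHz rate wav2vec2 expects (path is assumed).
waveform, sample_rate = torchaudio.load("data/samples/rec.wav")
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Max (greedy) decoding over the CTC output.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```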
- Download and install Python
- Create a virtual environment using
python -m venv env_name
- Activate the created environment
env_path\Scripts\activate
- Install PyTorch
pip install torch==1.8.0+cu102 torchaudio===0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
- Install required dependencies
pip install -r requirements.txt
- Download and install miniconda
- Create a new virtual environment using
conda create --name env_name python==3.8
- Activate the created environment
conda activate env_name
- Install PyTorch
conda install pytorch torchaudio cudatoolkit=11.1 -c pytorch
- Install required dependencies
pip install -r requirements.txt
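Whichever setup you used, a quick sanity check like the following confirms the installed versions and whether PyTorch can see a GPU:

```python
import torch
import torchaudio

print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```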
- run `python asr_inference_offline.py` with parameters:
  - `--model` or `-m`: path to a locally saved wav2vec2 CTC model; if not passed, it will be downloaded (defaults to None)
  - `--pipeline` or `-t`: path to a locally saved wav2vec2 pipeline; if not passed, it will be downloaded (defaults to None)
  - `--output` or `-out`: path to an output file in which to save transcriptions (not required)
  - `--device` or `-d`: device to use for inference (choices: ["cpu", "cuda"]; defaults to cpu)
  - `--lm` or `-l`: path to the folder in which a trained language model is saved, with unigram and bigram files. This language model is used by the beam search algorithm to weight the scores of beams (defaults to None)
  - `--beam_width` or `-bw`: beam width to use for the beam search decoder during inference (defaults to 1). If `beam_width <= 1`, max decoding is used to decode the CTC outputs; otherwise beam search decoding is used (see the max-decoding sketch after the examples below).
- example
python asr_inference_offline.py --recording data/samples/rec.wav -out output/transcription.txt
python asr_inference_offline.py --recording data/samples/rec.wav --device cuda
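For intuition, max decoding of the CTC output amounts to taking the argmax token per frame, collapsing repeats, and dropping blanks. A minimal sketch follows; the vocabulary list and blank index are assumptions, and the repo's actual decoder may differ:

```python
import torch

def max_decode(logits, vocab, blank_id=0):
    """Greedy (max) CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    ids = torch.argmax(logits, dim=-1).tolist()  # best token id per frame
    collapsed = [t for n, t in enumerate(ids) if n == 0 or t != ids[n - 1]]
    return "".join(vocab[t] for t in collapsed if t != blank_id)
```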
- run `python asr_inference_recording.py` with parameters:
  - `--recording` or `-rec`: path to the audio recording
  - `--model` or `-m`: path to a locally saved wav2vec2 CTC model; if not passed, it will be downloaded (defaults to None)
  - `--pipeline` or `-t`: path to a locally saved wav2vec2 pipeline; if not passed, it will be downloaded (defaults to None)
  - `--blocksize` or `-bs`: size of each audio block to be passed to the model (defaults to 16000)
  - `--overlap` or `-ov`: overlap between consecutive loaded blocks (defaults to 0); see the block-reading sketch after the examples below
  - `--output` or `-out`: path to an output file in which to save transcriptions (not required)
  - `--device` or `-d`: device to use for inference (choices: ["cpu", "cuda"]; defaults to cpu)
  - `--lm` or `-l`: path to the folder in which a trained language model is saved, with unigram and bigram files. This language model is used by the beam search algorithm to weight the scores of beams (defaults to None)
  - `--beam_width` or `-bw`: beam width to use for the beam search decoder during inference (defaults to 1). If `beam_width <= 1`, max decoding is used to decode the CTC outputs; otherwise beam search decoding is used.
- example
python asr_inference_recording.py --recording data/samples/rec.wav -bs 16000 -out output/transcription.txt
python asr_inference_recording.py --recording data/samples/rec.wav -bs 16000 -ov 1600 -out output/transcription.txt
python asr_inference_recording.py --recording data/samples/rec.wav -bs 16000 -ov 1600 -out output/transcription.txt --device cuda
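A minimal sketch of reading a recording in fixed-size, optionally overlapping blocks, mirroring `--blocksize` and `--overlap`; the `soundfile` library and the file path are assumptions, and the actual script may load audio differently:

```python
import soundfile as sf

# Read the recording in 16000-sample blocks with 1600 samples of overlap,
# matching the example command lines above.
for block in sf.blocks("data/samples/rec.wav", blocksize=16000, overlap=1600, dtype="float32"):
    # Each block would be passed to the wav2vec2 model and its transcription appended.
    print(block.shape)
```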
- run `python asr_inference_live.py` with parameters (a live-capture sketch follows the examples below):
  - `--model` or `-m`: path to a locally saved wav2vec2 CTC model; if not passed, it will be downloaded (defaults to None)
  - `--pipeline` or `-t`: path to a locally saved wav2vec2 pipeline; if not passed, it will be downloaded (defaults to None)
  - `--blocksize` or `-bs`: size of each audio block to be passed to the model (defaults to 16000)
  - `--output` or `-out`: path to an output file in which to save transcriptions (not required)
  - `--device` or `-d`: device to use for inference (choices: ["cpu", "cuda"]; defaults to cpu)
  - `--lm` or `-l`: path to the folder in which a trained language model is saved, with unigram and bigram files. This language model is used by the beam search algorithm to weight the scores of beams (defaults to None)
  - `--beam_width` or `-bw`: beam width to use for the beam search decoder during inference (defaults to 1). If `beam_width <= 1`, max decoding is used to decode the CTC outputs; otherwise beam search decoding is used.
- example
python asr_inference_live.py -bs 16000 -out output/transcription.txt
python asr_inference_live.py
python asr_inference_live.py --device cuda
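A rough sketch of capturing microphone audio in fixed-size blocks with `sounddevice` (assumed here; the queue-based handoff is illustrative, not necessarily how the live script works):

```python
import queue

import sounddevice as sd

audio_blocks = queue.Queue()

def on_audio(indata, frames, time, status):
    # Called on the audio thread for every 16000-sample block from the microphone.
    audio_blocks.put(indata.copy())

# 16 kHz mono capture in 16000-sample blocks; stop with Ctrl+C.
with sd.InputStream(samplerate=16000, channels=1, blocksize=16000, callback=on_audio):
    while True:
        block = audio_blocks.get()
        # block (shape: [16000, 1]) would be fed to the model for transcription.
        print(block.shape)
```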
- run `python asr_inference_live.py` with parameters (a unigram/bigram counting sketch follows below):
  - `--corpus` or `-c`: path to the corpus text file
  - `--save` or `-s`: folder path in which to save the model files
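For intuition, collecting the unigram and bigram statistics from a corpus can look roughly like this; the file names and layout written below are assumptions, since the exact format expected by the beam search decoder is not specified here:

```python
from collections import Counter
from pathlib import Path

def build_counts(corpus_path, save_dir):
    """Count word unigrams and bigrams in a text corpus and save them (assumed format)."""
    unigrams, bigrams = Counter(), Counter()
    for line in Path(corpus_path).read_text(encoding="utf-8").splitlines():
        words = line.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "unigrams.txt").write_text(
        "\n".join(f"{w}\t{c}" for w, c in unigrams.most_common()), encoding="utf-8")
    (out / "bigrams.txt").write_text(
        "\n".join(f"{a} {b}\t{c}" for (a, b), c in bigrams.most_common()), encoding="utf-8")
```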
All notebooks reside in the notebook folder; they are handy when using Google Colab or similar platforms. All of these notebooks have been tested in Google Colab.
- `wav2vec2_asr_pretrained_inference`: basic inference notebook
- `wav2vec2_experiment_language_model`: KenLM language model with beam search
- `wav2vec2large_experiment_language_model`: KenLM language model with beam search for the larger model
- `wav2vec2_finetuning_version_1`: fine-tuning notebook without augmentation
- `wav2vec2_finetuning_version_2_with_data_augmentations`: fine-tuning notebook with augmentation
- `Training_Simple_Lanugage_Model`: language model training notebook using Wikipedia data
For a 4 min 10 sec recorded audio clip, the total time taken was:
- GPU (Nvidia GeForce 940MX) : 18.29sec
- CPU : 116.85sec
- Environment Setup ✔
- Inferencing with CPU ✔
- Inferencing with GPU ✔
- Asyncio Compatible ✔
- Training and Finetuning Notebooks ✔
- Training and Finetuning Scripts
- Converting the model to TensorFlow via ONNX for TensorFlow-based inference
- Native Windows 10 ✔
- Windows 10 WSL2 (CPU) ✔
- Windows 10 WSL2 (GPU) ✔
- Linux ✔