vietnamese-asr vietnamese-speech-recognition vietnamese-speech-to-text ctc-decode quartznet vlsp vlsp2020 ljspeech-format speech-recognition speech-to-text streamlit-app demo-asr w2v speech-to-text-app

Vietnamese-Speech-Recognition

Introduction

In this repo, I focused on building an end-to-end speech recognition pipeline using QuartzNet, wav2vec 2.0, and a CTC decoder backed by beam search and a language model.

Setup

Datasets

Here I used the 100-hour public speech dataset from Vinbigdata, which is a small, clean subset of the VLSP2020 ASR competition data. Some information about this dataset can be found in data/Data_Workspace.ipynb. The data format I use for training and evaluation follows LJSpeech, so I created data/custom.py to convert the given dataset.

mkdir data/LJSpeech-1.1 
python data/custom.py # create data format for training quartznet & w2v2.0

Below is the folder structure I used. Note that metadata.csv has 2 columns: file name and transcript.

├───data
│   ├───LJSpeech-1.1
│   │   ├───wavs
│   │   └───metadata.csv
│   └───vlsp2020_train_set_02
├───datasets
├───demo
├───models
│   └───quartznet
│       └───base
├───tools
└───utils
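
As a rough illustration of the two-column metadata.csv described above, here is a minimal sketch of writing and reading such a file. It is not the code in data/custom.py. The `|` separator, the absence of a header row, and storing the file name without its extension are assumptions borrowed from the original LJSpeech layout, and the helper names `write_metadata` and `read_metadata` are hypothetical.

```python
from pathlib import Path


def write_metadata(pairs, path):
    """Write (wav file name, transcript) pairs as 'name|transcript' lines.

    Assumption: '|' separator and no header, as in the original LJSpeech
    metadata file. Transcripts must not contain '|' or newlines.
    """
    with open(path, "w", encoding="utf-8") as f:
        for wav_name, transcript in pairs:
            # Assumption: keep only the file stem, e.g. 'utt_0001'.
            f.write(f"{Path(wav_name).stem}|{transcript.strip()}\n")


def read_metadata(path):
    """Read the file back into a list of (name, transcript) tuples."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the first '|' only, so each row has two columns.
            name, transcript = line.split("|", 1)
            rows.append((name, transcript))
    return rows
```

UTF-8 is set explicitly because Vietnamese transcripts contain diacritics that the platform default encoding may not handle.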

Environment

Create an environment and install the packages listed in the requirements file. Note that torch should be installed to match your CUDA version. With conda:

cd Vietnamese-Speech-Recognition
conda create -n asr
conda activate asr
conda install --file requirements.txt

Also, you need to install ctcdecode:

git clone --recursive https://github.com/parlance/ctcdecode.git
cd ctcdecode && pip install . && cd ..
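
To give an idea of what a CTC beam search decoder computes, here is a small pure-Python sketch of CTC prefix beam search without a language model. It does not use the ctcdecode API and is not the decoder in this repo. The function name, the per-frame probability lists, and the toy alphabet are illustrative assumptions.

```python
from collections import defaultdict


def ctc_prefix_beam_search(probs, alphabet, blank=0, beam_width=4):
    """Decode per-frame probabilities with CTC prefix beam search (no LM).

    probs: list of frames, each a list of probabilities over the alphabet.
    alphabet: list of symbols; alphabet[blank] is the CTC blank.
    """
    # prefix -> (prob. of paths ending in blank, prob. of paths ending in non-blank)
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for idx, p in enumerate(frame):
                if idx == blank:
                    # A blank keeps the prefix unchanged.
                    next_beams[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == idx:
                    # Repeated symbol: it extends the prefix only after a blank,
                    # otherwise it collapses into the same prefix.
                    next_beams[prefix + (idx,)][1] += p_b * p
                    next_beams[prefix][1] += p_nb * p
                else:
                    next_beams[prefix + (idx,)][1] += (p_b + p_nb) * p
        # Keep only the most probable prefixes.
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(scores) for prefix, scores in ranked[:beam_width]}
    best = max(beams.items(), key=lambda kv: sum(kv[1]))[0]
    return "".join(alphabet[i] for i in best)
```

Unlike greedy decoding, this sums the probability of every alignment that collapses to the same text. With two frames of `[0.6, 0.4]` over `["_", "a"]`, the greedy path is blank-blank (an empty string), while the prefix search prefers "a" because its alignments together carry more probability.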

Tools

Training & Evaluation

To train the QuartzNet model, run:

python3 tools/train.py --config configs/config.yaml

To evaluate QuartzNet:

python3 tools/evaluate.py --config configs/config.yaml
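
ASR models are commonly scored with word error rate (WER). Whether tools/evaluate.py reports exactly this metric is an assumption. The sketch below shows how WER can be computed from a reference and a hypothesis transcript using word-level edit distance. The function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "tôi đi học" with the hypothesis "tôi học" gives one deletion over three reference words.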

Or, to fine-tune a wav2vec 2.0 model starting from a Vietnamese pretrained wav2vec 2.0 checkpoint:

python3 tools/fintune_w2v.py

Demo

I also provide a small Streamlit app for an ASR demo. To run it:

streamlit run demo/app.py

Results

I used wandb and TensorBoard to log results and artifacts during training. Here are some visualizations after several epochs:

Quartznet	W2v 2.0

References

Mainly based on this implementation
The paper
Vietnamese ASR - VietAI
Lightning-Flash repo
Tokenizer used from youtokentome
Language model KenLM

About

This repo aims to build a web app for speech recognition :smiley: It's simple to use and understand :smile:


MIT License
