CommonVoice-TH Recipe

A commonvoice-th recipe for training an ASR engine using Kaldi. This recipe follows the standard Kaldi commonvoice recipe with slight modifications.

Installation

The author uses Docker to run the training environment. A GPU is required to train the tdnn_chain model; without one, the script can only train up to tri3b.

Downloading Commonvoice Corpus

We will need the Common Voice corpus to train the ASR engine. We use Common Voice Corpus 7.0 in Thai, which can be downloaded here. Once downloaded, unzip it; we will later mount the dataset into the Docker container.
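A minimal sketch of this step, assuming you have already generated a personal download link on the Common Voice website (the URL and archive name below are placeholders):

$ wget -O cv-corpus-7.0-th.tar.gz "<your-generated-download-url>"  # placeholder URL
$ tar -xzf cv-corpus-7.0-th.tar.gz  # extracts the corpus, e.g. cv-corpus-7.0-<date>/th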

Downloading SRILM

Before building the Docker image, the SRILM source archive needs to be downloaded. You can download it from here. Once the file is downloaded, remove the version from its name (e.g. rename srilm-1.7.3.tar.gz to srilm.tar.gz) and place it inside the docker directory. Your docker directory should contain 2 files: dockerfile and srilm.tar.gz.
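For instance, assuming the archive was saved to ~/Downloads and you are in the repository root (both paths are assumptions):

$ mv ~/Downloads/srilm-1.7.3.tar.gz docker/srilm.tar.gz  # strip the version from the name
$ ls docker  # should list: dockerfile  srilm.tar.gz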

Building Docker for Training with Kaldi

Once you have prepared the SRILM file, you are ready to build the Docker image for training this recipe. The Dockerfile automatically installs the project's dependencies and stores them in an image. To build the image, run:

$ cd docker
$ docker build -t <docker-name> kaldi

Run docker and attach command line

Once the image has been built, all you have to do is attach interactively to its bash terminal via the following command:

$ docker run -it -v <path-to-repo>:/opt/kaldi/egs/commonvoice-th \
                 -v <path-to-repo>/labels:/mnt/labels \
                 -v <path-to-cv-corpus>:/mnt \
                 --gpus all --name <container-name> <built-docker-name> bash

Once you finish this step, you should be inside the Docker container's bash terminal.
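As a quick sanity check (a sketch only; the paths assume the mounts from the docker run command above, and nvidia-smi assumes the image ships the NVIDIA tools), you can verify that everything is visible inside the container:

$ ls /opt/kaldi/egs/commonvoice-th  # the mounted recipe repository
$ ls /mnt/labels                    # the mounted label files
$ nvidia-smi                        # confirms the GPU is exposed via --gpus all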

Building Docker for inferencing via Vosk

We also provide an example of how to run inference with a trained Kaldi model using Vosk. Before we begin, let's build the Vosk Docker image:

$ cd docker
$ docker build -t <docker-name> vosk-inference
$ cd ..  # back to root directory

Preparing Directories for Vosk Inferencing

The first step is to download the provided Vosk-format model from this repository's releases page and unzip it into the vosk-inference directory, or simply run the following commands:

$ cd vosk-inference
$ wget https://github.com/vistec-AI/commonvoice-th/releases/download/vosk-v1/model.zip
$ unzip model.zip

Run docker and test inference script

To prevent dependency problems, the Vosk inference Python script must be run inside the Docker image we just built. First, start a container:

$ docker run -it -v <path-to-repo>:/workspace \
                 --name <container-name> \
                 -p 8000:8000 \
                 <build-docker-name> bash

You will then be attached to a Linux terminal inside the container. To run inference on an audio file, run:

$ cd vosk-inference
$ python3.8 inference.py --wav-path <path-to-wav>  # test it with test.wav

Note that the audio file must have a 16kHz sampling rate and a single (mono) channel!
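If your audio does not meet these requirements, convert it first. A minimal sketch using sox (our choice for illustration; any resampler such as ffmpeg works equally well):

$ sox input.wav -r 16000 -c 1 test.wav  # resample to 16 kHz and downmix to mono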

Instantiating a Vosk Server to Process Audio Files

We also provide a FastAPI server that allows users to transcribe their own audio files via a RESTful API. To instantiate the server, run this command inside the Docker shell:

$ cd vosk-inference
$ uvicorn server:app --host 0.0.0.0 --reload

The server will now be running at http://localhost:8000. To see if the server is correctly instantiated, browse to http://localhost:8000/healthcheck. If the page loads, we are good to go!
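You can also check from the command line (assuming curl is installed on the host):

$ curl http://localhost:8000/healthcheck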

API Endpoint

The transcribe endpoint accepts multipart form-data, where each audio file is attached to a form field named audios. See the Python example below:

import requests

# Each audio file is attached to the multipart form field named "audios"
url = "http://localhost:8000/transcribe"

files = [
    ('audios', (<file-name>, open(<file-path>, 'rb'), 'audio/wav')),
    ...
]

response = requests.post(url, files=files)
print(response.text)
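An equivalent request with curl (the audios field name matches the server's form field; test.wav is a placeholder file name):

$ curl -X POST http://localhost:8000/transcribe \
       -F "audios=@test.wav;type=audio/wav"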

Online Decoding with WebRTC Protocol

Read more at this repository, which provides an easy way to deploy a Kaldi tdnn-chain model to a WebRTC server.

Usage

To run the training pipeline, go to the recipe directory and run the run.sh script:

$ cd /opt/kaldi/egs/commonvoice-th/s5
$ ./run.sh --stage 0
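If training is interrupted, you can resume from a later stage instead of restarting from scratch (stage numbers are defined inside run.sh; 3 below is only an illustrative value):

$ ./run.sh --stage 3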

Experiment Results

Here are some experiment results evaluated on the dev set:

Model               dev                 dev-unique
                    WER       CER       WER       CER
mono                79.13%    57.31%    77.79%    48.97%
tri1                56.55%    37.88%    53.26%    27.99%
tri2b               50.64%    32.85%    47.38%    21.89%
tri3b               50.52%    32.70%    47.06%    21.67%
tri4b               46.81%    29.47%    43.18%    18.05%
tdnn-chain          29.15%    14.96%    30.84%    8.75%
tdnn-chain-online   29.02%    14.64%    30.41%    8.28%

Here are the final test set results for the tdnn-chain model, compared against other ASR systems:

Model                           test                test-unique
                                WER       CER       WER       CER
tdnn-chain-online               9.71%     3.12%     23.04%    7.57%
airesearch/wav2vec2-xlsr-53-th  -         -         13.63%    2.81%
Google Web Speech API           -         -         13.71%    7.36%
Microsoft Bing Search API       -         -         12.58%    5.01%
Amazon Transcribe               -         -         21.86%    7.08%

Author

Chompakorn Chaksangchaichot
