vantuyenatoma / demo_vietasr

Vietnamese Speech Recognition

VietASR (NVIDIA NeMo ToolKit)

⚡ Some experiments with NeMo

Result

  • Model: QuartzNet, a smaller version of the Jasper model
  • The table lists the character error rate (CER) and word error rate (WER), with and without a language model (LM), on major Vietnamese ASR tasks.
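| Task         | CER (%) | WER (%) | +LM WER (%) |
|--------------|---------|---------|-------------|
| VIVOS (TEST) | 6.80    | 18.02   | 15.72       |
| VLSP2018     | 6.87    | 16.26   | N/A         |
| VLSP2020 T1  | 14.73   | 30.96   | N/A         |
| VLSP2020 T2  | 41.67   | 69.15   | N/A         |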
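
For readers unfamiliar with the metrics in the table, here is a minimal sketch of how CER and WER are commonly computed from a reference transcript and a hypothesis, using Levenshtein distance. The function names are illustrative and not part of this repository. Counting spaces as characters in CER is an assumption, and the scoring script used for the numbers above may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate (assumption: spaces count as characters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One wrong word out of four reference words
    print(wer("xin chào các bạn", "xin chao các bạn"))  # 0.25
```

Both rates are fractions here; multiply by 100 to get percentages as reported in the table.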

The model was trained on ~500 hours of Vietnamese speech collected from YouTube, radio, call-center recordings (8 kHz), text-to-speech data, and some public datasets (VLSP, VIVOS, FPT). It is a very small model (13M parameters), which makes inference very fast ⚡

Installation

  • ctcdecoder and KenLM for LM decoding
    pip install ds-ctcdecoder
  • Some Python libraries: torch, numpy, librosa, flask, flask_socketio, requests, ...
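
QuartzNet is trained with CTC, so its output is a per-frame score over the vocabulary plus a blank symbol. The ctcdecoder package above is needed only for beam search with a language model. As a point of comparison, the sketch below shows greedy CTC decoding, a common LM-free baseline; whether the "WER (%)" column was produced exactly this way is an assumption. The function name, the toy vocabulary, and the blank index are all hypothetical.

```python
def ctc_greedy_decode(frame_scores, vocab, blank_id):
    """Greedy CTC decoding: take the best label per frame,
    collapse consecutive repeats, then drop blanks."""
    # Index of the highest-scoring label in each frame
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_scores]

    decoded = []
    previous = None
    for label in best_path:
        # Emit a symbol only when the label changes and is not the blank
        if label != previous and label != blank_id:
            decoded.append(vocab[label])
        previous = label
    return "".join(decoded)


if __name__ == "__main__":
    vocab = ["c", "a", "t"]   # toy vocabulary (hypothetical)
    blank_id = 3              # blank placed after the vocabulary (assumption)
    scores = [
        [0.90, 0.05, 0.03, 0.02],  # c
        [0.80, 0.10, 0.05, 0.05],  # c (repeat, collapsed)
        [0.10, 0.10, 0.10, 0.70],  # blank
        [0.10, 0.70, 0.10, 0.10],  # a
        [0.10, 0.60, 0.20, 0.10],  # a (repeat, collapsed)
        [0.05, 0.05, 0.10, 0.80],  # blank
        [0.10, 0.10, 0.70, 0.10],  # t
    ]
    print(ctc_greedy_decode(scores, vocab, blank_id))  # cat
```

A beam-search decoder with KenLM replaces the per-frame argmax with a search over several candidate prefixes rescored by the language model, which is where the improvement in the "+LM" column comes from.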

Run Demo

TODO

  • Conformer Model
  • Transformer LM instead of kenlm
  • Data augmentation: speed, noise, pitch shift, time shift, ...
  • FastAPI
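
As a rough illustration of the data augmentation item, the sketch below implements three of the listed perturbations (noise, time shift, speed) on a raw waveform using only NumPy. This is not code from the repository: the function names and parameters are hypothetical, and the speed change uses simple linear interpolation, which also shifts pitch. A production pipeline would more likely use a dedicated resampler.

```python
import numpy as np


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def time_shift(signal, shift):
    """Shift by `shift` samples (positive delays the audio), padding with zeros."""
    shifted = np.zeros_like(signal)
    if shift > 0:
        shifted[shift:] = signal[:-shift]
    elif shift < 0:
        shifted[:shift] = signal[-shift:]
    else:
        shifted[:] = signal
    return shifted


def change_speed(signal, rate):
    """Resample by linear interpolation; rate > 1 makes the audio shorter and faster."""
    n_out = int(round(len(signal) / rate))
    new_positions = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_positions, np.arange(len(signal)), signal)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample_rate = 16000  # assumed sample rate for this toy example
    t = np.arange(sample_rate) / sample_rate
    wave = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz tone

    noisy = add_noise(wave, snr_db=10, rng=rng)
    delayed = time_shift(wave, shift=800)
    faster = change_speed(wave, rate=1.1)
    print(len(noisy), len(delayed), len(faster))
```

Applying these perturbations on the fly during training, with randomly drawn parameters, avoids storing augmented copies of the dataset.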
