charlesliucn/LanMIT

low-resource-languages language-modeling speech-recognition speech-to-text keyword-spotting kaldi-asr

Low-resourced Language Modeling based on Kaldi

This repository provides Kaldi users with scripts for language modeling, especially under low-resource conditions. The scripts are mainly based on the babel/s5d recipe in Kaldi's egs directory.

Most of the scripts are in babel/s5d and wsj/s5/steps.

The scripts are not yet well organized. Detailed usage documentation will be added later.

Main Contributions

- Data Augmentation
  - Text Preprocessing for Lexicon Generation
  - Vocabulary Expansion Based on Word Frequency
  - Data Selection Based on Multiple Criteria
- N-Gram Language Models Based on SRILM
  - Linear Interpolation for N-Gram Models
  - N-Gram Language Model for Rescoring
- LSTM Language Model Based on TensorFlow
  - Word Vector Pre-training for RNN/LSTM Language Model Training
  - LSTM Language Model for Rescoring
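
To illustrate the linear interpolation item above, here is a minimal Python sketch of the general idea. It is not taken from this repository's scripts, which may work differently. It assumes two models have already assigned a probability to each word of a held-out text. The function names (`interpolate`, `perplexity`, `estimate_weight`) and the probability values are hypothetical. The mixture weight is estimated with a few EM iterations, which is the usual way to pick a weight that lowers held-out perplexity.

```python
import math


def interpolate(p_a, p_b, lam):
    """Linear interpolation of two probabilities: lam * p_a + (1 - lam) * p_b."""
    return lam * p_a + (1.0 - lam) * p_b


def perplexity(probs):
    """Perplexity of a text given its per-word probabilities (all must be > 0)."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


def estimate_weight(stream_a, stream_b, iterations=20, lam=0.5):
    """Estimate the weight of model A with EM on held-out per-word probabilities."""
    for _ in range(iterations):
        # E-step: posterior probability that model A generated each word.
        posteriors = [
            lam * a / (lam * a + (1.0 - lam) * b)
            for a, b in zip(stream_a, stream_b)
        ]
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam


# Hypothetical per-word probabilities from an in-domain and a background model.
in_domain = [0.20, 0.05, 0.10, 0.01]
background = [0.02, 0.10, 0.10, 0.08]

lam = estimate_weight(in_domain, background)
mixed = [interpolate(a, b, lam) for a, b in zip(in_domain, background)]
print(f"weight={lam:.3f} perplexity={perplexity(mixed):.2f}")
```

In practice the per-word probabilities would come from the n-gram models themselves, and the estimated weight would then be used when merging them.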

Relevant Toolkits

XenC: an open-source tool for data selection in Natural Language Processing.
GloVe: Global Vectors for Word Representation.
SRILM: an Extensible Language Modeling Toolkit.
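
One widely used criterion for the kind of data selection that tools such as XenC perform is the cross-entropy difference between an in-domain and an out-of-domain language model. The Python sketch below illustrates that criterion only. It uses add-one smoothed unigram models as a stand-in for real n-gram models. All names and the toy sentences are hypothetical, and it does not reproduce XenC or this repository's scripts.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return (word counts, total token count) for a list of sentences."""
    counts = Counter(w for s in sentences for w in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-word cross-entropy (nats) under an add-one smoothed unigram model.

    Assumes the sentence contains at least one word.
    """
    counts, total = model
    words = sentence.split()
    log_prob = sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return -log_prob / len(words)


def select(candidates, in_domain, out_domain, keep):
    """Keep the candidates with the lowest cross-entropy difference."""
    vocab = {w for s in in_domain + out_domain + candidates for w in s.split()}
    m_in = train_unigram(in_domain)
    m_out = train_unigram(out_domain)

    def score(sentence):
        # Lower means closer to the in-domain data than to the out-of-domain data.
        return (cross_entropy(sentence, m_in, len(vocab))
                - cross_entropy(sentence, m_out, len(vocab)))

    return sorted(candidates, key=score)[:keep]


# Toy data, invented for illustration.
in_domain = ["call me tomorrow", "call the office"]
out_domain = ["stock prices fell", "the market fell today"]
candidates = ["call me today", "stock market prices"]

selected = select(candidates, in_domain, out_domain, keep=1)
print(selected)
```

The sentence that looks more like the in-domain text is ranked first. A real setup would typically tune the number of kept sentences against held-out perplexity.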

Contact

For questions, please e-mail charlesliutop@gmail.com.

For more information about the Kaldi Speech Recognition Toolkit, see Kaldi's official GitHub repository.

About

📖 LanMIT: A Toolkit for Improving Language Models in Low-resourced Speech Recognition based on Kaldi.

