charlesliucn / LanMIT

📖 LanMIT: A Toolkit for Improving Language Models in Low-resourced Speech Recognition based on Kaldi.

Low-resourced Language Modeling based on Kaldi

This repository provides Kaldi users with several scripts for language modeling, particularly under low-resource conditions. The scripts are mainly based on the babel/s5d recipe in Kaldi's egs directory.

Most of the scripts are in babel/s5d and wsj/s5/steps.

The scripts are not yet well organized. Detailed usage documentation will be added later.


Main Contributions

  • Data Augmentation
    • Text Preprocessing for Lexicon Generation
    • Vocabulary Expansion Based on Word Frequency
    • Data Selection Based on Multiple Criteria
  • N-Gram Language Models based on SRILM
    • Linear Interpolation for N-Gram models
    • N-Gram Language Model for Rescoring
  • LSTM Language Model Based on TensorFlow
    • Word Vectors Pre-training for RNN/LSTM Language Model Training
    • LSTM Language Model for Rescoring
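
To make the linear interpolation item above concrete, here is a minimal sketch in Python. It is not taken from the scripts in this repository and does not call SRILM. It assumes you already have, for each word of a held-out text, the probability assigned by two language models (the lists `probs_a` and `probs_b` are made-up illustration values), and it estimates the mixture weight with a few EM iterations. All function names are hypothetical.

```python
import math


def interpolate(p_a, p_b, lam):
    """Linearly interpolate two probabilities of the same word in the same context."""
    return lam * p_a + (1.0 - lam) * p_b


def perplexity(probs):
    """Perplexity of a text, given the probability assigned to each of its words."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


def estimate_lambda(probs_a, probs_b, lam=0.5, iterations=20):
    """Estimate the weight of model A on held-out data with EM.

    Each iteration computes the posterior that model A "generated" each word
    and sets the new weight to the average of those posteriors.
    """
    for _ in range(iterations):
        posteriors = [
            lam * pa / interpolate(pa, pb, lam)
            for pa, pb in zip(probs_a, probs_b)
        ]
        lam = sum(posteriors) / len(posteriors)
    return lam


# Made-up per-word probabilities on a four-word held-out text (assumption).
probs_a = [0.2, 0.01, 0.3, 0.05]   # e.g. a small in-domain model
probs_b = [0.05, 0.2, 0.02, 0.1]   # e.g. a larger out-of-domain model

best = estimate_lambda(probs_a, probs_b)
mixed = [interpolate(pa, pb, best) for pa, pb in zip(probs_a, probs_b)]

print(f"lambda = {best:.3f}")
print(f"perplexity A = {perplexity(probs_a):.2f}")
print(f"perplexity B = {perplexity(probs_b):.2f}")
print(f"perplexity mix = {perplexity(mixed):.2f}")
```

In this toy setting the interpolated model has lower held-out perplexity than either model alone, which is the usual motivation for mixing a small in-domain model with a larger general one.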

Relevant Toolkits

  • XenC: an open-source tool for data selection in Natural Language Processing.
  • GloVe: Global Vectors for Word Representation.
  • SRILM: an Extensible Language Modeling Toolkit.
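
As a rough illustration of the kind of data selection that tools such as XenC perform, the sketch below ranks candidate sentences by cross-entropy difference: the cross-entropy under an in-domain model minus the cross-entropy under a general model, where lower scores look more in-domain. It does not use XenC and is not the selection script from this repository. The add-one-smoothed unigram models, the toy sentences, and all function names are assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Add-one-smoothed unigram model over a fixed vocabulary."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}


def cross_entropy(sentence, model):
    """Average negative log2 probability per word of a non-empty sentence."""
    words = sentence.split()
    return -sum(math.log2(model[w]) for w in words) / len(words)


def rank_by_cross_entropy_difference(candidates, in_domain, general):
    """Return (score, sentence) pairs sorted from most to least in-domain-like."""
    vocab = {w for s in candidates + in_domain + general for w in s.split()}
    model_in = train_unigram(in_domain, vocab)
    model_out = train_unigram(general, vocab)
    scored = [
        (cross_entropy(s, model_in) - cross_entropy(s, model_out), s)
        for s in candidates
        if s.split()  # skip empty lines
    ]
    return sorted(scored)


# Toy corpora (made up for illustration).
in_domain = ["book a flight to boston", "i need a flight"]
general = ["the stock market fell today", "rain is expected today"]
candidates = ["the market fell", "a flight to denver"]

ranked = rank_by_cross_entropy_difference(candidates, in_domain, general)
for score, sentence in ranked:
    print(f"{score:+.3f}  {sentence}")
```

In practice one would keep the candidates whose score falls below a threshold tuned on held-out in-domain text, and would use n-gram models rather than unigrams.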

Contact

For any questions, please send an e-mail to charlesliutop@gmail.com.


For more information about the Kaldi speech recognition toolkit, see Kaldi's official GitHub repository.

License: Other

Languages

  • C++ 57.1%
  • Shell 20.6%
  • Python 11.2%
  • Perl 5.0%
  • C 2.1%
  • TeX 2.1%
  • Cuda 0.9%
  • HTML 0.6%
  • Makefile 0.3%
  • MATLAB 0.0%
  • Dockerfile 0.0%
  • sed 0.0%