peihuaining/mongolian-nlp

This repo contains a list of useful resources for Mongolian NLP. Feel free to contribute.

Datasets

DATASET LJSpeech-like male voice TTS dataset created from the Mongolian Bible
- used in tugstugi/pytorch-dc-tts
- use dl_and_preprop_dataset.py to download the audio files
DATASET LJSpeech-like Kalmyk (West Mongolian) female voice TTS dataset created from the Kalmyk Bible (2 hours)
DATASET Eduge news classification dataset provided by Bolorsoft LLC
- used to train the Eduge.mn production news classifier
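- 75K news articles in 9 categories: урлаг соёл (arts and culture), эдийн засаг (economy), эрүүл мэнд (health), хууль (law), улс төр (politics), спорт (sports), технологи (technology), боловсрол (education) and байгал орчин (environment)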
DATASET 11-11.mn government agency complaint dataset
- 80K entries in 5 categories: санал хүсэлт (suggestion), гомдол (complaint), шүүмжлэл (criticism), талархал (gratitude) and өргөдөл (petition)
DATASET online news corpus
- 700 million words
DATASET Digital Archive of Mongolian Newspapers 1990-1995 of the British Library
Common Crawl Mongolian dataset
opendata.burtgel.gov.mn
- DATASET 220K Mongolian personal names
- DATASET 90K Mongolian clan/family names
- DATASET 192K Mongolian company names
DATASET Mongolian provinces (aimags and sums) names
DATASET 195 country (with capital cities) names in Mongolian
DATASET 250 most frequent Mongolian words from Mongolian news, books and Wikipedia articles (670M words in total, 2M unique words)
- These words can also be used as stop words.
DATASET 500 Mongolian abbreviations
DATASET Mongolian NER dataset created from Mongolian politics and sport news
- 10K sentences annotated by tugstugi and enod using doccano
- 4 categories LOCATION (6453/1753), PERSON (2839/1698), ORGANIZATION (4453/1970) and MISC (3716/2617)
DATASET Mongolian POS dataset of the Mongolian National University
- 100k words
- used POS tagsets
DATASET Traditional Mongolian synthetic OCR dataset created from Mongolian song lyrics and dictionary
- 80K images
- no data augmentation is applied; to augment the data, use an external library such as albumentations.
DATASET Handwritten Mongolian Cyrillic Characters Database of the Mongolian University of Science and Technology
- 28x28 grayscale, 350K images
- dataset description
DATASET Mongolian Wordnet of the Mongolian National University
- 26875 words, 2979 glosses, 23665 synsets, 213 examples
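
The frequent-word list above is suggested as a source of stop words. A minimal sketch of that use follows. The five words in `STOP_WORDS` are illustrative placeholders rather than entries copied from the dataset, and `remove_stop_words` is a hypothetical helper name; in practice the set would be loaded from the downloaded word list.

```python
from typing import Iterable, List, Set

# Placeholder sample (assumption): replace with the 250 words from the dataset file.
STOP_WORDS: Set[str] = {"нь", "юм", "байна", "болон", "энэ"}


def remove_stop_words(tokens: Iterable[str],
                      stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Keep only tokens whose lowercased form is not in the stop-word set."""
    return [t for t in tokens if t.lower() not in stop_words]


# Whitespace splitting is a simplification; a real pipeline would use a tokenizer.
sentence = "Энэ нь Монгол хэлний жишээ өгүүлбэр юм"
filtered = remove_stop_words(sentence.split())
print(filtered)
```

Lowercasing before the lookup lets a sentence-initial word such as "Энэ" match its lowercase stop-word entry.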

Mongolian Text-to-Speech

PYTORCH tugstugi/pytorch-dc-tts
- DEMO Colab online demo
- DATASET LJSpeech-like male voice dataset created from the Mongolian Bible
TF tugstugi/Tacotron-2 fork of Rayhane-mamah/Tacotron-2 adapted for the Mongolian Bible dataset
- DEMO Colab online demo
- DEMO speaker adaptation Colab online demo for the former Mongolian president Elbegdorj. The Tacotron model trained on the 5-hour Mongolian Bible dataset was fine-tuned on a 10-minute dataset created from a speech by Elbegdorj.
PYTORCH Chimege TTS demo
- 1x female
- NVIDIA/tacotron2 + NVIDIA/waveglow
DEMO HMM TTS online demo of the Mongolian National University
- 1x male and 2x female voices
DEMO ~~Yet another HMM? TTS online demo from “Мон Спийч Ай Ти” LLC~~
- demo server is currently down
- 1x male and 1x female
- female voice samples
SAMPLES Tacotron2 TTS demo samples of Ikon.MN
- 1x female (35h)
- NVIDIA/tacotron2 + NVIDIA/waveglow
DEMO HMM-based TTS online demo of Inner Mongolia University
- 1x female
PRODUCT NVDA/HTS screen reader developed by Innovation Development Center for the blind
- 1x female (Mongolian National University voice)
PYTORCH/DEMO Kalmyk TTS demo (Kalmyk is a Mongolic language spoken in Russia)
- dataset created from the Kalmyk Bible (2 hours)
- NVIDIA/tacotron2 + NVIDIA/waveglow

Mongolian Language Model

MODEL 5-gram binary LM built with KenLM on an uncleaned 670M-word corpus.
- it can be used either with mozilla/DeepSpeech: ./generate_trie alphabet.txt mn_5gram.binary trie
- or in tugstugi/mongolian-speech-recognition
TF / PYTORCH tugstugi/mongolian-bert pretrained Mongolian BERT models
- trained by tugstugi, enod and sharavsambuu
- nabar sponsored 5x TPUs.
PYTORCH bayartsogt-ya/albert-mongolian pretrained Mongolian ALBERT
PYTORCH robertritz/NLP ULMFiT experiments

Mongolian Speech Recognition

PYTORCH tugstugi/mongolian-speech-recognition
- DEMO Chimege Speech Recognition
- a proprietary dataset is used
PRODUCT Chinese and traditional Mongolian voice input from aicloud.com
- direct link to the APK file
- seems to be working only for simple cases (or it works only for Southern Mongolian dialects...)
- same system but for Windows (according to someone, you have to register with a Chinese identity card to use it)
DEMO ~~Speech recognition of Inner Mongolia University~~
- seems to be non-functional
PRODUCT Huawei cloud ASR supports minority languages such as Mongolian, Tibetan, and Uyghur.
PRODUCT Google Cloud Speech-to-text
- 20% WER on a private test dataset of 3000 audio clips
PYTORCH Wav2Vec2 XLSR finetuned on Mongolian Common Voice
- DEMO Colab online demo
- 50% WER
PYTORCH Wav2Vec2 XLSR finetuned on the Kalmyk Bible dataset.
- DEMO Colab online demo
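
Several entries above report word error rate (WER). For readers comparing these figures, here is a minimal sketch of the usual definition: the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative assumptions, and the projects listed here may normalize text differently before scoring.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word and one extra word against a 4-word reference.
wer = word_error_rate("сайн байна уу та", "сайн байна та нар")
print(f"WER: {wer:.0%}")  # prints "WER: 50%"
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.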

Mongolian Script

DEMO Cyrillic to Mongolian script converter demo of Inner Mongolia University
DEMO Mongolian script OCR demo of Inner Mongolia University
PYTORCH tugstugi/bichig2cyrillic Mongolian script to Cyrillic (and back) converter
- DEMO Cyrillic to Mongolian Colab online demo
PYTORCH tugstugi/image2bichig Traditional Mongolian OCR using CRNN
- DEMO OCR Colab online demo
- DATASET Traditional Mongolian synthetic OCR dataset

Mongolian Text Classification

TF2 sharavsambuu/mongolian-text-classification
SKLEARN / DEMO simple SVM Colab notebook classifying the Eduge dataset with around 91% accuracy.
- SentencePiece model from tugstugi/mongolian-bert is used as the text tokenizer.

Mongolian Named Entity Recognition

DATASET Mongolian NER dataset created from Mongolian politics and sport news
- for more info see datasets
PYTORCH enod/mongolian-bert-ner BERT based Mongolian NER
- uses tugstugi/mongolian-bert Mongolian pre-trained BERT models
DEMO NER demo of the Mongolian National University

Misc

PYTORCH tugstugi/forced_aligner Mongolian forced alignment tool using Rayhane-mamah/Tacotron-2 and readbeyond/aeneas
- DEMO Colab online demo
TF2 Cyrillic transliteration Colab notebook sharavsambuu/cyrillic-mongolian-transliteration
DATASET 1M back-translated MN->EN sentence dataset download link
- sharavsambuu/english-mongolian-nmt-dataset-augmentation
DICTIONARY Digitized Mongolian dictionaries from the Center for Northeast Asian Studies of Tohoku University in Japan
- for usage see Digitizing the Mongolian Language: An Introduction to the Polyglot “Online Dictionaries and Full-text Search of Mongolian Languages and Written Manchu”
- it also includes IPA pronunciations for Mongolian words
