This repo will contain a list of useful resources for Mongolian NLP. Feel free to contribute.
DATASET
LJSpeech-like male voice TTS dataset created from the Mongolian Bible
- used in tugstugi/pytorch-dc-tts
- use dl_and_preprop_dataset.py to download the audio files
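For readers unfamiliar with the LJSpeech layout, the following is a minimal sketch of reading its pipe-delimited `metadata.csv` (`id|transcription|normalized transcription`). It assumes this dataset keeps the same layout after download; the clip ids and sample text are hypothetical placeholders, not taken from the dataset.

```python
import csv
import io


def read_metadata(fileobj):
    """Yield (clip_id, text) pairs from LJSpeech-style pipe-delimited metadata."""
    # LJSpeech does not quote fields, so quoting is disabled.
    reader = csv.reader(fileobj, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        clip_id = row[0]
        # Prefer the normalized transcription (third column) when present.
        text = row[2] if len(row) > 2 else row[1]
        yield clip_id, text


# Hypothetical sample standing in for a real metadata.csv file.
sample = io.StringIO(
    "clip_0001|сайн байна уу|сайн байна уу\n"
    "clip_0002|баярлалаа\n"
)
entries = list(read_metadata(sample))
print(entries)
```

With a real download, pass `open("metadata.csv", encoding="utf-8", newline="")` instead of the in-memory sample.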
DATASET
LJSpeech-like Kalmyk (West Mongolian) female voice TTS dataset created from the Kalmyk Bible (2 hours)
DATASET
Eduge news classification dataset provided by Bolorsoft LLC
- used to train the Eduge.mn production news classifier
- 75K news articles in 9 categories: урлаг соёл (arts and culture), эдийн засаг (economy), эрүүл мэнд (health), хууль (law), улс төр (politics), спорт (sports), технологи (technology), боловсрол (education), and байгал орчин (environment)
DATASET
11-11.mn government agency complaint dataset
- 80K entries in 5 categories: санал хүсэлт (suggestion), гомдол (complaint), шүүмжлэл (criticism), талархал (gratitude), and өргөдөл (petition)
DATASET
Online news corpus
- 700 million words
DATASET
Digital Archive of Mongolian Newspapers 1990-1995 of the British Library
- Common Crawl Mongolian dataset
- opendata.burtgel.gov.mn
DATASET
220K Mongolian personal names
DATASET
90K Mongolian clan/family names
DATASET
192K Mongolian company names
DATASET
Names of Mongolian provinces and districts (aimags and sums)
DATASET
Names of 195 countries (with capital cities) in Mongolian
DATASET
The 250 most frequent Mongolian words from Mongolian news, books and Wikipedia articles (670M words in total, 2M unique words)
- These words can also be used as stop words.
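As a rough illustration of the stop-word use case, the sketch below filters frequent words out of a whitespace-tokenized sentence. The five words in `STOP_WORDS` are hypothetical placeholders; in practice the set would be loaded from the 250-word list itself.

```python
# Hypothetical placeholder set; load the real 250-word list from the dataset instead.
STOP_WORDS = {"нь", "энэ", "юм", "байна", "ба"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop any token found in stop_words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stop_words]


filtered = remove_stop_words("Энэ нь сайн ном юм")
print(filtered)
```

Whitespace splitting keeps the sketch dependency-free; a real pipeline would likely strip punctuation or use a proper tokenizer first.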
DATASET
500 Mongolian abbreviations
DATASET
Mongolian NER dataset created from Mongolian politics and sports news
DATASET
Mongolian POS dataset of the Mongolian National University
- 100k words
- used POS tagsets
DATASET
Traditional Mongolian synthetic OCR dataset created from Mongolian song lyrics and a dictionary
- 80K images
- no data augmentation is applied; to augment the data, use an external library such as albumentations.
DATASET
Handwritten Mongolian Cyrillic Characters Database of the Mongolian University of Science and Technology
- 28x28 grayscale, 350k images
- dataset description
DATASET
Mongolian Wordnet of the Mongolian National University
- 26875 words, 2979 glosses, 23665 synsets, 213 examples
PYTORCH
tugstugi/pytorch-dc-tts
DEMO
Colab online demo
DATASET
LJSpeech-like male voice dataset created from the Mongolian Bible
TF
tugstugi/Tacotron-2, a fork of Rayhane-mamah/Tacotron-2 adapted for the Mongolian Bible dataset
DEMO
Colab online demo
DEMO
Speaker adaptation Colab online demo for the former Mongolian president Elbegdorj. The Tacotron model trained on the 5-hour Mongolian Bible dataset was fine-tuned on a 10-minute dataset created from a speech by Elbegdorj.
PYTORCH
Chimege TTS demo
- 1x female
- NVIDIA/tacotron2 + NVIDIA/waveglow
DEMO
HMM TTS online demo of the Mongolian National University
- 1x male and 2x female voices
DEMO
Yet another (HMM?) TTS online demo from “Мон Спийч Ай Ти” ХХК (Mon Speech IT LLC)
- demo server is currently down
- 1x male and 1x female
- female voice samples
SAMPLES
Tacotron2 TTS demo samples of Ikon.MN
- 1x female (35h)
- NVIDIA/tacotron2 + NVIDIA/waveglow
DEMO
HMM-based TTS online demo of the Inner Mongolian university
- 1x female
PRODUCT
NVDA/HTS screen reader developed by the Innovation Development Center for the Blind
- 1x female (Mongolian National University voice)
PYTORCH/DEMO
Kalmyk TTS demo (Kalmyk is a Mongolic language spoken in Russia)
- dataset created from the Kalmyk Bible (2 hours)
- NVIDIA/tacotron2 + NVIDIA/waveglow
MODEL
5-gram binary LM generated by KenLM on a 670M-word uncleaned corpus.
- it can be used either with mozilla/DeepSpeech:
./generate_trie alphabet.txt mn_5gram.binary trie
- or in tugstugi/mongolian-speech-recognition
TF/PYTORCH
tugstugi/mongolian-bert pretrained Mongolian BERT models
- trained by tugstugi, enod and sharavsambuu
- nabar sponsored 5x TPUs.
PYTORCH
bayartsogt-ya/albert-mongolian pretrained Mongolian ALBERT
PYTORCH
robertritz/NLP ULMFiT experiments
PYTORCH
tugstugi/mongolian-speech-recognition
DEMO
Chimege Speech Recognition
- a proprietary dataset is used
PRODUCT
Chinese and traditional Mongolian voice input from aicloud.com
DEMO
Speech recognition of the Inner Mongolian university
- seems to be non-functional
PRODUCT
Huawei cloud ASR supports minority languages such as Mongolian, Tibetan, and Uyghur.
PRODUCT
Google Cloud Speech-to-text
- 20% WER on a private test dataset of 3000 audio recordings
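The WER (word error rate) figures quoted in this list are the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The function below is a minimal sketch of that standard definition, not the evaluation script used for the numbers above; the function name and sample strings are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a 4-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.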
PYTORCH
Wav2Vec2 XLSR fine-tuned on Mongolian Common Voice
DEMO
Colab online demo
- 50% WER
PYTORCH
Wav2Vec2 XLSR fine-tuned on the Kalmyk Bible dataset.
DEMO
Colab online demo
DEMO
Cyrillic to Mongolian script converter demo of the Inner Mongolian university
DEMO
Mongolian script OCR demo of the Inner Mongolian university
PYTORCH
tugstugi/bichig2cyrillic Mongolian script to Cyrillic (and back) converter
PYTORCH
tugstugi/image2bichig Traditional Mongolian OCR using CRNN
TF2
sharavsambuu/mongolian-text-classification
SKLEARN/DEMO
Simple SVM Colab notebook classifying the Eduge dataset with around 91% accuracy.
- SentencePiece model from tugstugi/mongolian-bert is used as the text tokenizer.
DATASET
Mongolian NER dataset created from Mongolian politics and sports news
- for more info see datasets
PYTORCH
enod/mongolian-bert-ner BERT-based Mongolian NER
- uses the tugstugi/mongolian-bert pretrained Mongolian BERT models
DEMO
NER demo of the Mongolian National University
PYTORCH
tugstugi/forced_aligner Mongolian forced alignment tool using Rayhane-mamah/Tacotron-2 and readbeyond/aeneas
DEMO
Colab online demo
TF2
Cyrillic transliteration Colab notebook sharavsambuu/cyrillic-mongolian-transliteration
DATASET
1M back-translated MN->EN sentence dataset download link
DICTIONARY
Mongolian digitized dictionaries from the Center for Northeast Asian Studies of Tohoku University in Japan
- for usage see Digitizing the Mongolian Language: An Introduction to the Polyglot “Online Dictionaries and Full-text Search of Mongolian Languages and Written Manchu”
- it also includes IPA pronunciations for Mongolian words