NLP for Nepali
This repository contains a state-of-the-art tokenizer, language model, and classifier for Nepali, which is the official language of Nepal and one of the officially recognized languages of India.
Dataset
- Download the Nepali Wikipedia Articles Dataset (38,757 articles), which I scraped, cleaned, and used to train the language model
- Download the Nepali News Classification Dataset, which I scraped and used to train the classifier
Results
Language Model
Evaluated on a 30% validation set:
- Perplexity of language model: ~32
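For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token, so a value of ~32 corresponds to a mean cross-entropy of roughly 3.47 nats. The snippet below is a minimal sketch of that definition, not the evaluation code used in this repository; the function name `perplexity` and the example probabilities are made up for illustration.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative (made-up) data: a model that assigns probability 1/32
# to every correct token has a perplexity of about 32.
example_log_probs = [math.log(1 / 32)] * 5
print(perplexity(example_log_probs))
```

Intuitively, a perplexity of 32 means the model is, on average, about as uncertain as if it were choosing uniformly among 32 tokens at each step.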
Classifier
- Accuracy of classification model: ~97%
- Kappa score of classification model: ~0.96
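The two classifier metrics can be sketched as follows. This assumes the kappa reported above is Cohen's kappa, which corrects raw accuracy for agreement expected by chance; the README does not say which variant was used. The function names and the example labels are hypothetical and are not taken from the news dataset.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected by chance, from the marginal label frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1:
        # Only one label on both sides: agreement is trivially perfect.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up labels for illustration only.
y_true = ["sports", "sports", "politics", "politics"]
y_pred = ["sports", "sports", "politics", "sports"]
print(accuracy(y_true, y_pred), cohen_kappa(y_true, y_pred))
```

Kappa is useful alongside accuracy because a classifier can reach high accuracy on imbalanced classes by favouring the majority label, while kappa stays low in that case.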
Pretrained Language Model
Download the pretrained language model from here
Classifier
Download the classifier from here
Tokenizer
The tokenizer was trained using Google's SentencePiece
Download the trained model and vocabulary from here
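Once downloaded, the model file can be loaded with the `sentencepiece` Python package. The following is a minimal sketch, assuming the package is installed (`pip install sentencepiece`); the helper names `load_tokenizer` and `tokenize` are illustrative and not part of this repository.

```python
def load_tokenizer(model_path):
    """Load a trained SentencePiece model from disk."""
    # Imported lazily so this helper can be defined without the package installed.
    import sentencepiece as spm

    processor = spm.SentencePieceProcessor()
    processor.Load(model_path)
    return processor


def tokenize(processor, text):
    """Split text into subword pieces using the loaded model."""
    return processor.EncodeAsPieces(text)
```

For example, `tokenize(load_tokenizer("nepali_tokenizer.model"), "नेपाल एक सुन्दर देश हो")` would return a list of subword pieces. The file name `nepali_tokenizer.model` is a placeholder; use the path of the model file you downloaded.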