sks901 / nlp-for-hindi

State-of-the-art tokenizer, language model, and classifier for the Hindi language (spoken in the Indian subcontinent)

nlp-for-hindi

Dataset

Download the Wikipedia articles dataset (55,000 articles), which I scraped, cleaned, and trained the model on, from here.

Note: There are more than 1.5 lakh (150,000) Hindi Wikipedia articles, whose URLs you can find in the pickled object in the above folder. I chose to work with only 55,000 articles because of computational constraints.

Get the Hindi Movie Reviews dataset, which I scraped, cleaned, and used to train the classification model, from the repository path datasets-preparation/hindi-movie-review-dataset.

Thanks to nirantk for the BBC Hindi News dataset.

Results

Language Model
  • Perplexity of Language Model: ~36 (on 20% validation set)
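
For reference, perplexity is the exponential of the average per-token cross-entropy loss (in nats), so a perplexity of ~36 corresponds to a mean loss of roughly 3.58. The sketch below only illustrates that relationship; the function name and the loss values are made-up placeholders, not numbers or code from this repository.

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_losses) / len(token_losses))

# Hypothetical per-token losses (illustrative only, not taken from the notebooks).
example_losses = [3.2, 3.9, 3.6, 3.7, 3.5]
print(round(perplexity(example_losses), 1))

# Going the other way: the mean loss implied by a perplexity of 36.
print(round(math.log(36), 2))  # 3.58
```
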
Movie Review Classification Model
  • Accuracy of Movie Review classification model: ~53%
  • Kappa Score of Movie Review classification model: ~30

Note: The movie review classification dataset has 3 classes [Positive, Neutral, Negative], not 2. I settled for an accuracy of ~53% (better than the ~33% expected from uniform random guessing over 3 classes) because the dataset has only:

  • 335 Positive Examples
  • 270 Neutral Examples
  • 293 Negative Examples

which I think is too few to reach a higher accuracy.
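
To put the ~53% in context, the class counts above give two simple baselines: uniform random guessing and always predicting the largest class. This is a small arithmetic sketch using only the counts listed in this README; the variable names are illustrative.

```python
# Class counts as listed above.
class_counts = {"Positive": 335, "Neutral": 270, "Negative": 293}

total = sum(class_counts.values())
uniform_chance = 1 / len(class_counts)                    # guess each class equally often
majority_baseline = max(class_counts.values()) / total    # always predict "Positive"

print(f"total examples: {total}")                                    # 898
print(f"uniform random guess: {uniform_chance:.1%}")                 # 33.3%
print(f"always predict majority class: {majority_baseline:.1%}")     # 37.3%
```

Both baselines sit well below ~53%, so the classifier does learn something from the 898 examples, even though the dataset is small.
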

BBC News Classification Model
  • Accuracy of BBC News classification model: ~79%
  • Kappa Score of BBC News classification model: ~72
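
The kappa scores reported here are presumably Cohen's kappa (unweighted) expressed on a 0 to 100 scale; that is an assumption, since the README does not say which variant was used. Cohen's kappa corrects accuracy for agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e is the chance agreement computed from the label marginals. A minimal pure-Python sketch with made-up labels:

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    n = len(y_true)
    # Observed agreement: fraction of positions where the labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal frequencies, summed over classes.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels, for illustration only.
y_true = ["pos", "pos", "neu", "neg", "neg", "neu"]
y_pred = ["pos", "neu", "neu", "neg", "pos", "neu"]
print(round(cohen_kappa(y_true, y_pred), 2))
```
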

Note: nirantk did the previous SOTA work on a Hindi language model and achieved a perplexity of ~46. I achieved a better perplexity (~35), but the scores aren't directly comparable: he trained on Hindi Wikipedia dumps, whereas I scraped 55,000 articles and cleaned them with the scripts in datasets-preparation. One big reason I believe my results should be better is that I use sentencepiece for unsupervised tokenization, whereas nirantk used spaCy.

Pretrained Language Model

Download the pretrained language model from here

Classifier

Download the Movie Review classifier from here

Download the BBC News classifier from here

Tokenizer

The tokenizer was trained unsupervised using Google's sentencepiece.

Download the trained model and vocabulary from here
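
As a rough illustration of what unsupervised tokenizer training with sentencepiece can look like, here is a minimal sketch. It is not the exact script used in this repository: the helper names, the `hindi_sp` model prefix, the vocabulary size of 8000, the unigram model type, and the character coverage are all assumptions chosen for the example.

```python
from pathlib import Path

def write_corpus(sentences, corpus_path):
    """Write one non-empty sentence per line, the plain-text format SentencePiece trains on."""
    cleaned = [s.strip() for s in sentences if s.strip()]
    Path(corpus_path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)

def train_tokenizer(corpus_path, model_prefix="hindi_sp", vocab_size=8000):
    """Train a SentencePiece model on raw text and return a loaded processor.

    All hyperparameters here are illustrative assumptions, not the repo's settings.
    """
    import sentencepiece as spm  # third-party: pip install sentencepiece

    spm.SentencePieceTrainer.train(
        input=str(corpus_path),
        model_prefix=model_prefix,   # writes <prefix>.model and <prefix>.vocab
        vocab_size=vocab_size,
        model_type="unigram",
        character_coverage=1.0,
    )
    return spm.SentencePieceProcessor(model_file=f"{model_prefix}.model")

# Usage (needs a corpus large enough to support the requested vocabulary size):
# sp = train_tokenizer("hindi_wiki.txt")
# print(sp.encode("यह एक उदाहरण वाक्य है।", out_type=str))
```

Because sentencepiece learns subword units directly from raw text, no language-specific pre-tokenizer is needed, which is the property the note in the Results section relies on.
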

License:MIT License

