Praggie / hindi2vec

State-of-the-Art Language Modeling and Text Classification in Hindi Language

Home Page: http://nirantk.com/hindi2vec

hindi2vec

Results

We achieved a state-of-the-art perplexity of 46.81 for Hindi, compared to 40.68 for English (lower is better).

  • State of the art to the best of my knowledge, as of September 18, 2018
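
For readers unfamiliar with the metric, here is a minimal sketch of how perplexity is commonly computed: the exponential of the average negative log-likelihood per token. The function name and the toy numbers are assumptions for illustration, not code from the notebooks in this repo.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probabilities a model assigned
    to each target token: exp(mean negative log-likelihood)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical model that gives every correct token a probability of 1/50.
# Such a model is "as confused as" a uniform choice among 50 tokens.
log_probs = [math.log(1 / 50)] * 1000
print(perplexity(log_probs))
```

A lower value means the model spreads less probability mass over wrong tokens, which is why lower is better.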

Update: nlp-for-hindi uses sentencepiece instead of the word-based spaCy tokenizer that I use. Measured on sentencepiece tokens, the perplexity of that language model is ~35. I encourage you to check out that work as well.
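
Perplexities measured on different tokenizations are not directly comparable, because the average is taken over a different number of tokens. The sketch below shows one common way to put a per-token perplexity on a per-word footing for the same text. The function name and the token and word counts are made up for illustration and say nothing about either model's actual numbers.

```python
import math


def renormalize_perplexity(ppl_per_token, n_tokens, n_words):
    """Convert a per-token perplexity into a per-word perplexity
    for the same text, by keeping the total log-likelihood fixed."""
    total_nll = n_tokens * math.log(ppl_per_token)
    return math.exp(total_nll / n_words)


# Hypothetical counts: 1,300 subword tokens covering 1,000 words.
word_level = renormalize_perplexity(35.0, n_tokens=1300, n_words=1000)
print(word_level)
```

Because subword tokenizers usually split text into more pieces than a word-based tokenizer, the per-word figure is higher than the per-token one. Differences in vocabulary and unknown-token handling also affect the comparison and are not captured here.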

Downloads

TODO

  • Language modeling based on wikipedia dump
  • Release Language Models: Hindi Language Model
  • Create Text classification Datasets: BBC Hindi
  • Benchmark text classification with FastText
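
As a starting point for the FastText benchmark item, here is a minimal sketch that writes labeled examples in the plain-text format fastText's supervised mode reads by default: one example per line, with the label prefixed by `__label__`. The helper names, labels, and sentences are hypothetical; the actual BBC Hindi dataset layout may differ.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    clean_label = "_".join(label.split())  # labels must not contain spaces
    clean_text = " ".join(text.split())    # collapse newlines and extra spaces
    return f"__label__{clean_label} {clean_text}"


def write_fasttext_file(examples, path):
    """Write (label, text) pairs to a UTF-8 file, one example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for label, text in examples:
            f.write(to_fasttext_line(label, text) + "\n")


# Hypothetical examples, not taken from the BBC Hindi dataset.
examples = [
    ("sport", "भारत ने मैच जीता"),
    ("business", "बाज़ार में\nतेज़ी रही"),
]
for label, text in examples:
    print(to_fasttext_line(label, text))
```

A file written this way can then be passed to the fastText command-line tool or its Python bindings (for example `fasttext.train_supervised(input=path)`) to get a baseline classifier to compare against.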

Idea Dump

  • Change the custom head to perform transliteration instead of classification: Hindi script (Devanagari) to English script (Roman)
  • Multi-task learning (MTL) tasks for training and inference using custom heads
  • Text to Speech - using datasets from news recordings or Hindi subtitles of dubbed movies

FastAI Installation

This version of the notebook uses v0.7 of the fastai library, the version used in their Part 2 v2 course in Summer 2018. The best way to install it is via conda, as mentioned here

Special thanks to Jeremy, Rachel and other contributors to fastai. This work reproduces for Hindi what they did for English. Thanks to @cstorm125 for thai2vec, which inspired this work.

License: MIT License


Languages

Jupyter Notebook 97.2%, Python 2.8%