zarmeen92 / indic_tagger

Indian Language Tagger and Chunker (Hindi, Telugu, Tamil, Marathi, Punjabi, Kannada, Malayalam, Urdu, Bengali)

Indic Tagger

In this project, we build part-of-speech (POS) taggers and chunkers for Indian languages.

Languages supported: Telugu, Hindi, Tamil, Marathi, Punjabi, Kannada, Malayalam, Urdu, Bengali

If you reuse this software, please use the following citation:

@inproceedings{PVS:SPSAL2007,
  editor    = {P.V.S., Avinesh and Gali, Karthik},
  title     = {Proceedings of the Shallow Parsing for South Asian Languages (SPSAL) Workshop, held at IJCAI-07, Hyderabad, India},
  series    = {{SPSAL} Workshop Proceedings},
  month     = {January},
  year      = {2007},
  pages     = {21--24},
}

Training Data Statistics and System Performance (macro F1)

Language   # Words   # Sents   CRF POS   CRF Chunk
tel        347k      30k       93%       96%
hin        350k      16.3k     93%       97%
ben        298.3k    14.6k     84%       95%
pun        152.5k    5.6k      92%       98%
mar        207.9k    8.5k      89%       95%
urd        158.9k    7.6k      90%       96%
tam        337k      14.2k     88%       92%
mal        192k      11.4k     96%       95%
kan        294.3k    16.5k     90%       98%
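
The scores above are reported as macro F1, i.e. the unweighted mean of the per-tag F1 scores, so rare tags count as much as frequent ones. The following is a minimal sketch of that metric in plain Python, not the evaluation code used in this repository. The function name `macro_f1` and the toy tag sequences are illustrative assumptions.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1 over every tag seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for tag g
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[g] += 1      # g was missed

    tags = set(gold) | set(pred)
    scores = []
    for tag in tags:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        scores.append(2 * tp[tag] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy example (hypothetical tags, not taken from the training data).
gold = ["NN", "VM", "NN", "JJ"]
pred = ["NN", "VM", "JJ", "JJ"]
print(round(macro_f1(gold, pred), 3))  # 0.778
```

In the toy example, NN and JJ each score 2/3 and VM scores 1, so the macro average is 7/9.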

Install

pip install -r requirements

pip install git+https://github.com/irshadbhat/indic-tokenizer.git

Run

python pipeline.py -p predict -l tel -t pos -m crf -f txt -e utf -i input_file -o output_file

-l, --languages       select language (3-letter ISO 639 code)
                      {hin, ben, mal, pun, tel, tam, kan, mar, urd}
-t, --tag_type        pos, chunk, parse
-m, --model_type      crf, hmm, cnn, lstm
-f, --data_format     ssf, tnt, txt, conll
-e, --encoding        utf8, wx   (default: utf8)
-i, --input_file      <input-file>
-o, --output_file     <output-file>
-s, --sent_split      True/False (default: True)

python pipeline.py --help 

To Train:
python pipeline.py -p train -o outputs -l tel -t pos -m crf -e utf -f ssf

To Test spaCy Model:
python spacy_tagger_test.py -l te -t pos

ToDo List

  • Telugu, Hindi trained CRF models
  • Bengali, Punjabi, Marathi, Urdu, Tamil trained CRF models
  • Bug: UTF-8 error with the Malayalam and Kannada trained CRF models
  • Bug: Punjabi and Urdu training files don't have the "|" (or) end-of-sentence marker.
  • HMM trained
  • Maximum Entropy
  • Deep learning (CNN, LSTM, BI-LSTM)
  • Analysis: comparison w.r.t. other ML algorithms

About

License: Apache License 2.0
