Ekkalak-T / nlp_thai_resources

More than 30 collections of Thai Natural Language Processing libraries. Updated daily.

Thai Natural Language Processing (Thai NLP) Resource

A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora. Pull requests are always welcome.

Thai NLP Libraries

Thai Character Cluster

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| JTCC | Thai Character Cluster | Java | | GPL-3.0 | Wittawat |
| TCC | Thai Character Cluster | Python | | Apache 2.0 | Wannaphong |

Thai Soundex

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| LK82 + Udom83 | Thai Soundex | Python | | | Korakot |

Word Segmentation

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Swath | SWATH (Smart Word Analysis for THai) is a word segmentation program for Thai | C | Longest Matching, Maximal Matching, and Part-of-Speech Bigram. | GPL | CMU |
| Lexto | Lexto: Thai Lexeme Tokenizer | Java | | LGPL | NECTEC |
| | | Python 2 | | LGPL | Python2 Wrapper |
| | | Python 3 | | LGPL | Python3 Wrapper |
| Wordcut | Thai word breaker for Node.js | JavaScript, Node.JS | | LGPL-3.0 | veer66, github |
| wordcutpy | A simple Thai word tokenizer written in 1 Python file | Python 3 | | LGPL-3.0 | veer66, github |
| CutKum | Thai Word-Segmentation with Deep Learning in Tensorflow. RNN. | Python | 0.93 F-measure. | MIT | Pucktada, github |
| DeepCut | A Thai word tokenization library using Deep Neural Network. CNN. | Python | 0.988 F-measure. | MIT | rkcosmos, github |
| SynThai | Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM. | Python | 0.992 F-measure. | MIT | KenjiroAI, github |
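
Several of the dictionary-based tools above rely on longest matching. The following is a minimal sketch of that idea only, not the implementation used by Swath or any other listed library. The function name `longest_match_segment` and the four-word `toy_dictionary` are hypothetical and exist purely for illustration.

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right longest matching over a set of known words."""
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            # Unknown character: emit it as a single-character token.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, chosen to show a classic ambiguity.
toy_dictionary = {"ตา", "ตาก", "กลม", "ลม"}
tokens = longest_match_segment("ตากลม", toy_dictionary)
print(tokens)
```

The input can be read as either "ตา | กลม" or "ตาก | ลม". Greedy longest matching always commits to the longer first word, which is why maximal matching and statistical or neural approaches are listed as alternatives.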

Part of Speech Tagging (POS Tagging)

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Jitar+NAiST | A simple Trigram HMM part-of-speech tagger | Java | | | Ver66, Jitar + NAiST, 1 + NAiST, 2 |
| SynThai | Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM. | Python | 0.9163 F-measure. RNN. LSTM | MIT | KenjiroAI, github |

Named Entity Recognition

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Named Entity Tagging (Thai NEST) | Thai Named Entity tagging Specification and Tools | | | GPL | KINDML, SIIT, AIAT |

News Structure Tagging

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| News Structure Tagging Program | Thai News Structure Tagging Program | | Metadata tagging, Structure tagging, Automatic News Title Generation | GPL | AIAT |

Syntactic Parsing & Tools

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Chart-parser | Extract Syntactic Structure from POS Tagged Sentence. | C | | All rights reserved | Thanaruk T. (thanaruk@siit.tu.ac.th) |
| Grammar Processing | Labelled Brackets -> Context Free Grammars (CFGs) | Python | Transform and compute probability | | Thodsaporn C. |

Thai Word Embedding

| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| kobkrit-word-embedding | Tensorflow implementation of Thai word embedding | Python | Source code, Example, Word distance graph | LGPL | Kobkrit V. |

Dictionaries / Translation Pairs

| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| Transliteration Corpus | | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
| Lexitron | Open-source Thai-English Dictionary | | TH->EN, EN->TH | LGPL | NECTEC |

Downloadable Text Corpus

| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| ORCHID | | 30K sent. | Word Seg., POS Tagged. | CC BY-NC-SA 3.0 TH | NECTEC |
| InterBEST 2009/2010 | | 5M words | Word Seg. | CC BY-NC-SA 3.0 TH | NECTEC |
| Thai Wikipedia | Formal Articles | 1.49GB (~213.1 MB compressed) | XML | GFDL | WIKIPEDIA |
| TNC Top-5000 Words | Word frequency | 5,000 words | Frequency of Thai words in various genres, Excel | All rights reserved | CHULA |
| Click Bait Sentences | Thai Click Bait Sentence | 330 sent. (90.7KB) | | MIT | Wannaphongcom |
| Thai Sentimental Word List | Thai Sentimental Words List | 52KB | Words separated into Adj, V | MIT | Wannaphongcom |
| Prime Minister 29 | Prime Minister 29's Speech Sentences | 338KB | Word segmented, Named Entity Tagged | MIT | Wannaphongcom |

Web Query Text Corpus

| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| Thai National Corpus 2 | | 32M words. | Query text by genre, domain | All rights reserved | CHULA |
| Thai Medical Document | | 3,594 docs | Document and dynamic keyword map | All rights reserved | KINDML, SIIT |
| Southeast Asian Languages Library | Thai News, Web Text, Pop Music, Literature, Toponyms | 20M chars | Phrase around a search text | | SEALang |

Pre-trained Word Vectors

| Pre-trained Model | Description | Size | Dimensions | License | Link |
|---|---|---|---|---|---|
| fastText | Skip-Gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
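
As a rough illustration of how such vectors can be consumed, the sketch below parses fastText's plain-text `.vec` layout (a header line with the vocabulary size and dimension, then one word and its values per line) and compares two words by cosine similarity. It assumes the "Text Only" download uses that layout. The helper names `load_vec` and `cosine` are hypothetical, and the two 3-dimensional vectors are invented toy values standing in for the real 300-dimensional file.

```python
import io
import math


def load_vec(lines):
    """Parse fastText .vec text: '<count> <dim>' header, then 'word v1 ... vdim'."""
    lines = iter(lines)
    header = next(lines).split()
    dim = int(header[1])
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-in for a real file; with a download, pass
# open(path, encoding="utf-8") instead of the StringIO object.
sample = io.StringIO("2 3\nแมว 1.0 0.0 0.0\nสุนัข 1.0 1.0 0.0\n")
dim, vectors = load_vec(sample)
similarity = cosine(vectors["แมว"], vectors["สุนัข"])
print(dim, round(similarity, 4))
```

Parsing line by line keeps the sketch dependency-free; for the full model, a dedicated loader from an established library is likely to be faster and more memory-efficient.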

Not found? Try another Thai NLP awesome list or resource, such as this one:

http://aiat.in.th/resources/

Acknowledgements
