NLP Bahasa Indonesia Resources

This repository provides links to useful datasets and other resources for NLP in Bahasa Indonesia.

Last Update: 22 July 2020

Dictionary

Sentiment Words

Position / Degree Words

Root Words

I have combined the root word lists from all of the above repositories into a single list.
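
Merging several word lists usually comes down to normalizing each entry and dropping duplicates. The sketch below only illustrates that idea: the `merge_word_lists` helper and the sample words are hypothetical and are not taken from the repositories.

```python
def merge_word_lists(*word_lists):
    """Merge several word lists into one sorted, de-duplicated list."""
    merged = set()
    for words in word_lists:
        for word in words:
            cleaned = word.strip().lower()  # normalize case and whitespace
            if cleaned:                     # skip blank entries
                merged.add(cleaned)
    return sorted(merged)


# Hypothetical sample data standing in for the downloaded lists.
list_a = ["makan", "minum", "Tidur "]
list_b = ["tidur", "jalan", ""]

combined = merge_word_lists(list_a, list_b)
print(combined)
```

The same helper could be applied to the slang and stop word lists, since it makes no assumption about what kind of words it receives.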

Slang Words

I have combined the slang word dictionaries from all of the above repositories into a single dictionary.
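
A slang dictionary is typically used to normalize informal text before further processing. The following is a minimal sketch, assuming the dictionary maps each slang token to its formal form; the entries shown and the `normalize_slang` function are hypothetical and only stand in for the real combined dictionary.

```python
import re

# Hypothetical entries; a real dictionary would be loaded from the combined file.
slang_dict = {"gak": "tidak", "yg": "yang", "udah": "sudah", "dgn": "dengan"}


def normalize_slang(text, mapping):
    """Replace each slang token with its formal form; other tokens are kept."""
    # Split into word tokens and standalone punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return " ".join(mapping.get(tok, tok) for tok in tokens)


normalized = normalize_slang("Gue gak suka film yg itu", slang_dict)
print(normalized)
```

Words that are missing from the dictionary (such as "gue" here) pass through unchanged, so the quality of the result depends on dictionary coverage.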

Stop Words

I have combined the stop word lists from all of the above repositories into a single list.
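
Stop word lists are usually applied as a simple token filter. The sketch below assumes whitespace tokenization and a small hypothetical subset of stop words; neither the `remove_stop_words` function nor the sample set comes from the repositories.

```python
# Hypothetical subset of a stop word list.
stop_words = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}


def remove_stop_words(text, stop_words):
    """Return the lowercased tokens of `text` that are not stop words."""
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in stop_words]


filtered = remove_stop_words("Saya pergi ke pasar dan membeli buah", stop_words)
print(filtered)
```

Keeping the stop words in a `set` makes each membership check constant time, which matters once the list grows to hundreds of entries.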

Emoticon

Acronym

Indonesia Region

POS-Tagging

https://medium.com/@puspitakaban/pos-tagging-bahasa-indonesia-dengan-flair-nlp-c12e45542860
Manually Tagged Indonesian Corpus [Paper] [GitHub]

Pre-trained word embedding

Generate word or sentence embeddings using a pre-trained multilingual BERT model (https://colab.research.google.com/drive/1yFphU6PW9Uo6lmDly_ud9a6c4RCYlwdX#scrollTo=Zn0n2S-FWZih). Note: just change the model to 'bert-base-multilingual-uncased'.
https://github.com/meisaputri21/Indonesian-Twitter-Emotion-Dataset. [Paper]
https://github.com/Kyubyong/wordvectors
https://drive.google.com/uc?id=0B5YTktu2dOKKNUY1OWJORlZTcUU&export=download
https://github.com/deryrahman/word2vec-bahasa-indonesia
https://sites.google.com/site/rmyeid/projects/polyglot

Train Word Embeddings Yourself

Usable Library

Pujangga: Indonesian Natural Language Processing REST API. https://github.com/panggi/pujangga
Sastrawi Stemmer Bahasa Indonesia. https://github.com/sastrawi/sastrawi
NLP-ID. https://github.com/kumparan/nlp-id
MorphInd: Indonesian Morphological Analyzer. http://septinalarasati.com/morphind/
INDRA: Indonesian Resource Grammar. https://github.com/davidmoeljadi/INDRA
Typo Checker. https://github.com/mamat-rahmat/checker_id
Multilingual NLP Package. https://github.com/flairNLP/flair
spaCy [GitHub] [Tutorial]
https://github.com/yohanesgultom/nlp-experiments
https://github.com/yasirutomo/python-sentianalysis-id
https://github.com/riochr17/Analisis-Sentimen-ID
https://github.com/yusufsyaifudin/indonesia-ner

Topic Analysis

(Introduction to LSA & LDA). https://monkeylearn.com/blog/introduction-to-topic-modeling/
(Introduction to LDA w/ Code & Tips). https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
(Topic Modeling Methods Comparison Paper). https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
(Original LDA Paper). http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
(LDA Python Library). https://pypi.org/project/lda/; https://radimrehurek.com/gensim/models/ldamodel.html; https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
(Original CTM Paper). http://people.ee.duke.edu/~lcarin/Blei2005CTM.pdf
(CTM Python Library). https://pypi.org/project/tomotopy/; https://github.com/kzhai/PyCTM
(Gaussian LDA Paper). https://www.aclweb.org/anthology/P15-1077.pdf
(Gaussian LDA Library). https://github.com/rajarshd/Gaussian_LDA
(Temporal Topic Modeling Comparison Paper). https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
(TOT: A Non-Markov Continuous-Time Model of Topical Trends Paper). https://people.cs.umass.edu/~mccallum/papers/tot-kdd06s.pdf
(TOT Library). https://github.com/ahmaurya/topics_over_time
(Example of LDA in Bahasa Project Code). https://github.com/kirralabs/text-clustering

Translation

Sometimes our text contains English words that need to be translated. We can use the English word dictionary provided here together with the Google Translate API for Python.
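
One way to combine the two is to look up each token in the English word list and send only the matches to a translator. The sketch below is an assumption about how that could look: `english_words`, `translate_english_tokens`, and `stub_translate` are hypothetical names, and the stub dictionary stands in for a real translation call so the example runs offline.

```python
# Hypothetical data: a real setup would load a full English word list.
english_words = {"meeting", "deadline", "update"}

# Stand-in for a real translation service, used here so the sketch runs offline.
stub_translations = {"meeting": "rapat", "deadline": "tenggat", "update": "pembaruan"}


def stub_translate(word):
    """Hypothetical translator: look the word up in a small local table."""
    return stub_translations.get(word.lower(), word)


def translate_english_tokens(text, english_words, translate_word):
    """Translate only the tokens found in the English word list."""
    tokens = text.split()
    return " ".join(
        translate_word(tok) if tok.lower() in english_words else tok
        for tok in tokens
    )


translated = translate_english_tokens(
    "Besok ada meeting soal deadline proyek", english_words, stub_translate
)
print(translated)
```

Passing the translator in as a function keeps the detection logic independent of any particular translation library. Words spelled identically in both languages would need extra handling that this sketch does not attempt.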

Spelling Correction

You can adapt this code to a Bahasa Indonesia corpus to perform spelling correction.
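
The linked code is not reproduced here. As an illustration of the general idea, the sketch below uses a common frequency-based approach: generate every string one edit away from the input and pick the candidate seen most often in the corpus. Whether this matches the linked code is an assumption, and the tiny corpus is hypothetical; a real corrector would count words from a large Bahasa Indonesia corpus.

```python
import re
from collections import Counter

# Tiny hypothetical corpus; replace with a large Bahasa Indonesia text.
corpus = "saya makan nasi saya minum air kamu makan roti dia minum kopi"
word_counts = Counter(re.findall(r"[a-z]+", corpus.lower()))

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return all strings that are one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in word_counts:
        return word
    candidates = [w for w in edits1(word) if w in word_counts]
    if not candidates:
        return word
    return max(candidates, key=word_counts.get)


print(correct("mkan"))
print(correct("minun"))
```

With this corpus, "mkan" and "minun" each have exactly one known word within a single edit, so the choice is unambiguous. On a real corpus, ties and words two edits away would need additional handling.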

Twitter Scraping:

GetOldTweets3. https://github.com/Mottl/GetOldTweets3

Usage:

import GetOldTweets3 as got

# Build the search criteria: hashtag, date range, location, and language.
tweetCriteria = (
    got.manager.TweetCriteria()
    .setQuerySearch('#CoronaVirusIndonesia')
    .setSince("2020-01-01")
    .setUntil("2020-03-05")
    .setNear("Jakarta, Indonesia")
    .setLang("id")
)
tweets = got.manager.TweetManager.getTweets(tweetCriteria)

# Print the main attributes of each retrieved tweet.
for tweet in tweets:
    print(tweet.username)
    print(tweet.text)
    print(tweet.date)
    print(tweet.to)
    print(tweet.retweets)
    print(tweet.favorites)
    print(tweet.mentions)
    print(tweet.hashtags)
    print(tweet.geo)

Tweepy. http://docs.tweepy.org/en/latest/

Step-by-step guide to using Tweepy. https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1

Sign in to Twitter Developer. https://developer.twitter.com/en

Full list of Tweet object fields. https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

Increasing Tweepy’s standard API search limit. https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./

Other Resources:

https://github.com/irfnrdh/Awesome-Indonesia-NLP

About

A Curated List of Datasets and Usable Library Resources for NLP in Bahasa Indonesia

MIT License