NLP Bahasa Indonesia Resources

This repository provides links to useful datasets and other resources for NLP in Bahasa Indonesia.

Last Update: 10 Apr 2021

Dictionary

Sentiment Words

Position / Degree Words

Root Words

I have combined the root word lists from all of the above repositories into a single list.

Slang Words

I have combined the slang word dictionaries from all of the above repositories into a single dictionary.
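As a rough illustration of how such a combined dictionary might be built and used, the sketch below merges several slang-to-formal mappings and normalizes a sentence token by token. The sample entries and the function names `merge_slang_dicts` and `normalize_slang` are made up for this example; real entries would be loaded from the files in the repositories, whose formats may differ.

```python
def merge_slang_dicts(*dicts):
    """Merge slang -> formal mappings; the first source wins on conflicts."""
    merged = {}
    for mapping in dicts:
        for slang, formal in mapping.items():
            # Lowercase and trim so that lookups are case-insensitive.
            merged.setdefault(slang.strip().lower(), formal.strip().lower())
    return merged


def normalize_slang(text, slang_dict):
    """Replace each known slang token with its formal form."""
    tokens = text.lower().split()
    return " ".join(slang_dict.get(token, token) for token in tokens)


# Hypothetical sample entries standing in for two source dictionaries.
source_a = {"gak": "tidak", "bgt": "banget"}
source_b = {"yg": "yang", "gak": "enggak"}

slang_dict = merge_slang_dicts(source_a, source_b)
normalized = normalize_slang("Filmnya gak bagus bgt", slang_dict)
print(normalized)
```

Letting the first source win is only one possible conflict policy; when the sources disagree, a manual review of the conflicting entries is probably safer.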

Stop Words

I have combined the stop word lists from all of the above repositories into a single list.
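A minimal sketch of how several word lists could be combined and then applied for stop word removal is shown below. The word lists are tiny made-up samples and the helper names `merge_word_lists` and `remove_stop_words` are hypothetical; in practice each list would be read from a file in one of the repositories.

```python
def merge_word_lists(*word_lists):
    """Union several word lists into one sorted, de-duplicated list."""
    combined = set()
    for words in word_lists:
        # Normalize case and whitespace, and skip empty entries.
        combined.update(w.strip().lower() for w in words if w.strip())
    return sorted(combined)


def remove_stop_words(text, stop_words):
    """Return the lowercased tokens of `text` that are not stop words."""
    stop_set = set(stop_words)
    return [token for token in text.lower().split() if token not in stop_set]


# Hypothetical sample lists with an overlap, a duplicate, and an empty entry.
list_a = ["yang", "di", "dan"]
list_b = ["dan", "ke", "Yang ", ""]

stop_words = merge_word_lists(list_a, list_b)
tokens = remove_stop_words("Saya pergi ke pasar dan membeli buah", stop_words)
print(stop_words)
print(tokens)
```

The same merge helper would work for the root word lists, since those are also plain lists of words.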

Emoticon

Acronym

Indonesia Region

POS-Tagging Dataset

Question Answering Dataset

https://github.com/google-research-datasets/tydiqa

Hate-speech Dataset

https://github.com/okkyibrohim/id-multi-label-hate-speech-and-abusive-language-detection

Text Summarization Dataset

Paraphrase Dataset

https://github.com/Wikidepia/indonesian_datasets/tree/master/paraphrase/paws

Analogy Word Dataset

https://github.com/kata-ai/kawat

Formal-Informal Dataset

https://github.com/haryoa/stif-indonesia

Multilingual Parallel Dataset

Unsupervised Corpus

OSCAR. https://oscar-corpus.com/
Online Newspaper. https://github.com/feryandi/Dataset-Artikel
IndoNLU. https://huggingface.co/datasets/indonlu
http://data.statmt.org/cc-100/
https://huggingface.co/datasets/id_clickbait
https://huggingface.co/datasets/id_newspapers_2018
https://opus.nlpl.eu/QED.php

Voice-Text Dataset

Puisi & Pantun dataset

https://github.com/ilhamfp/puisi-pantun-generator

POS-Tagging

https://medium.com/@puspitakaban/pos-tagging-bahasa-indonesia-dengan-flair-nlp-c12e45542860
Manually Tagged Indonesian Corpus [Paper] [GitHub]

Pre-trained NLU Model

Indo-BERT. https://github.com/indobenchmark/indonlu & https://huggingface.co/indobenchmark/indobert-base-p1
Transformer-based Pre-trained Model in Bahasa. https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers
Generate word embeddings / sentence embeddings using a pre-trained multilingual BERT model. (https://colab.research.google.com/drive/1yFphU6PW9Uo6lmDly_ud9a6c4RCYlwdX#scrollTo=Zn0n2S-FWZih). P.S.: Just change the model name to 'bert-base-multilingual-uncased'.
https://github.com/meisaputri21/Indonesian-Twitter-Emotion-Dataset. [Paper]
https://github.com/Kyubyong/wordvectors
https://drive.google.com/uc?id=0B5YTktu2dOKKNUY1OWJORlZTcUU&export=download
https://github.com/deryrahman/word2vec-bahasa-indonesia
https://sites.google.com/site/rmyeid/projects/polyglot

Train Word Embeddings Yourself

Usable Library

Pujangga: Indonesian Natural Language Processing REST API. https://github.com/panggi/pujangga
Sastrawi Stemmer Bahasa Indonesia. https://github.com/sastrawi/sastrawi
NLP-ID. https://github.com/kumparan/nlp-id
MorphInd: Indonesian Morphological Analyzer. http://septinalarasati.com/morphind/
INDRA: Indonesian Resource Grammar. https://github.com/davidmoeljadi/INDRA
Typo Checker. https://github.com/mamat-rahmat/checker_id
Multilingual NLP Package. https://github.com/flairNLP/flair
spaCy [GitHub] [Tutorial]
https://github.com/yohanesgultom/nlp-experiments
https://github.com/yasirutomo/python-sentianalysis-id
https://github.com/riochr17/Analisis-Sentimen-ID
https://github.com/yusufsyaifudin/indonesia-ner

Translation

Sometimes a text contains English words that have to be translated. We can detect them with the English word dictionary provided here, and we can use the Google Translate API for Python to translate them.
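The idea can be sketched as follows: a token is treated as English when it appears in the English word list but not in an Indonesian word list, and only those tokens are passed to a translator. Everything here is an assumption made for illustration: the word sets are toy data, and `toy_translate` is a stand-in for a call to a real translation service, whose actual API is not shown.

```python
def translate_english_words(text, english_words, indonesian_words, translate):
    """Translate tokens that look English and are not also Indonesian words."""
    result = []
    for token in text.split():
        word = token.lower()
        if word in english_words and word not in indonesian_words:
            result.append(translate(word))
        else:
            # Keep Indonesian and unknown tokens exactly as written.
            result.append(token)
    return " ".join(result)


# Hypothetical toy data; "air" is both an English and an Indonesian word.
english_words = {"meeting", "deadline", "air"}
indonesian_words = {"air", "saya", "ada", "besok", "minum"}
toy_translations = {"meeting": "rapat", "deadline": "tenggat"}


def toy_translate(word):
    """Stand-in for a real translation call; returns the word if unknown."""
    return toy_translations.get(word, word)


translated = translate_english_words(
    "Saya ada meeting besok", english_words, indonesian_words, toy_translate
)
print(translated)
```

Checking against an Indonesian word list first avoids translating words such as "air" (water), which are spelled like English words but are valid Indonesian.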

Spelling Correction

You can adapt this code with a Bahasa Indonesia corpus to perform spelling correction.
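One common corpus-based approach, which is not necessarily the one used by the linked code, is to count word frequencies in a corpus and pick the most frequent known word within one edit of the misspelling. The sketch below shows that idea with a tiny made-up corpus; a real corrector would need a much larger Bahasa Indonesia corpus and would likely consider words two edits away as well.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def train(corpus_text):
    """Count how often each word occurs in the corpus."""
    return Counter(re.findall(r"[a-z]+", corpus_text.lower()))


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word within one edit, else the word itself."""
    word = word.lower()
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word
    return max(candidates, key=counts.get)


# Hypothetical toy corpus standing in for a real Bahasa Indonesia corpus.
corpus = "saya makan nasi goreng saya minum air saya makan roti"
counts = train(corpus)
print(correct("mkan", counts))
print(correct("nasu", counts))
```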

Topic Analysis

(Introduction to LSA & LDA). https://monkeylearn.com/blog/introduction-to-topic-modeling/
(Introduction to LDA w/ Code & Tips). https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
(Topic Modeling Methods Comparison Paper). https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
(Original LDA Paper). http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
(LDA Python Library). https://pypi.org/project/lda/; https://radimrehurek.com/gensim/models/ldamodel.html; https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
(Original CTM Paper). http://people.ee.duke.edu/~lcarin/Blei2005CTM.pdf
(CTM Python Library). https://pypi.org/project/tomotopy/; https://github.com/kzhai/PyCTM
(Gaussian LDA Paper). https://www.aclweb.org/anthology/P15-1077.pdf
(Gaussian LDA Library). https://github.com/rajarshd/Gaussian_LDA
(Temporal Topic Modeling Comparison Paper). https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf
(TOT: A Non-Markov Continuous-Time Model of Topical Trends Paper). https://people.cs.umass.edu/~mccallum/papers/tot-kdd06s.pdf
(TOT Library). https://github.com/ahmaurya/topics_over_time
(Example of LDA in Bahasa Project Code). https://github.com/kirralabs/text-clustering

Text Classification

Zero-shot Learning

(Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach) https://arxiv.org/pdf/1909.00161.pdf | https://github.com/yinwenpeng/BenchmarkingZeroShot
(Integrating Semantic Knowledge to Tackle Zero-shot Text Classification) https://arxiv.org/abs/1903.12626 | https://github.com/JingqingZ/KG4ZeroShotText
(Train Once, Test Anywhere: Zero-Shot Learning for Text Classification) https://arxiv.org/abs/1712.05972 | https://amitness.com/2020/05/zero-shot-text-classification/
(Zero-shot Text Classification With Generative Language Models) https://arxiv.org/abs/1912.10165 | https://amitness.com/2020/06/zero-shot-classification-via-generation/
(Zero-shot User Intent Detection via Capsule Neural Networks) https://arxiv.org/abs/1809.00385 | https://github.com/congyingxia/ZeroShotCapsule

Few-shot Learning

(Few-shot Text Classification with Distributional Signatures) https://arxiv.org/pdf/1908.06039.pdf | https://github.com/YujiaBao/Distributional-Signatures
(Few Shot Text Classification with a Human in the Loop) https://katbailey.github.io/talks/Few-shot%20text%20classification.pdf | https://github.com/katbailey/few-shot-text-classification
(Induction Networks for Few-Shot Text Classification) https://arxiv.org/pdf/1902.10482v2.pdf | https://github.com/zhongyuchen/few-shot-learning

Twitter Scraping:

GetOldTweets3. https://github.com/Mottl/GetOldTweets3

Usage:

import GetOldTweets3 as got

# Search Indonesian-language tweets near Jakarta within a date range.
tweetCriteria = (
    got.manager.TweetCriteria()
    .setQuerySearch('#CoronaVirusIndonesia')
    .setSince("2020-01-01")
    .setUntil("2020-03-05")
    .setNear("Jakarta, Indonesia")
    .setLang("id")
)
tweets = got.manager.TweetManager.getTweets(tweetCriteria)

# Print the attribute values of each tweet, not the attribute names.
for tweet in tweets:
    print(tweet.username)
    print(tweet.text)
    print(tweet.date)
    print(tweet.to)
    print(tweet.retweets)
    print(tweet.favorites)
    print(tweet.mentions)
    print(tweet.hashtags)
    print(tweet.geo)

Tweepy. http://docs.tweepy.org/en/latest/

Step-by-step how to use Tweepy. https://towardsdatascience.com/how-to-scrape-tweets-from-twitter-59287e20f0f1

Sign in to Twitter Developer. https://developer.twitter.com/en

Full List of Tweets Object. https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object

Increasing Tweepy’s standard API search limit. https://bhaskarvk.github.io/2015/01/how-to-use-twitters-search-rest-api-most-effectively./

Other Resources:

MIT License