gentaiscool / indonesian-nlp

A curated list of research papers and resources on Indonesian languages

Indonesian NLP Resources

This is a list of tutorials, workshops, talks, books, papers, and resources on computational linguistic approaches to research in Indonesian languages. The list will be updated over time. You are welcome to send a pull request to update the list and become a contributor! 🚀

📌 If you are working on anything related to Indonesian or any of Indonesia's local languages, don't hesitate to contact me or send a pull request!

📔 Books

  • Jan Wira Gotama Putra (2019) Pengenalan Konsep Pembelajaran Mesin dan Deep Learning (in Indonesian). [Book]

🔉 Talks

  • Bedah Paper Series by INACL (in Indonesian) [Video]

📑 Research Papers

Position / Survey

  • Aji, et al. (2022) One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia. ACL [Paper]

Datasets and Pretrained Models

Public Benchmark

  • Winata, et al. (2022) NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages. Preprint [Paper] [Benchmark]
  • Cahyawijaya, et al. (2021) IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation. EMNLP [Paper] [Benchmark] [Huggingface Models]
  • Wibowo, et al. (2021) IndoCollex: A Testbed for Morphological Transformation of Indonesian Colloquial Words. ACL Findings [Paper] [Benchmark]
  • Koto, et al. (2020) IndoLEM and IndoBERT: A benchmark dataset and pre-trained language model for Indonesian NLP. COLING [Paper] [Benchmark]
  • Fajri Koto and Ikhwan Koto (2020) Towards Computational Linguistics in Minangkabau Language: Studies on Sentiment Analysis and Machine Translation. PACLIC [Paper] [Benchmark]
  • Wilie, et al. (2020) IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding. AACL [Paper] [Benchmark] [Huggingface Models]

Language-Specific Model

  • Wongso, et al. (2022) Pre-Trained Transformer-Based Language Models for Sundanese. Journal of Big Data [Paper]

Morphology Analysis

  • Pimentel, et al. (2021) SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages. Workshop on Computational Research in Phonetics, Phonology, and Morphology [Paper] [Dataset]

POS Tagging

  • Devin Hoesen and Ayu Purwarianti (2018) Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger. International Conference on Asian Language Processing [Paper] [Benchmark]
  • Dinakaramani, et al. (2014) Designing an Indonesian Part of Speech Tagset and Manually Tagged Indonesian Corpus. International Conference on Asian Language Processing [Paper] [Dataset]

Named Entity Recognition

  • Devin Hoesen and Ayu Purwarianti (2018) Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger. International Conference on Asian Language Processing [Paper] [Benchmark]
  • Muhammad Fachri (2014) Named Entity Recognition for Indonesian Text using Hidden Markov Model. Undergraduate Thesis [Paper] [Dataset]
  • Alfina, et al. (2016) DBpedia Entities Expansion in Automatically Building Dataset for Indonesian NER. International Conference on Advanced Computer Science and Information Systems [Paper] [Dataset]

Word Sense Disambiguation

  • Mahendra, et al. (2018) Cross-Lingual and Supervised Learning Approach for Indonesian Word Sense Disambiguation Task. Global Wordnet Conference [Paper] [Dataset]

Constituency Parsing

  • Arwidarasti, et al. (2019) Converting an Indonesian Constituency Treebank to the Penn Treebank Format. International Conference on Asian Language Processing [Paper] [Dataset]
  • Moeljadi, et al. (2018) Building Cendana: a Treebank for Informal Indonesian. Global Wordnet Conference [Paper] [Dataset]
  • David Moeljadi (2017) Building JATI: A Treebank for Indonesian. Global Wordnet Conference [Paper] [Dataset]

Dependency Parsing

  • Zeman, et al. (2018) CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. CoNLL Shared Task [Paper] [Dataset]
  • McDonald, et al. (2013) Universal Dependency Annotation for Multilingual Parsing. ACL [Paper] [Dataset]

Coreference Resolution

  • Artari, et al. (2021) A Multi-Pass Sieve Coreference Resolution for Indonesian. RANLP [Paper] [Dataset]

Chatbot

Question Answering

  • Clark, et al. (2020) TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. TACL [Paper] [Dataset]
  • Purwarianti, et al. (2007) A Machine Learning Approach for Indonesian Question Answering System. RANLP [Paper] [Benchmark]

Summarization

  • Kemal Kurniawan and Samuel Louvan (2018) A New Benchmark Dataset for Indonesian Text Summarization. International Conference on Asian Language Processing [Paper] [Benchmark] [Dataset]
  • Koto, et al. (2020) A Large-scale Indonesian Dataset for Text Summarization. AACL [Paper] [Benchmark] [Dataset]

Keyphrase Extraction

  • Mahfuzh, et al. (2019) Improving Joint Layer RNN based Keyphrase Extraction by Using Syntactical Features. International Conference of Advanced Informatics: Concepts, Theory and Applications [Paper] [Benchmark]

Natural Language Inference

  • Mahendra, et al. (2021) IndoNLI: A Natural Language Inference Dataset for Indonesian. EMNLP [Paper] [Dataset]
  • Ken Nabila Setya and Rahmad Mahendra (2018) Semi-supervised Textual Entailment on Indonesian Wikipedia Data. International Conference on Computational Linguistics and Intelligent Text Processing [Paper] [Benchmark]

Sentiment Analysis

  • Ayu Purwarianti and Ida Ayu Putu Ari Crisdayanti (2019) Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph Vector. International Conference of Advanced Informatics: Concepts, Theory and Applications [Paper] [IndoNLU Benchmark] [NusaX Benchmark]
  • Azhar, et al. (2019) Multi-label Aspect Categorization with Convolutional Neural Networks and Extreme Gradient Boosting. International Conference on Electrical Engineering and Informatics [Paper] [Benchmark]
  • Ilmania, et al. (2018) Aspect Detection and Sentiment Classification Using Deep Neural Network for Indonesian Aspect-Based Sentiment Analysis. International Conference on Asian Language Processing [Paper] [Benchmark]

Emotion Classification

  • Saputri, et al. (2018) Emotion Classification on Indonesian Twitter Dataset. International Conference on Asian Language Processing [Paper] [Dataset]

Stance Detection

  • Jannati, et al. (2018) Stance Classification Towards Political Figures on Blog Writing. International Conference on Asian Language Processing [Paper] [Dataset]

Hate Speech Detection

  • Alfina, et al. (2017) Hate Speech Detection in the Indonesian Language: A Dataset and Preliminary Study. International Conference on Advanced Computer Science and Information Systems [Paper] [Dataset]
  • Muhammad Okky Ibrohim and Indra Budi (2018) A Dataset and Preliminaries Study for Abusive Language Detection in Indonesian Social Media. International Conference on Computer Science and Computational Intelligence [Paper] [Dataset]
  • Muhammad Okky Ibrohim and Indra Budi (2019) Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter. Workshop on Abusive Language Online [Paper] [Dataset]

Clickbait Detection

  • Andika William and Yunita Sari (2020) CLICK-ID: A Novel Dataset for Indonesian Clickbait Headlines. Data in Brief [Paper] [Dataset]

Style Transfer

  • Wibowo, et al. (2020) Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation. International Conference on Asian Language Processing [Paper] [Dataset]

🧪 Collaborative Project

IndoNLP is going to start collecting new datasets at https://github.com/orgs/IndoNLP. Submissions will open in mid-June 2022. Stay tuned!

License: Apache License 2.0