gentaiscool / indonlu

The first large-scale natural language processing benchmark for the Indonesian language. We provide multiple downstream tasks, pre-trained models, and starter code. (AACL-IJCNLP 2020)

Home Page: https://indobenchmark.com


IndoNLU


IndoNLU is a collection of Natural Language Understanding (NLU) resources for Bahasa Indonesia with 12 downstream tasks. We provide the code to reproduce the results, along with large pre-trained models (IndoBERT and IndoBERT-lite) trained on Indo4B, a corpus of around 4 billion words (more than 20 GB of text data). This project started as a joint collaboration between universities and industry, including Institut Teknologi Bandung, Universitas Multimedia Nusantara, The Hong Kong University of Science and Technology, Universitas Indonesia, Gojek, and Prosa.AI.

Research Paper

IndoNLU has been accepted at AACL-IJCNLP 2020, and you can find the details in our preprint: https://arxiv.org/abs/2009.05387. If you use any component of IndoNLU in your work, including Indo4B, FastText-Indo4B, or IndoBERT, please cite the following paper:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}

How to contribute to IndoNLU?

Be sure to check the contributing guidelines, and contact the maintainers or open an issue to gather feedback before starting your PR.

12 Downstream Tasks

  • You can check [Link]
  • We provide train, validation, and test sets. The labels of the test set are masked (no true labels) to preserve the integrity of the evaluation. Please submit your predictions to the submission portal at CodaLab.

Examples

  • A guide to loading the IndoBERT model and fine-tuning it on sequence classification and sequence tagging tasks.
  • You can check link
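As a rough illustration of the loading step, the sketch below uses the Hugging Face `transformers` library. The model identifier `indobenchmark/indobert-base-p1` and the helper names `load_indobert` and `embed` are assumptions made for this example and are not taken from the text above; check the models link for the identifiers that are actually published.

```python
def load_indobert(model_name="indobenchmark/indobert-base-p1"):
    """Load a tokenizer and encoder. model_name is an assumed Hub identifier."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def embed(text, tokenizer, model):
    """Return the last-layer hidden states for a single sentence."""
    import torch

    # encode() adds the special tokens and returns a list of token ids.
    input_ids = torch.LongTensor(tokenizer.encode(text)).view(1, -1)
    with torch.no_grad():
        outputs = model(input_ids)
    # First element of the output: (1, sequence_length, hidden_size)
    return outputs[0]
```

A call such as `embed("aku adalah anak", *load_indobert())` downloads the weights on first use. Fine-tuning for classification or tagging adds a task head on top of these hidden states, which is the part the linked guide covers.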

Submission Format

Please check the link. Each task has a different format. Every submission file starts with an index column (the id of each test sample, following the order of the masked test set).

To submit, first rename your prediction file to pred.txt, then zip it. After submitting, wait for the system to compute the results; you can follow the progress in your results tab.
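The packaging step can be sketched as follows. Only the `pred.txt` file name, the leading index column, and the zip step come from the description above; the comma-separated layout, the header row, the zero-based index, the `label` column name, and the `submission.zip` archive name are assumptions for illustration, so check the per-task format before using it.

```python
import csv
import zipfile
from pathlib import Path


def write_submission(predictions, out_dir=".", label_column="label"):
    """Write predictions to pred.txt and zip it.

    Assumed layout: comma-separated, header row, zero-based index column
    first, one prediction column. The real format differs per task.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    pred_path = out_dir / "pred.txt"
    with pred_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["index", label_column])
        # Rows follow the order of the masked test set.
        for index, label in enumerate(predictions):
            writer.writerow([index, label])

    # Hypothetical archive name; the file is stored at the archive root.
    zip_path = out_dir / "submission.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(pred_path, arcname="pred.txt")
    return zip_path
```

For example, `write_submission(["positive", "negative"], out_dir="out")` produces `out/submission.zip` containing a single `pred.txt`.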

Indo4B Dataset

We provide access to our large pretraining dataset. In this version, we exclude all tweets due to restrictions in the Twitter Developer Policy and Agreement.

  • Indo4B Dataset (23 GB uncompressed, 5.6 GB compressed) [Link]

IndoBERT and IndoBERT-lite Models

We provide 4 IndoBERT and 4 IndoBERT-lite pretrained language models [Link].

FastText (Indo4B)

We provide the full uncased FastText model file (11.9 GB) and the corresponding vector file (3.9 GB):

  • FastText model (11.9 GB) [Link]
  • Vector file (3.9 GB) [Link]
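A minimal sketch for reading the vector file with the standard library only, assuming it uses the usual fastText text format (a header line with the vocabulary size and dimension, then one word per line followed by its values). The function name `load_vectors` and the `limit` parameter are hypothetical and exist only for this example.

```python
def load_vectors(path, limit=None):
    """Read word vectors from a fastText-style text file.

    Assumes the first line is "<vocab_size> <dim>" and every other line is
    "<word> <v1> ... <v_dim>". Use limit to read only the first rows, since
    a multi-gigabyte file may not fit in memory as Python lists.
    """
    vectors = {}
    with open(path, encoding="utf-8", errors="ignore") as f:
        _, dim = map(int, f.readline().split())
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if len(values) != dim:
                continue  # skip malformed rows
            vectors[word] = [float(v) for v in values]
    return vectors, dim
```

For instance, `load_vectors(path, limit=100000)` keeps only the first 100,000 rows. Libraries such as gensim or the official `fasttext` package are better suited to the full model file.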

We also provide smaller FastText models, each with a reduced vocabulary, for each of the 12 downstream tasks.

Leaderboard

About


License: MIT License


Languages

Python 68.6%, Jupyter Notebook 27.6%, Shell 3.8%