KevinyWu / javanese-hate-speech

A Deep Learning Approach to Abusive Language and Hate Speech Detection for the Javanese Language

Abstract

This paper develops a deep learning approach to abusive language and hate speech detection using Javanese and Indonesian large language models (LLMs). We experiment on a Javanese Twitter dataset created by Putri et al., aiming to beat their best F-measure of 0.780. With a fine-tuned Javanese GPT-2 as the feature extractor for our classifier, the model achieves an F-measure of 0.811. Surprisingly, using an Indonesian GPT-2 as the feature extractor yields a higher F-measure of 0.854, potentially attributable to code-mixing in Javanese Twitter data or to the model’s training on colloquial language. This study further explores the nuances of hate speech detection in Javanese, emphasizing the effect of language and model choice.
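As a rough illustration of the two ideas in the abstract (a language model used as a feature extractor, and the F-measure used for evaluation), here is a minimal sketch. It is not the code from the notebooks: the pooling strategy (masked mean over token vectors), the binary form of the F-measure, and the function names `mean_pool` and `f_measure` are all assumptions made for this example, and plain NumPy arrays stand in for GPT-2 hidden states.

```python
import numpy as np


def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: (batch, seq_len, hidden) array standing in for the
        last-layer outputs of a GPT-2 model (assumption for this sketch).
    attention_mask: (batch, seq_len) array, 1 for real tokens, 0 for padding.
    Returns a (batch, hidden) array of fixed-size features for a classifier.
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against empty rows
    return summed / counts


def f_measure(y_true, y_pred, positive=1):
    """Binary F-measure: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy usage: one sequence with two real tokens and one padded position.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
features = mean_pool(hidden, mask)  # padded vector is excluded from the mean
score = f_measure([1, 1, 0, 0], [1, 0, 1, 0])
```

In a real pipeline, the pooled features would be fed to a classifier head; whether the notebooks pool this way or average the F-measure across classes is not stated here.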

Please see our paper.

Code

To run the code, follow these steps:

  1. Clone the repository
  2. Install the requirements in requirements.txt
  3. Run data_preparation.ipynb to clean and split the data
  4. Run javanese_experiments.ipynb to train and evaluate the models (GPU is recommended)
  5. See model_analysis.ipynb for further analysis of the best model, Indonesian GPT-2
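
The steps above could be scripted roughly as follows. This is a sketch under several assumptions: the clone URL is inferred from the repository name, the notebooks are assumed to sit in the repository root, and `git` and `jupyter nbconvert` are assumed to be available on the PATH. The helper names `build_commands` and `run_pipeline` are hypothetical and not part of the repository.

```python
import subprocess
import sys
from pathlib import Path

# Assumption: URL inferred from the repository name "KevinyWu / javanese-hate-speech".
REPO_URL = "https://github.com/KevinyWu/javanese-hate-speech.git"
REPO_DIR = Path("javanese-hate-speech")

# Notebooks in the order given by the steps (assumed to live in the repo root).
NOTEBOOKS = [
    "data_preparation.ipynb",      # clean and split the data
    "javanese_experiments.ipynb",  # train and evaluate (GPU recommended)
    "model_analysis.ipynb",        # analysis of the best model
]


def build_commands(repo_dir=REPO_DIR):
    """Return the shell commands for each step as argument lists."""
    commands = [
        ["git", "clone", REPO_URL, str(repo_dir)],
        [sys.executable, "-m", "pip", "install", "-r",
         str(repo_dir / "requirements.txt")],
    ]
    for notebook in NOTEBOOKS:
        commands.append([
            "jupyter", "nbconvert", "--to", "notebook",
            "--execute", "--inplace", str(repo_dir / notebook),
        ])
    return commands


def run_pipeline(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in build_commands():
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # stop on the first failure
```

Calling `run_pipeline()` only prints the commands; `run_pipeline(dry_run=False)` would actually execute them. Running the notebooks interactively in Jupyter, as the steps describe, remains the simpler route for inspecting intermediate results.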

About

License: MIT License


Languages

Jupyter Notebook 97.0%, Python 2.8%, TeX 0.2%