duonghuuphuc / hate-speech-detection

Resources for CSoNet-2021 paper: Detecting Hate Speech Contents Using Embedding Models


Detecting Hate Speech Contents Using Embedding Models

This repository accompanies the paper Detecting Hate Speech Contents Using Embedding Models. The paper has been submitted to the CSoNet 2021 conference and is under review. A public version of the paper will be available on arXiv soon.

Source Code

The src directory contains the source code as Jupyter notebooks (IPYNB format). The notebooks were originally created in Google Colab; you can either download them and edit them locally in Jupyter Notebook or run them in the Google Colab environment. Instructions are included in the notebooks.

Datasets

The data directory contains the three datasets that were used to evaluate the proposed model: HASOC-2019, HSOF-3, and HS2-2021. We note that:

  • The HASOC-2019 dataset has 5,853 training instances and 1,154 test instances; each instance is labeled as hate speech or not.
  • The HSOF-3 dataset has 24,802 instances and three labels, i.e., hate speech, offensive language, and neither.
  • The HS2-2021 dataset has 23,169 instances labeled as hate speech; the remaining 8,619 instances make up the rest of the dataset.
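
As a quick sanity check on the figures above, the short sketch below derives the totals and class shares that follow from the stated counts. It uses only the numbers listed here; the variable names are illustrative and do not come from the notebooks.

```python
# Instance counts exactly as stated in the dataset list above.
hasoc_train, hasoc_test = 5_853, 1_154
hsof_total = 24_802
hs2_hate, hs2_rest = 23_169, 8_619

# Derived totals.
hasoc_total = hasoc_train + hasoc_test
hs2_total = hs2_hate + hs2_rest

# Derived shares: how much of HASOC-2019 is held out for testing,
# and how much of HS2-2021 is labeled as hate speech.
hasoc_test_share = hasoc_test / hasoc_total
hs2_hate_share = hs2_hate / hs2_total

print(f"HASOC-2019 total instances: {hasoc_total}")
print(f"HASOC-2019 test share: {hasoc_test_share:.1%}")
print(f"HS2-2021 total instances: {hs2_total}")
print(f"HS2-2021 hate speech share: {hs2_hate_share:.1%}")
```

The hate speech share of HS2-2021 is well above one half, so accuracy alone may be a weak metric on that dataset.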

Hate Speech Dictionary

The hate speech dictionary is available in the dictionary directory, and the current version contains 766 terms.
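
This README does not describe the dictionary file format or how the notebooks use it. As one possible illustration only, the sketch below checks which terms from a term list occur in a text as whole words or phrases. The function name `find_dictionary_terms` and the placeholder terms are hypothetical and are not taken from the repository.

```python
import re


def find_dictionary_terms(text, terms):
    """Return the terms that occur in `text` as whole words or phrases.

    Matching is case-insensitive. This is an assumed usage of a term
    dictionary, not the method implemented in the notebooks.
    """
    lowered = text.lower()
    hits = []
    for term in terms:
        # Lookarounds prevent matches inside longer words.
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            hits.append(term)
    return hits


# Placeholder terms for illustration; the real dictionary contains 766 terms.
example_terms = ["badword", "very bad phrase"]
print(find_dictionary_terms("This contains a BadWord, sadly.", example_terms))
```

Whole-word matching is chosen here so that a term does not fire inside an unrelated longer word; whether that is appropriate depends on how the dictionary entries were collected.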

Model Parameters

We report the number of trainable parameters, in millions, for each experimental setup. Experiments 1, 2, and 3 use only word embeddings (WE) generated by the word2vec model. Experiments 4, 5, and 6 combine word embeddings with hate speech embeddings (HSE). We also fine-tune the BERTweet model for comparison. We consider three types of neural network models: multilayer perceptron (MLP), BiLSTM, and CNN.

#  Model                  HASOC-2019  HSOF-3  HS2-2021
1  WE + MLP               3.5         5.2     7.1
2  WE + CNN               4           5.7     7.6
3  WE + BiLSTM            3.7         5.4     7.3
4  [WE + HSE] + MLP       3.5         5.2     7.1
5  [WE + HSE] + CNN       4           5.7     7.6
6  [WE + HSE] + BiLSTM    3.7         5.4     7.3
7  BERTweet + Softmax     135         135     135
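
To show how a parameter count "in millions" like those in the table above is typically obtained, here is a minimal sketch that counts the trainable parameters of an embedding layer followed by an MLP. The vocabulary size, embedding dimension, and layer widths are hypothetical, and the embedding table is assumed to be trainable. None of these values come from the paper, so the result is not expected to match any table entry.

```python
def embedding_params(vocab_size, dim):
    """One trainable vector of length `dim` per vocabulary entry."""
    return vocab_size * dim


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes):
    """Sum the parameters of consecutive fully connected layers."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical configuration, chosen only for illustration.
vocab_size, embed_dim = 10_000, 300
layers = [embed_dim, 128, 2]  # input width, one hidden layer, two output classes

total = embedding_params(vocab_size, embed_dim) + mlp_params(layers)
print(f"{total / 1e6:.1f}M trainable parameters")
```

In this hypothetical configuration the embedding table accounts for almost all of the parameters, which is common for small text classifiers.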

Authors

  • Phuc H. Duong, Cuong C. Chung, Loc T. Vo (AI-LAB, Faculty of Information Technology, Ton Duc Thang University, Vietnam).
  • Hien T. Nguyen (Department of Economic Mathematics, Banking University of Ho Chi Minh City, Vietnam).
  • Dat Ngo (NewAI Research, Vietnam).

About

License: MIT License


Languages

Jupyter Notebook: 100.0%