ddindidu / Korean-Abusive-Language-Dataset

Translated abusive language datasets (En2Ko), including OffensEval/AbusEval, CADD, Davidson et al., and Waseem & Hovy.

Korean Abusive Language Dataset | 한국어 언어폭력 혐오표현 데이터셋

This repository contains abusive language datasets (AbuseEval, CADD, Davidson, Waseem) translated into Korean.
Keywords: abusive language, hate speech, offensive language, reddit, social media, korean dataset

We translated four benchmark datasets for abusive language detection from English into Korean and share them here. Their papers are listed in the references below.

  • AbuseEval (Caselli et al., 2020)
  • CADD (Song et al., 2021)
  • Davidson et al. (2017)
  • Waseem and Hovy (2016)

The original datasets have different formats and different columns, so we unified the data columns.
For each dataset we therefore share two types of files: one that keeps the original columns and one that contains only the unified columns. Both are described below.

1. origin_*.csv

This is the Korean version of each original dataset. It preserves the columns of the original dataset.

The columns of each dataset are:

  • AbuseEval = {'Dataset', 'Id', 'Context', 'Comment', 'Target', 'abuse'}

  • CADD = {'Dataset', 'Id', 'Context', 'Comment', 'Target', 'L.Type', 'L.Abusive', 'lAttack', 'L.Dem', 'L.Implicit', 'L.Profanity', 'lenComment', 'lenContext'}

  • Davidson = {'Dataset', 'Id', 'Context', 'Comment', 'Target', 'hate_speech', 'offensive_language', 'neither', 'class'}

  • Waseem = {'Dataset', 'Id', 'Context', 'Comment', 'Target', 'Annotation'}

Details

Five columns are shared by all datasets:

'Dataset': the name of the dataset
'Id'     : the Id of the text in the original dataset
'Context': the context text; an empty string ("") if the original data has no context
'Comment': the text to be classified
'Target' : the binary label we derived from the original labels in our work {abusive (1), not abusive (0)}

Other columns

These are the original labels; refer to the respective papers for details.
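
As a quick illustration of the layout above, the following sketch separates the five shared columns from the dataset-specific ones. The column names are copied from this README; the dictionary and the helper name `extra_columns` are hypothetical and only serve the example. The README lists the columns as sets, so the order of columns in the actual files is an assumption here.

```python
# Columns shared by every file, as listed in this README.
SHARED_COLUMNS = ["Dataset", "Id", "Context", "Comment", "Target"]

# Column lists for the origin_*.csv files, copied from this README.
# The order within each list is an assumption; the files may differ.
ORIGIN_COLUMNS = {
    "AbuseEval": SHARED_COLUMNS + ["abuse"],
    "CADD": SHARED_COLUMNS + [
        "L.Type", "L.Abusive", "lAttack", "L.Dem",
        "L.Implicit", "L.Profanity", "lenComment", "lenContext",
    ],
    "Davidson": SHARED_COLUMNS
    + ["hate_speech", "offensive_language", "neither", "class"],
    "Waseem": SHARED_COLUMNS + ["Annotation"],
}


def extra_columns(dataset_name):
    """Return the columns of an origin file that are not among the five shared ones."""
    return [c for c in ORIGIN_COLUMNS[dataset_name] if c not in SHARED_COLUMNS]


print(extra_columns("Waseem"))  # ['Annotation']
```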

2. model_*.csv

This is a dataset for training and testing a classification model.
It consists of only the 5 shared columns {'Dataset', 'Id', 'Context', 'Comment', 'Target'}.
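
A minimal sketch of reading such a file with Python's standard `csv` module is shown below. It assumes the file has a header row with the five column names and that 'Target' is stored as the text `0` or `1`; neither detail is confirmed by this README. The function name `read_model_rows` is hypothetical, and the sample rows are made-up placeholders, not data from the repository.

```python
import csv
import io

SHARED_COLUMNS = ["Dataset", "Id", "Context", "Comment", "Target"]


def read_model_rows(fileobj):
    """Read a model_*.csv-style file into a list of dicts with an integer Target."""
    reader = csv.DictReader(fileobj)
    # Fail early if any of the five shared columns is absent from the header.
    missing = [c for c in SHARED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        row["Target"] = int(row["Target"])     # 1 = abusive, 0 = not abusive
        row["Context"] = row["Context"] or ""  # no context is represented as ""
        rows.append(row)
    return rows


# Made-up placeholder rows, NOT taken from the dataset.
sample = io.StringIO(
    "Dataset,Id,Context,Comment,Target\n"
    "Example,1,,placeholder comment A,0\n"
    "Example,2,placeholder context,placeholder comment B,1\n"
)
rows = read_model_rows(sample)
print(len(rows), rows[1]["Target"])
```

To read a real file, pass an open file object instead, for example `open(path, newline="", encoding="utf-8")`, where `path` is a placeholder for wherever the CSV is stored and UTF-8 is an assumed encoding.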

Brief Tips🤗

1. Which file should I use?

If you want to train your model with original labels, use 'origin_*.csv' files.

If you want to train your model with {abusive, not abusive} labels, use 'model_*.csv' files.

We explain how we mapped the original labels to the binary labels in our paper (please see the reference below).

2. How is the data divided (train / test)?

  • For AbuseEval and CADD, there are 'origin_{train, valid, test}.csv' and 'model_{train, valid, test}.csv'.

  • Davidson and Waseem are not divided into {train, valid, test} sets in the original papers.
    If you need a split, you can divide the data at whatever ratio you want.
    In our paper, we used a ratio of 7 (train) : 1 (valid) : 2 (test).
    For Waseem, we shuffled the data before dividing it.
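
One possible way to produce a 7:1:2 split is sketched below. This is not necessarily the procedure used in the paper: the function name `split_7_1_2`, the fixed seed, and the rounding rule (any remainder goes to the test set) are assumptions made for the example. Shuffling is optional because the README only states that the Waseem data was shuffled.

```python
import random


def split_7_1_2(rows, shuffle=True, seed=0):
    """Split rows into (train, valid, test) at a 7:1:2 ratio."""
    rows = list(rows)  # copy so the caller's data is not reordered
    if shuffle:
        random.Random(seed).shuffle(rows)  # seeded for reproducibility
    n_train = len(rows) * 7 // 10
    n_valid = len(rows) // 10
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]  # whatever remains goes to test
    return train, valid, test


train, valid, test = split_7_1_2(range(100))
print(len(train), len(valid), len(test))  # 70 10 20
```

Pass `shuffle=False` to keep the original row order, and change `seed` to obtain a different shuffle.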

References
