ddhruvkr/Online_Abuse_Detection

Code for the paper "Online abuse detection: the value of preprocessing and neural attention models", presented at the NAACL Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA), June 2019.

Please cite this paper if you use our code or system output.

@inproceedings{kumar-etal-2019-online,
    title = "Online abuse detection: the value of preprocessing and neural attention models",
    author = "Kumar, Dhruv  and
      Cohen, Robin  and
      Golab, Lukasz",
    booktitle = "Proceedings of the Tenth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis",
    month = jun,
    year = "2019",
    address = "Minneapolis, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-1303",
    doi = "10.18653/v1/W19-1303",
    pages = "16--24",
}

UPDATE:

The table below reports results when using Glove+Elmo embeddings with the baseline BiRNN model. As expected, they improve on using Glove embeddings alone and give higher scores than those reported in the paper. Using only Elmo embeddings performed worse than the Glove+Elmo combination. The embeddings were not tuned, and all other parameters were kept the same. We therefore suggest using the Glove+Elmo embeddings in future work.

Dataset	Embedding	Minority P	Minority R	Minority F1	Overall F1
W-Tox	Glove(300)	83.49	78.69	81.02	89.47
W-Tox (CoAttn)	Glove(300)	83.67	79.42	81.49	89.76
W-Tox	Elmo(1024)	81.44	81.25	81.35	89.68
W-Tox	Glove+Elmo(256)	83.84	79.73	81.73	89.94
W-Tox	Glove+Elmo(1024)	83.55	81.31	82.41	90.29
W-At	Glove(300)	83.43	74.81	78.89	88.03
W-At (CoAttn)	Glove(300)	81.42	77.62	79.47	88.34
W-At	Elmo(1024)	82.35	76.45	79.29	88.27
W-At	Glove+Elmo(256)	84.02	75.84	79.72	88.71
W-At	Glove+Elmo(1024)	83.13	77.75	80.35	88.93
W-Ag	Glove(300)	82.32	73.37	77.59	87.22
W-Ag (Attn)	Glove(300)	81.57	75.13	78.22	87.49
W-Ag	Elmo(1024)	80.95	75.53	78.14	87.55
W-Ag	Glove+Elmo(256)	82.82	74.92	78.67	87.98
W-Ag	Glove+Elmo(1024)	82.70	76.07	79.25	88.21
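
The Minority F1 column appears to be the harmonic mean of the Minority P and Minority R columns. The short sketch below recomputes it for a few rows copied from the table, as a sanity check when adding new rows. The function name `f1_from_pr` is made up for this example and is not part of the repository. Overall F1 is not recomputed here because this README does not state which averaging scheme it uses.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (dataset, embedding, minority P, minority R, reported minority F1),
# copied from the results table.
rows = [
    ("W-Tox", "Glove(300)", 83.49, 78.69, 81.02),
    ("W-Tox", "Glove+Elmo(1024)", 83.55, 81.31, 82.41),
    ("W-Ag", "Glove(300)", 82.32, 73.37, 77.59),
]

for dataset, embedding, p, r, reported in rows:
    recomputed = round(f1_from_pr(p, r), 2)
    print(dataset, embedding, "recomputed:", recomputed, "reported:", reported)
```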

Put the TSV dataset files inside Data/Wikipedia/toxicity (and likewise Data/Wikipedia/attack and Data/Wikipedia/aggression for the other two datasets).

Also create the folder structure Embeddings/Glove/ outside this repository folder to hold the embeddings file.
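
The folder layout can be created by hand, or with a small script such as the sketch below. It makes two assumptions that the text above does not confirm: that the attack and aggression datasets get their own folders next to toxicity, and that "outside this folder" means a sibling of the repository directory. The helper name `create_layout` is made up for this example.

```python
from pathlib import Path


def create_layout(repo_dir):
    """Create the data and embedding folders; return the created paths."""
    repo_dir = Path(repo_dir)
    created = []

    # Assumption: one folder per dataset under Data/Wikipedia/.
    for name in ("toxicity", "attack", "aggression"):
        data_dir = repo_dir / "Data" / "Wikipedia" / name
        data_dir.mkdir(parents=True, exist_ok=True)
        created.append(data_dir)

    # Assumption: Embeddings/Glove/ sits next to the repository folder.
    glove_dir = repo_dir.parent / "Embeddings" / "Glove"
    glove_dir.mkdir(parents=True, exist_ok=True)
    created.append(glove_dir)

    return created


# Example call, run from the directory that contains the repository:
# create_layout("Online_Abuse_Detection")
```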

The code can be run with the following commands when using only the Glove embeddings. Replace main.py with main_elmo.py when using Glove+Elmo embeddings.

python main.py -embedding glove -dataset toxicity -emb_dim=300 -hidden_dim=150 -model CoAttn -lr=0.001 -epochs=3 -lstm_size=1 -batch_size=256 -sequence 175 -dropout_prob=0.1 -file_extension demo

python main.py -embedding glove -dataset attack -emb_dim=300 -hidden_dim=150 -model CoAttn -lr=0.001 -epochs=3 -lstm_size=1 -batch_size=200 -sequence 175 -dropout_prob=0.1 -file_extension demo

python main.py -embedding glove -dataset aggression -emb_dim=300 -hidden_dim=150 -model Attn -lr=0.001 -epochs=3 -lstm_size=1 -batch_size=200 -sequence 175 -dropout_prob=0.1 -file_extension demo
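
The three commands above differ only in the dataset, the model, and the batch size. As a convenience, the sketch below assembles the same command lines from a small table of per-dataset settings. The names `RUNS` and `build_command` are made up for this example; the flags and values are copied from the commands above.

```python
# Per-dataset settings copied from the three commands above.
RUNS = {
    "toxicity": {"model": "CoAttn", "batch_size": 256},
    "attack": {"model": "CoAttn", "batch_size": 200},
    "aggression": {"model": "Attn", "batch_size": 200},
}


def build_command(dataset, script="main.py"):
    """Return the argument list for one training run."""
    run = RUNS[dataset]
    return [
        "python", script,
        "-embedding", "glove",
        "-dataset", dataset,
        "-emb_dim=300",
        "-hidden_dim=150",
        "-model", run["model"],
        "-lr=0.001",
        "-epochs=3",
        "-lstm_size=1",
        f"-batch_size={run['batch_size']}",
        "-sequence", "175",
        "-dropout_prob=0.1",
        "-file_extension", "demo",
    ]


for name in RUNS:
    print(" ".join(build_command(name)))
```

The returned list can be passed to `subprocess.run` from the repository root to launch a run.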

Requirements:

PyTorch 1.0

Keras (TensorFlow 1.10; used only for padding, this dependency will be removed)

Ekphrasis

mosestokenizer

scikit-learn

nltk

TODO:

Padding still uses Keras functions. Replace them with PyTorch functions.
Possibly try BERT (may perform even better).
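
For the padding item, one possible replacement is sketched below. It assumes the code relies on the default behaviour of Keras `pad_sequences` (pad and truncate at the start of each sequence, pad value 0), which has not been checked against this repository. The function name `pad_sequences_torch` is made up for this example. PyTorch's built-in `torch.nn.utils.rnn.pad_sequence` pads at the end and up to the longest sequence in the batch by default, so it is not a drop-in substitute under that assumption.

```python
import torch


def pad_sequences_torch(sequences, maxlen, value=0):
    """Pad or truncate lists of token ids to a fixed length.

    Assumption: mirrors the Keras pad_sequences defaults, i.e. padding
    is added at the start and long sequences keep their last tokens.
    """
    batch = torch.full((len(sequences), maxlen), value, dtype=torch.long)
    for i, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays fully padded
        trimmed = seq[-maxlen:]  # keep the last maxlen tokens
        batch[i, maxlen - len(trimmed):] = torch.tensor(trimmed, dtype=torch.long)
    return batch


# Example with the sequence length used in the commands above (-sequence 175):
# padded = pad_sequences_torch(token_id_lists, maxlen=175)
```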
