WanzhengZhu / Euphemism

Self-Supervised Euphemism Detection and Identification for Content Moderation, IEEE S&P (Oakland) 2021

Home Page: https://arxiv.org/pdf/2103.16808.pdf

Python 3.7 License: MIT

Self-Supervised Euphemism Detection and Identification for Content Moderation

This repo is the Python 3 implementation of Self-Supervised Euphemism Detection and Identification for Content Moderation (42nd IEEE Symposium on Security and Privacy 2021).

Introduction

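This project addresses two tasks: euphemism detection, which finds words or phrases used as euphemisms for a given set of target keywords, and euphemism identification, which determines what each detected euphemism refers to.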

Requirements

The code is based on Python 3.7. Please install the dependencies as below:

pip install -r requirements.txt

Data

Due to licensing restrictions, we do not distribute the datasets ourselves; instead, we direct readers to their respective sources.

Drug:

Weapon:

Sexuality:

Sample:

  • Raw Text Corpus: we provide a sample dataset data/sample.txt for the readers to run the code.
  • Ground Truth: same as the Drug dataset (see data/euphemism_answer_drug.txt and data/target_keywords_drug.txt).
  • The sample dataset is only meant for trying out the code; results obtained on it are not reliable.
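Since the file formats are not described here, it can help to look at the first few lines of each file before running anything. The following is a minimal sketch; the helper name `preview` is hypothetical and not part of this repo, and it assumes only that the files are UTF-8 text.

```python
from itertools import islice


def preview(path, n=5):
    """Return the first n lines of a text file, without trailing newlines."""
    # errors="replace" keeps the preview going if a byte is not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Example usage from the repo root (paths as listed above):
# for path in ("data/sample.txt",
#              "data/euphemism_answer_drug.txt",
#              "data/target_keywords_drug.txt"):
#     print(path)
#     for line in preview(path, 3):
#         print("   ", line)
```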

Code

1. Fine-tune the BERT model.

Please refer to this link from Hugging Face to fine-tune a BERT on a raw text corpus.

You may download our BERT model, pre-trained on the Reddit text corpus (from the Drug dataset), here. Please unzip it and put it under data/.
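For orientation, masked-language-model fine-tuning on a raw text file can be sketched roughly as below. This is an illustration only, not necessarily the procedure behind the linked instructions or the released model. It assumes the `transformers` and `datasets` packages are installed, that the corpus holds one passage per line, and that `bert-base-uncased` is the starting checkpoint. The function names, the output directory and all hyperparameters are hypothetical.

```python
from pathlib import Path


def load_corpus_lines(path):
    """Return the non-empty, stripped lines of a raw text corpus."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def fine_tune_mlm(corpus_path, output_dir, base_model="bert-base-uncased", epochs=1):
    """Fine-tune a BERT checkpoint with the masked-language-model objective."""
    # Third-party imports are local so load_corpus_lines works without them.
    from datasets import Dataset
    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForMaskedLM.from_pretrained(base_model)

    # Tokenize each line; the collator pads batches and masks tokens at random.
    dataset = Dataset.from_dict({"text": load_corpus_lines(corpus_path)})
    tokenized = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True,
        remove_columns=["text"],
    )
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=16,
    )
    trainer = Trainer(
        model=model, args=args, train_dataset=tokenized, data_collator=collator
    )
    trainer.train()

    # Save model and tokenizer together so the directory can be reloaded later.
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)


# Example usage (the output directory name is an assumption):
# fine_tune_mlm("data/sample.txt", "data/bert_finetuned")
```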

2. Euphemism Detection and Euphemism Identification

python ./Main.py --dataset sample --target drug  

Other tunable arguments (c1, c2 and coarse) specify different classifiers for euphemism identification; see Main.py for their meanings.
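If you prefer to launch the run from a Python script (for example, to log the exact command used), a small wrapper along these lines may be convenient. It is only a sketch: `build_command` and `run` are hypothetical helpers, not part of this repo, and they pass only the two arguments shown in the command above.

```python
import subprocess
import sys


def build_command(dataset, target, script="./Main.py"):
    """Build the argument list for one Main.py run."""
    # sys.executable reuses the interpreter that is running this wrapper.
    return [sys.executable, script, "--dataset", dataset, "--target", target]


def run(dataset, target):
    """Run Main.py from the repo root; raises CalledProcessError on failure."""
    cmd = build_command(dataset, target)
    print("Running:", " ".join(cmd))
    return subprocess.run(cmd, check=True)


# Example usage, equivalent to the command above:
# run("sample", "drug")
```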

Baselines:

Please refer to baselines/README.md.

Acknowledgement

We use the code here for text classification in PyTorch.

Citation

@inproceedings{zhu2021selfsupervised,
    title = {Self-Supervised Euphemism Detection and Identification for Content Moderation},
    author = {Zhu, Wanzheng and Gong, Hongyu and Bansal, Rohan and Weinberg, Zachary and Christin, Nicolas and Fanti, Giulia and Bhat, Suma},
    booktitle = {42nd IEEE Symposium on Security and Privacy},
    year = {2021}
}
