birolkuyumcu / toxic_spans

Detect toxic spans in toxic texts

Toxic Spans Detection (SemEval 2021 Task 5)

The Toxic Spans Detection task concerns the evaluation of systems that detect the spans that make a text toxic, when detecting such spans is possible. Moderation is crucial to promoting healthy online discussion. Although several toxicity (a.k.a. abusive language) detection datasets (Wulczyn et al., 2017; Borkan et al., 2019) and models (Schmidt and Wiegand, 2017; Pavlopoulos et al., 2017b; Zampieri et al., 2019) have been released, most of them classify whole comments or documents and do not identify the spans that make a text toxic. Highlighting such toxic spans can assist human moderators (e.g., at news portals) who often deal with lengthy comments and who prefer attribution over an unexplained, system-generated toxicity score per post. Evaluating systems that can accurately locate toxic spans within a text is thus a crucial step towards successful semi-automated moderation.
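Concretely, a system outputs the set of toxic character offsets for each post, and submissions are scored with a character-offset F1 averaged over posts. Below is a minimal sketch of the per-post score (the helper name span_f1 is ours; see the repository's evaluation code for the official implementation):

```python
def span_f1(pred, gold):
    """Character-offset F1 between predicted and gold toxic spans for one post.

    pred, gold: iterables of character offsets (ints). When the gold set is
    empty, the score is 1.0 only if the prediction is empty too.
    """
    pred, gold = set(pred), set(gold)
    if not gold:
        return 1.0 if not pred else 0.0
    if not pred:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The leaderboard score is the mean of this value over all posts in the test set.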

See more about this task here or directly on our Codalab site.

  • In this repository you will find a notebook with code to prepare a valid submission.
  • Evaluation code and baseline methods are included.
  • The trial, train, and test data used in the 2021 SemEval challenge are also included (see the loading sketch below).
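
The released data pairs each post's text with its gold toxic character offsets. A minimal loading sketch, assuming the standard SemEval-2021 Task 5 file layout (the tsd_train.csv name and the spans/text column names are the usual task format, not verified against this repository's copy):

```python
import ast
import pandas as pd

# Assumed layout: a CSV with a "text" column and a "spans" column holding a
# Python-style list of toxic character offsets (e.g. "[3, 4, 5]").
df = pd.read_csv("tsd_train.csv")
df["spans"] = df["spans"].apply(ast.literal_eval)

# Recover the toxic characters of the first post from its offsets.
text, spans = df.loc[0, "text"], df.loc[0, "spans"]
print("".join(text[i] for i in spans))
```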

About

Detect toxic spans in toxic texts

License: Creative Commons Zero v1.0 Universal


Languages

Language: Jupyter Notebook 54.4%
Language: Python 45.6%