hariketsheth / Offensive-Span-Identification-in-Tamil---RANLP-2023

Content moderation is crucial for encouraging constructive online debate on social media sites. In this group project, participants are asked to build systems that identify offensive spans in code-mixed Tamil social media text.

Offensive Span Identification in Tamil @RANLP-2023

Offensive Language Detection in Dravidian Languages (Tamil)

| Faculty | Slot | Course | Course Code |
|---|---|---|---|
| Dr. Ratnavel Rajalakshmi | L33+L34 (G1 Slot) | Essentials of Data Analytics | CSE3506 |

| Name | Register Number | Branch |
|---|---|---|
| Hariket Sukesh Kumar Sheth (Team Leader) | 20BCE1975 | CSE Core |
| Manasvi Maheshwari | 20BAI1032 | CSE AI & ML |
| Suraj Shah | 20BRS1122 | CSE Robotics |



This repository contains all of the work completed for the Offensive Language Identification tasks that RANLP 2023 organised on CodaLab. To run these programs, you need the following:

  1. pytorch
  2. transformers
  3. sadice
  4. seaborn
  5. sklearn
  6. matplotlib
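
Before running the notebooks, it can help to confirm that each dependency is importable. The following is a minimal sketch using only the standard library. The import names in `REQUIRED_MODULES` are assumptions (for example, that pytorch is imported as `torch`), and `missing_packages` is a hypothetical helper, not part of this repository.

```python
from importlib.util import find_spec

# Listed package name -> assumed top-level import name.
REQUIRED_MODULES = {
    "pytorch": "torch",
    "transformers": "transformers",
    "sadice": "sadice",
    "seaborn": "seaborn",
    "sklearn": "sklearn",
    "matplotlib": "matplotlib",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the listed package names whose import module cannot be found."""
    return [name for name, module in required.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

Any package reported as missing needs to be installed before the training scripts will run.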

The pretrained transformers BERT, IndicBERT, and XLM-RoBERTa were used for the task of identifying offensive language. In addition to the original pretrained models, we used customised versions of them. The customised versions were created by freezing the base layers and adding a fully connected (fc) layer on top, trained with either nll_loss or the sadice loss as the custom loss function.
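
The customisation described above (frozen base layers plus a trainable fc layer trained with nll_loss) might look roughly like the sketch below. It is not the repository's actual code. To stay self-contained it uses a small stand-in encoder instead of a pretrained transformer, and the class name `FrozenEncoderClassifier`, the layer sizes, the dummy data, and the inverse-frequency class weights are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenEncoderClassifier(nn.Module):
    """Frozen base encoder with a trainable fully connected head (hypothetical)."""

    def __init__(self, encoder: nn.Module, hidden_size: int, num_labels: int):
        super().__init__()
        self.encoder = encoder
        # Freeze every parameter of the base model so only the head is trained.
        for param in self.encoder.parameters():
            param.requires_grad = False
        self.fc = nn.Linear(hidden_size, num_labels)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # No gradients are needed through the frozen encoder.
        with torch.no_grad():
            pooled = self.encoder(features)
        # Log-probabilities, the input format F.nll_loss expects.
        return F.log_softmax(self.fc(pooled), dim=-1)


torch.manual_seed(0)

# Stand-in encoder (assumption); in practice this would be a pretrained
# transformer that returns a pooled sentence representation.
encoder = nn.Sequential(nn.Linear(16, 32), nn.Tanh())
model = FrozenEncoderClassifier(encoder, hidden_size=32, num_labels=2)

features = torch.randn(8, 16)                    # dummy batch of 8 examples
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])  # imbalanced dummy labels

# Inverse-frequency class weights (an assumed weighting scheme).
counts = torch.bincount(labels, minlength=2).float()
class_weights = counts.sum() / (2 * counts)

# Only the parameters that still require gradients (the fc head) are optimised.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)

log_probs = model(features)
loss = F.nll_loss(log_probs, labels, weight=class_weights)
loss.backward()
optimizer.step()
```

In this sketch the rarer class receives a larger weight, so errors on it contribute more to the loss. For the sadice variant, the loss call would be swapped for the loss provided by the sadice package.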

To reproduce the results, clone this repository, set your dataset path in the training scripts, and run them.


Our results for the Offensive Language Identification Task

Table: Results on Offensive Language Development Dataset

| Model Name | Accuracy |
|---|---|
| mBERT Cased | 0.76 |
| XLMR | 0.76 |
| IndicBERT | 0.74 |
| XLMR with NLL Loss and Class Weights | 0.64 |
| XLMR with Sadice Loss | 0.61 |
| mBERT with Sadice Loss | 0.61 |
| mBERT with NLL Loss and Class Weights | 0.58 |

Table: Results on Offensive Language Test Dataset

| Model Name | Accuracy |
|---|---|
| mBERT Cased | 0.75 |
| XLMR | 0.75 |
| IndicBERT | 0.73 |
| XLMR with NLL Loss and Class Weights | 0.64 |
| XLMR with Sadice Loss | 0.61 |
| mBERT with Sadice Loss | 0.61 |
| mBERT with NLL Loss and Class Weights | 0.59 |
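
For reference, the accuracy reported above is the fraction of examples whose predicted label matches the gold label. A minimal sketch follows. The `accuracy` helper and the label strings `"OFF"` and `"NOT"` are hypothetical and are not taken from the dataset or the repository.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the gold labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty label list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical gold and predicted labels for four examples.
gold = ["OFF", "NOT", "NOT", "OFF"]
pred = ["OFF", "NOT", "OFF", "OFF"]
print(accuracy(gold, pred))  # 0.75 (3 of 4 predictions match)
```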


