bert data-visualization huggingface indicbert nlp roberta-model ranlp2023

Offensive Span Identification in Tamil @RANLP-2023

Offensive Language Detection in Dravidian Languages (Tamil)

Faculty	Slot	Course	Course Code
Dr. Ratnavel Rajalakshmi	L33+L34 (G1 Slot)	Essentials of Data Analytics	CSE3506

Name	Register Number	Branch
Hariket Sukesh Kumar Sheth (Team Leader)	20BCE1975	CSE Core
Manasvi Maheshwari	20BAI1032	CSE AI & ML
Suraj Shah	20BRS1122	CSE Robotics

This repository contains all of the work completed for the Offensive Language Identification tasks that RANLP 2023 organised on CodaLab. To run the programs, you need the following packages:

pytorch
transformers
sadice
seaborn
sklearn
matplotlib

We used the pretrained transformers BERT, IndicBERT, and XLM-RoBERTa to identify offensive language. In addition to the original pretrained models, we used customised versions of them. Each customised version was created by freezing the base layers and adding a fully connected (fc) layer on top, trained with either nll_loss (with class weights) or the custom sadice loss.
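The customisation described above (frozen base layers, a trainable fc layer, and a class-weighted NLL loss) might look roughly like the following PyTorch sketch. It is an illustration under assumptions, not the code from the train scripts: `TinyEncoder` is a hypothetical stand-in so the example runs without downloading a model, `FrozenEncoderClassifier` and `inverse_frequency_weights` are names invented here, and the inverse-frequency weighting is only one common way to choose class weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained transformer (runs offline).

    In practice this could be a Hugging Face model, for example
    transformers.AutoModel.from_pretrained("xlm-roberta-base").
    """

    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids):
        # Mean-pool token embeddings into one vector per sequence.
        return self.embed(input_ids).mean(dim=1)


class FrozenEncoderClassifier(nn.Module):
    """Frozen encoder with a trainable fully connected layer on top."""

    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder
        # Freeze the base layers: their weights receive no gradient updates.
        for param in self.encoder.parameters():
            param.requires_grad = False
        self.fc = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids):
        with torch.no_grad():
            pooled = self.encoder(input_ids)
        # Log-probabilities are the input that F.nll_loss expects.
        return F.log_softmax(self.fc(pooled), dim=-1)


def inverse_frequency_weights(labels, num_labels):
    """Assumed weighting scheme: weight_c = N / (num_labels * count_c)."""
    counts = torch.bincount(labels, minlength=num_labels).float()
    return counts.sum() / (num_labels * counts.clamp(min=1))


torch.manual_seed(0)
model = FrozenEncoderClassifier(TinyEncoder(), hidden_size=16, num_labels=2)
input_ids = torch.randint(0, 100, (8, 12))       # 8 dummy sequences of 12 token ids
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])  # imbalanced dummy labels

weights = inverse_frequency_weights(labels, num_labels=2)
log_probs = model(input_ids)
loss = F.nll_loss(log_probs, labels, weight=weights)
loss.backward()

# Only the fc layer is optimised; the encoder stays fixed.
optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
optimizer.step()
```

Passing only `model.fc.parameters()` to the optimiser keeps the update restricted to the new layer, which matches the idea of training a small head on top of fixed pretrained features. The sadice variant would swap `F.nll_loss` for the loss provided by the sadice package.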

To reproduce the results, clone this repository and set your dataset path in the train scripts before running them.

Our results for the Offensive Language Identification Task

Table: Results on Offensive Language Development Dataset

Model Name	Accuracy
mBERT Cased	0.76
XLMR	0.76
IndicBERT	0.74
XLMR with NLL Loss and Class Weights	0.64
XLMR with Sadice Loss	0.61
mBERT with Sadice Loss	0.61
mBERT with NLL Loss and Class Weights	0.58

Table: Results on Offensive Language Test Dataset

Model Name	Accuracy
mBERT Cased	0.75
XLMR	0.75
IndicBERT	0.73
XLMR with NLL Loss and Class Weights	0.64
XLMR with Sadice Loss	0.61
mBERT with Sadice Loss	0.61
mBERT with NLL Loss and Class Weights	0.59

About

Content moderation on social media sites is crucial for encouraging constructive online debate. In this group project, participants are asked to build systems that identify offensive spans in code-mixed Tamil social media text.
