nlp4if

Attribute Classification of COVID-19-Related Tweets Based on Natural Language Processing Models (Student Research Training Program)

Introduction

Our work is based on the NLP4IF workshop shared task on Fighting the COVID-19 Infodemic.

The main task is to predict seven binary attributes of a COVID-19-related tweet, each framed as a question. The first, sixth and seventh questions ask whether the tweet contains a Verifiable Factual Claim, whether it is Harmful to Society, and whether it Requires Attention. The second, third, fourth and fifth questions depend on the first: if the tweet makes a factual claim, we further judge whether it contains False Information, whether it is of Interest to the General Public, whether it is Harmful, and whether it Needs Verification.

This is a multi-task problem with dependencies between tasks.
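The dependency between tasks can be illustrated with a small post-processing step. This is only a sketch: the question order (index 0 for the first question) and the choice to mark the second to fifth answers as not applicable when the first is negative are assumptions for illustration, not a description of the training code in this repository.

```python
from typing import List, Optional

# Assumed layout: index 0 is Q1 (verifiable factual claim);
# indices 1..4 are Q2-Q5, which only apply when Q1 is answered "yes".
DEPENDENT_QUESTIONS = (1, 2, 3, 4)


def apply_dependency(preds: List[int]) -> List[Optional[int]]:
    """Mark Q2-Q5 as not applicable (None) when Q1 is predicted as 0."""
    if len(preds) != 7:
        raise ValueError("expected seven binary predictions")
    out: List[Optional[int]] = list(preds)
    if preds[0] == 0:
        for i in DEPENDENT_QUESTIONS:
            out[i] = None  # not applicable without a factual claim
    return out
```

Q6 and Q7 are left untouched because they do not depend on the first question.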

The dataset includes tweets in English, Bulgarian and Arabic.

Example of raw data

Because the data are real tweets, they inevitably contain emoji and URLs, which complicates data preprocessing.
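One common way to handle this is to replace URLs and emoji with placeholder tokens before tokenization. The sketch below is an assumption about how such cleaning might look, not the preprocessing used in this project; the placeholder tokens and the emoji code point ranges are illustrative and not exhaustive.

```python
import re

# Matches http(s) links and bare "www." links.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji coverage (assumption): pictographs, misc symbols, dingbats,
# and the emoji variation selector.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def clean_tweet(text: str, url_token: str = "<url>", emoji_token: str = "<emoji>") -> str:
    """Replace URLs and emoji with placeholder tokens and collapse whitespace."""
    text = URL_RE.sub(url_token, text)
    text = EMOJI_RE.sub(f" {emoji_token} ", text)
    return " ".join(text.split())
```

Whether to replace, drop, or keep emoji is a design choice; some multilingual tokenizers handle them directly.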

Pipeline

Our baseline is inspired by the design and ideas in "Multi Output Learning using Task Wise Attention for Predicting Binary Properties of Tweets: Shared-Task-On-Fighting the COVID-19 Infodemic", which ranked second in the competition at that time. We then made further improvements to the training pipeline.

Data Augmentation

In the data preprocessing stage, we augment the training data by translating the training sets of the different languages into one another while keeping the original labels.
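The idea can be sketched as follows. The `translate` callable is hypothetical and stands in for whatever translation service or model is available; the data layout (a dict from language code to a list of text and label pairs) is also an assumption made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, List[int]]  # (tweet text, seven binary labels)


def augment_by_translation(
    data: Dict[str, List[Example]],
    translate: Callable[[str, str, str], str],  # hypothetical: (text, src, tgt) -> text
) -> Dict[str, List[Example]]:
    """Add translated copies of every other language's examples to each
    language's training set, leaving the labels unchanged."""
    augmented = {lang: list(examples) for lang, examples in data.items()}
    for src, examples in data.items():
        for tgt in data:
            if tgt == src:
                continue
            for text, labels in examples:
                augmented[tgt].append((translate(text, src, tgt), list(labels)))
    return augmented
```

With three languages, each training set grows by the combined size of the other two.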

Pre-training

BERT, RoBERTa and XLM-RoBERTa models are used in the pre-training stage.

Classifier

BiLSTM+Attn, TextCNN and MultiHead Attn models are used as the classifier.

Loss Function

The original paper weights all tasks uniformly in the loss function; we improve on this with a loss function that uses biased weights.
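The difference between uniform and biased weights can be sketched with a per-task weighted binary cross-entropy. This is a minimal illustration in plain Python; the normalization by the total weight and the example weight values are assumptions, not the weights used in our experiments.

```python
import math
from typing import Sequence


def weighted_bce(
    probs: Sequence[float],    # predicted probability of "yes" per task
    labels: Sequence[int],     # gold 0/1 label per task
    weights: Sequence[float],  # per-task weight; all equal gives the uniform case
) -> float:
    """Weighted average of per-task binary cross-entropy."""
    eps = 1e-12
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += w * -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)
```

Raising the weight of a task makes errors on that task contribute more to the total loss, which is one way to emphasize harder or more important questions.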

Ensemble

Finally, we propose a voting mechanism with two schemes: All vote and Top6 vote.
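A minimal sketch of the two schemes follows. It assumes that voting is a per-question majority vote over the models' binary predictions, that ties are resolved in favor of 1, and that "Top6" means the six models with the highest validation score; these details are assumptions for illustration.

```python
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Per-question majority vote over models; ties go to 1 (assumption)."""
    n_models = len(predictions)
    n_questions = len(predictions[0])
    return [
        1 if 2 * sum(model[q] for model in predictions) >= n_models else 0
        for q in range(n_questions)
    ]


def top_k_vote(
    predictions: Sequence[Sequence[int]],
    scores: Sequence[float],  # e.g. validation mean F1 per model (assumption)
    k: int = 6,
) -> List[int]:
    """Majority vote restricted to the k best-scoring models."""
    ranked = sorted(range(len(predictions)), key=lambda i: scores[i], reverse=True)
    return majority_vote([predictions[i] for i in ranked[:k]])
```

Calling `majority_vote` on all models corresponds to the All vote scheme, and `top_k_vote` with `k=6` to the Top6 vote scheme.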

Results

After experimentation and optimization, using mean F1 score as the metric, we trained 12 models. Several of them surpassed the best average F1 score of 89.7% reported in "Fighting the COVID-19 Infodemic with a Holistic BERT Ensemble", including Roberta-lstm-attn (91.38%), xlmRoberta-lstm-attn (91.09%), xlmRoberta-lstm-attn-biasedWeight (90.67%) and xlmRoberta-multihead (90.49%). The voting mechanism further improved the result, reaching a best-vote score of 93.54%.

haoyi-duan / nlp4if