This project is still in progress
The data source is the following: https://www.kaggle.com/gpreda/pfizer-vaccine-tweets
This data set from Kaggle contains tweets about COVID19 vaccines with lists of Hashtags. The goal of this project is classifying these tweets to distinguish health-oriented, political and junk knowledge based on NLP algorithms.
In order to achieve this, we follow three steps:
-
Text processing and data preparation: in this step we format the data, delete GDPR related information, extract links from tweets and check some data features
-
Assigning categories: this is manual process which plays a role in the prediction quality. We have to classify the tweets directly one-by-one, which is a time-consuming process:
Approval | 230 | % 7.72 Business | 50 | % 1.68 Health | 127 | % 4.26 Junk Knowledge | 224 | % 7.52 Other | 204 | % 6.85 Politics | 154 | % 5.17 Procurement and Logistics | 231 | % 7.76 Unique vaccination | 780 | % 26.19 Vaccination campaign | 418 | % 14.04 Vaccine | 400 | % 13.43 Side effects | 160 | % 5.37 DISCLAIMER: these classifications are the result of a manual effort whithout medical proper knowledge. Medical professionals may have a different opinion.
-
Once categories are assigned, an NLP vectorizer is used to predict the tweets.
At the end we have a classifier to predict whether the tweet has real informative value in regards to COVID19 and helps to eliminate all junk-science or any other irrelevant information.