In this project, I aim to classify tweets into three categories:
- Positive about the corona vaccine
- Negative about the corona vaccine
- Unrelated to the corona vaccine
The tweet collection from social media can be downloaded from here.
I started by visualizing the data to gain insights, then preprocessed the text. Next, I extracted Bag of Words features with CountVectorizer and trained several machine learning models, using Logistic Regression as the baseline, along with some deep learning models.
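As a rough illustration of the baseline described above, here is a minimal sketch that chains scikit-learn's `CountVectorizer` with `LogisticRegression`. The toy tweets, the label names, and the `max_iter` setting are assumptions made for this example; they are not taken from the project's dataset or its actual configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration only (not from the project's dataset).
texts = [
    "the vaccine works great and I feel safe",
    "so grateful for my vaccine shot today",
    "the vaccine is dangerous and I do not trust it",
    "terrible side effects from the vaccine",
    "watching football with friends tonight",
    "new pizza place opened downtown",
]
# Hypothetical label names for the three categories.
labels = ["positive", "positive", "negative", "negative", "unrelated", "unrelated"]

# CountVectorizer turns each tweet into a vector of word counts (Bag of Words);
# LogisticRegression then learns one set of weights per class.
baseline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)

# Predict on the training tweets just to show the call shape.
predictions = baseline.predict(texts)
print(predictions)
```

Replacing `CountVectorizer()` with `TfidfVectorizer()` from the same module would give a TF-IDF variant of this pipeline. On real data, accuracy should be measured on a held-out split rather than on the training tweets.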
The following table shows the accuracy of each model and preprocessing combination:
| Model | Preprocessing Methods | Accuracy |
|---|---|---|
| Logistic Regression with Bag of Words | None | 0.761870 |
| Logistic Regression with Bag of Words | Removing emojis, removing URLs, remove_tweets_... | 0.768495 |
| Logistic Regression with TF-IDF | Removing emojis, removing URLs, remove_tweets_... | 0.768495 |
| Model using LSTM and Embedding with Balance | Removing emojis, removing URLs, remove_tweets_... | 0.728400 |
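
The emoji and URL removal steps listed in the table could look roughly like the sketch below. The function name `clean_tweet`, the regular expressions, and the chosen Unicode ranges are assumptions for illustration; the project's actual preprocessing code may differ, and the truncated `remove_tweets_...` step is not reproduced here.

```python
import re

# Matches http(s) links and bare www. links (an assumed, simplified pattern).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Covers common emoji blocks; this is an approximation, not a complete list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\uFE0F"                 # variation selector that often follows an emoji
    "]+"
)


def clean_tweet(text: str) -> str:
    """Remove URLs and emojis, then collapse repeated whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = EMOJI_PATTERN.sub(" ", text)
    return " ".join(text.split())


print(clean_tweet("Got my shot 💉 today! https://example.com/x 😀"))
```

Applying such a function to every tweet before vectorization keeps links and emoji code points out of the vocabulary.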