puskini33 / sentiment-analysis-tweets

This project is developed in line with the Curriculum of the Frauenloop Intermediary Course in Machine Learning.

Home Page: https://www.frauenloop.org/

Sentiment Analysis of Tweets

Introduction

This project is part of the Machine Learning course curriculum at Frauenloop. The course ran for 3 months and involved setting up a project in natural language processing (NLP). Some of the concepts taught include:
+ types of NLP analyses
+ connecting to an API and retrieving data
+ data processing with spaCy
+ feature extraction with tf-idf
+ creating a pipeline with scikit-learn for modeling
+ evaluating model performance

I chose to focus on the sentiment analysis of tweets retrieved through the Twitter API with Twython, a Python wrapper for that API. The goal of my project is to predict the sentiment of tweets about Donald Trump. I made a list of 9 companies that openly support or oppose Donald Trump, and I retrieved 10,000 tweets from each company's Twitter page. I assumed that the companies that openly support Donald Trump would post more positive tweets about him, and the companies that oppose him would post more negative ones.

Installation Guide

To set up your local environment before working on the project, follow these steps:

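  1. Make sure Python 3 is installed. The venv module is part of the standard library (Python 3.3 and later), so it does not need to be installed with pip. On Debian or Ubuntu it may need the python3-venv system package.

  2. Create a virtual environment named venv inside the root folder: python -m venv venv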

  3. Activate the virtual environment:

    Linux and macOS: source venv/bin/activate

    Windows: venv\Scripts\Activate.ps1 (PowerShell), or cd into the venv\Scripts folder and run activate

  4. Upgrade pip: python -m pip install --upgrade pip

  5. Install requirements: pip install -r requirements.txt

  6. Install the spacy model: python -m spacy download en_core_web_sm

  7. To request the data files, write to me at: elenahirjoaba@gmail.com

Dataset

The dataset is composed of 90,000 tweets retrieved from the Twitter pages of companies that openly support or oppose Donald Trump. The companies are: Uline, Home_Depot, CNN, Taco_Bell, Bang_Energy, Patagonia, Microsoft, Merriam_Webster, Fox_News.

File Preparation

The scripts for file preparation can be found in src.data.file_preparation. The module contains functions to create new raw_ and processed_ files with headers, to write to a .csv file, and to get a relative file path.
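The helpers described above could look roughly like the sketch below. The function names, the column headers, and the data/<folder>/<file> layout are assumptions made for illustration; they are not taken from src.data.file_preparation.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real headers are not listed in this README.
HEADERS = ["tweet_id", "created_at", "text"]


def get_relative_path(folder, file_name, root=Path(".")):
    """Build a path such as data/raw/raw_cnn.csv relative to the given root."""
    return Path(root) / "data" / folder / file_name


def create_file_with_headers(path, headers=HEADERS):
    """Create the .csv file (and its parent folders) and write the header row."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(headers)


def append_rows(path, rows):
    """Append rows (lists of values) to an existing .csv file."""
    with path.open("a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

Writing the header once and appending afterwards keeps the file usable if retrieval is interrupted and resumed.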

Data Retrieval

The scripts for data retrieval can be found in src.data.data_retrieval. You can stream data by specifying the search keyword, the id of the Twitter page, the number of minutes to stream, and the number of tweets to collect. The data is saved in data.raw.
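The four streaming parameters can be illustrated with a small sketch of the stopping logic. It works on any iterable of tweet dictionaries, so it runs without API credentials; the function name, its arguments, and the assumed tweet shape ({"text": ..., "user": {"id": ...}}) are hypothetical and do not reproduce the project's actual Twython code.

```python
import time


def collect_tweets(stream, keyword, max_tweets, max_minutes, page_id=None):
    """Collect matching tweets until a count limit or a time limit is reached."""
    deadline = time.monotonic() + max_minutes * 60
    collected = []
    for tweet in stream:
        # Stop as soon as either limit is hit.
        if len(collected) >= max_tweets or time.monotonic() >= deadline:
            break
        # Optionally keep only tweets posted by one page (assumed dict shape).
        if page_id is not None and tweet.get("user", {}).get("id") != page_id:
            continue
        # Keep only tweets whose text mentions the keyword, ignoring case.
        if keyword.lower() in tweet.get("text", "").lower():
            collected.append(tweet)
    return collected
```

In a real run the iterable would be fed by the streaming client, and the collected tweets would then be written to a file in data.raw.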

Data Preprocessing

The script for data preprocessing can be found in src.data.data_preprocessing. The processed .csv files can be found in data.processed. Stop words, URLs, handles, and punctuation are removed, and each emoji is replaced with a string for the category it belongs to: EPOS or ENEG. I set the valence of each emoji subjectively, separating positive and negative emojis into 2 lists.
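A minimal sketch of these cleaning steps is shown below, using only the standard library. The emoji lists and the stop-word set are tiny hypothetical stand-ins (the project relies on spaCy and on longer, hand-made emoji lists), and the regular expressions are one possible way to match URLs and handles.

```python
import re
import string

# Hypothetical, deliberately tiny lists; the project's real lists are longer.
POSITIVE_EMOJIS = {"😀", "😍", "👍"}
NEGATIVE_EMOJIS = {"😡", "😢", "👎"}
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Remove URLs, handles, punctuation and stop words; tag emojis by valence."""
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    # Replace each emoji with its category token, padded so it becomes a word.
    for emoji in POSITIVE_EMOJIS:
        text = text.replace(emoji, " EPOS ")
    for emoji in NEGATIVE_EMOJIS:
        text = text.replace(emoji, " ENEG ")
    text = text.translate(PUNCT_TABLE)
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    return " ".join(tokens)
```

For example, preprocess("@CNN The president is great 😀! https://t.co/abc") returns "president great EPOS". Handles and URLs are removed before punctuation so that the "@" and "://" markers are still present when the patterns run.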

Data Labeling

The script for data labeling can be found in src.data.data_labeling. The .csv file with labeled data can be found in data.processed. I used 2 labels for the sentiment of a tweet: positive and negative. I first ran a TF-IDF analysis on the corpus, then manually sorted the most frequent words into a positive word list and a negative word list. Each tweet is labeled according to whichever list contributes more words to it. Tweets with no positive or negative words, or with an equal number of both, were discarded.
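The labeling rule can be sketched as follows. The two word lists are short hypothetical examples, not the manually curated lists used in the project, and the function names are chosen for illustration only.

```python
# Hypothetical word lists; the project's curated lists are not shown here.
POSITIVE_WORDS = {"great", "win", "love"}
NEGATIVE_WORDS = {"bad", "lie", "hate"}


def label_tweet(text):
    """Return 'positive', 'negative', or None when the tweet should be discarded."""
    tokens = text.lower().split()
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    # No sentiment words at all, or a tie between the two counts.
    return None


def label_corpus(tweets):
    """Keep only the tweets that receive a label, paired with that label."""
    labeled = []
    for tweet in tweets:
        label = label_tweet(tweet)
        if label is not None:
            labeled.append((tweet, label))
    return labeled
```

With these lists, label_tweet("great win bad") returns "positive", while label_tweet("love hate") returns None and the tweet is dropped.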

Feature Extraction

The script for feature extraction can be found in src.features.feature_extraction. I applied a tf-idf analysis to extract features.
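As a rough illustration of tf-idf weighting, the sketch below computes the textbook form tf * log(N / df) in plain Python. It is not the project's implementation: the function name is hypothetical, and libraries such as scikit-learn use a smoothed idf and L2-normalized vectors by default, so their values differ from the ones produced here.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every document gets weight 0, which is why very common words contribute little to the features, while words concentrated in a few tweets are weighted more heavily.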

Modeling

Limitations

The project and its analysis have some limitations, including the subjective evaluation of the sentiment of emojis in the text and of the valence of the most common words in the corpus, which were used to set the label of each tweet.

About

License: GNU Affero General Public License v3.0


Languages

Language: Python 100.0%