SocialMedia_Content_Filter

Backend for a Chrome extension that filters social media content based on sentiment and user-chosen filter words. This repository contains the models and the scripts used to create them. The extension itself can be found here. The API repository can be found here.

Folders

Code

initial_cleaning.py - Script for taking the training.1600000.processed.noemoticon.csv data and turning it into bigram_tweet_df.csv
initial_modeling.py - Script for the sentiment modeling
preprocessing_script.py - Script for preprocessing the tweets; word filtering also takes place here
topic_modeling.py - Script for topic modeling (Unfinished - not used in final extension)
Capstone_Glove_Word_Embeddings.ipynb - Exploration of GloVe word vectors for classification
Capstone_Fasttext_Word_Embeddings.ipynb - Exploration of fastText word vectors for classification
Tweet_Extraction_from_UMich_Tweets.ipynb - Extracting test set from random sample of tweets provided by UMSI
Capstone_LinearSVC_Model_Eval.ipynb - Evaluate LinearSVC model on test set derived from UMSI tweet sample
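
As a rough illustration of the word-filtering step mentioned for preprocessing_script.py, the sketch below checks whether a tweet contains any of a user's filter words. It is not the code from this repository: the function names `tokenize` and `contains_filter_word` and the simple regex tokenizer are assumptions made for the example, and the real script may normalize text differently (for example with lemmatization).

```python
import re


def tokenize(text):
    """Lowercase a tweet and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def contains_filter_word(text, filter_words):
    """Return True if any token in the tweet matches a user filter word."""
    blocked = {word.lower() for word in filter_words}
    return any(token in blocked for token in tokenize(text))


# Example: a user who has chosen to filter out COVID-related posts.
user_filter_words = ["covid", "pandemic"]
hide_first = contains_filter_word("COVID cases are rising again", user_filter_words)
hide_second = contains_filter_word("Lovely weather today", user_filter_words)
```

Matching on whole tokens rather than substrings keeps a filter word such as "pan" from hiding a tweet about "pancakes".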

Data

bigram_tweet_df.csv - Cut dataset containing tweets
training.1600000.processed.noemoticon.csv - Cut dataset containing tokenized tweets
thesaurus.json - The original thesaurus file
thesaurus_lem.json - The lemmatized thesaurus file
sampled_tweets.csv - Tweets sampled from the UMSI tweets, using the VADER algorithm for preliminary sentiment labeling
cleaned_human_responses.csv - Modified version of sampled_tweets with human labeling added

Models

LinearSVCModel.sav - Linear SVC Model Pickle
MNBModel.sav - Multinomial Naive Bayes Model Pickle
phrasemodel.sav - Phrase Model Pickle
SGDModel.sav - Stochastic Gradient Descent Model Pickle
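
Since the model files are described as pickles, loading one might look like the minimal sketch below. The helper name `load_model` is an assumption, and the demo round-trips a stand-in object so it runs without the real files; for the real models, the library that created them (presumably scikit-learn, though this README does not say so) would need to be installed.

```python
import pickle
import tempfile
from pathlib import Path


def load_model(path):
    """Unpickle and return the object stored in a .sav file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip demo with a stand-in object, so the sketch runs without the
# real model files being present.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "DemoModel.sav"
    with open(demo_path, "wb") as f:
        pickle.dump({"name": "stand-in model"}, f)
    restored = load_model(demo_path)
```

With the repository checked out, the same helper could be pointed at a real file, for example `load_model("Models/LinearSVCModel.sav")`. Only unpickle files from a trusted source, because unpickling can execute arbitrary code.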

Other

Images - Confusion Matrices and Accuracy vs Model Size Graph
