sevdari / ML_Project_CIS419

Sentiment Analysis: This project is a deliverable for the CIS419 course at the University of Pennsylvania.

CIS 4190 Applied Machine Learning Project: Group 045

For this project, we were interested in studying sentiment prediction in NLP. Sentiment analysis is an important tool for organizations and businesses, as they seek to understand large amounts of text data.

We were primarily interested in how sentiment towards food is reflected in review data. We used the Amazon Fine Foods Dataset, taking the review texts as the raw inputs to our model and predicting whether each review was positive or negative overall.
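As a rough illustration of the prediction target, the sketch below turns star ratings into binary labels. It assumes the `Score` (1 to 5) and `Text` columns of the Kaggle `Reviews.csv`. The thresholds (4 and 5 positive, 1 and 2 negative, 3 dropped as neutral) are an assumption for illustration and not necessarily what the notebooks use, and `label_from_score` and `load_labeled_reviews` are hypothetical helper names.

```python
import csv
import io


def label_from_score(score):
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None (neutral).

    The cut-offs are an assumption made for this sketch only.
    """
    if score >= 4:
        return 1
    if score <= 2:
        return 0
    return None


def load_labeled_reviews(csv_file):
    """Yield (text, label) pairs from an open CSV file, skipping neutral reviews."""
    for row in csv.DictReader(csv_file):
        label = label_from_score(int(row["Score"]))
        if label is not None:
            yield row["Text"], label


# Tiny in-memory stand-in for Reviews.csv (made-up rows, not real data).
sample = io.StringIO(
    "Score,Text\n"
    "5,Great coffee\n"
    "3,It was okay\n"
    "1,Stale and bland\n"
)
pairs = list(load_labeled_reviews(sample))
```

For the real file, the same function could be called on `open("Reviews.csv", newline="", encoding="utf-8")`.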

A complete description of the project can be found here.

Instructions on how to run each file:

lstm.ipynb

The following files should be available in the same directory:

  • Reviews.csv: This file can be downloaded from the Kaggle link above.
  • glove.840B.300d.txt: It can be downloaded here. This file provides pretrained GloVe word vectors trained on Common Crawl data, a large-scale crawl of the web.
  • movie_train.tsv: Needed only for the dataset shift portion of the code.
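The GloVe file listed above is plain text, with one token per line followed by its vector components. A minimal sketch of a loader is shown below, assuming that layout. `load_glove` is a hypothetical helper and not the code used in lstm.ipynb, and the toy vectors are made up. Splitting from the right is a defensive choice, since some tokens in the 840B file are reported to contain spaces.

```python
import io


def load_glove(lines, dim=300, vocab=None):
    """Parse GloVe text lines into a {token: [float, ...]} dict.

    Each line is a token followed by `dim` space-separated floats. Splitting
    from the right keeps tokens that themselves contain spaces intact.
    If `vocab` is given, only those tokens are kept to save memory.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        token = parts[0]
        if vocab is not None and token not in vocab:
            continue
        vectors[token] = [float(x) for x in parts[1:]]
    return vectors


# Toy 3-dimensional example (made-up numbers, not real GloVe values).
toy = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n. . . 0.0 0.5 1.0\n")
emb = load_glove(toy, dim=3, vocab={"good", ". . ."})
```

Restricting the load to the vocabulary of the review texts keeps memory use manageable, since the full file is several gigabytes.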

The training process for this file was done on an AWS EC2 instance. Apart from training, the other code cells should run in a few minutes on most laptops.

The best performance achieved on the validation set (which contained an equal number of samples from each class) was close to 90%.

Some examples of sentences and their classification:

image

Snapshot of the EC2 Training:

image

xgboost.ipynb

The following file should be available in the same directory:

The training process for this file was done both locally and on an AWS SageMaker instance. The rest of the cells should run in under five minutes on most laptops, and the comments list the best hyperparameters we used, which saves on GridSearch time.

The best performance achieved on a balanced testing set (which contained equal samples from each class) was about 84%.

Example of sentiment analysis on a short user-generated sentence with our best-performing XGBoost model:

image

Snapshot of boosting rounds:

image

bert.ipynb

The following file should be available in the same directory:

  • Reviews.csv: This file can be downloaded from the Kaggle link above.

The training process for this file was done on an AWS EC2 instance. Apart from that, the cells that preprocess the data into BERT tokens should take no more than 10 minutes.

The model was trained on a balanced data set containing an equal number of positive and negative reviews. The hyperparameters need further tuning to improve accuracy.
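One way to build a balanced set like this is to draw the same number of examples from each class. The sketch below is an assumption about how that could be done, not the sampling code from the notebooks. `balanced_sample` is a hypothetical helper and the toy data is made up.

```python
import random


def balanced_sample(examples, n_per_class, seed=0):
    """Return a shuffled list with n_per_class examples for each label.

    `examples` is a list of (text, label) pairs.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    sampled = []
    for label, items in by_label.items():
        if len(items) < n_per_class:
            raise ValueError(f"label {label!r} has only {len(items)} examples")
        # Sample without replacement so no review appears twice.
        sampled.extend(rng.sample(items, n_per_class))
    rng.shuffle(sampled)
    return sampled


# Imbalanced toy data: 8 positive and 3 negative examples.
data = [(f"pos {i}", 1) for i in range(8)] + [(f"neg {i}", 0) for i in range(3)]
subset = balanced_sample(data, n_per_class=3)
```

Fixing the seed keeps the sampled subset reproducible across runs.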

Example of training loop loss and accuracy:

image
