showkatewang / Amazon_Vine_Analysis

Analyzes book review data from Amazon and the Amazon Vine program using PySpark and the Amazon Web Services Relational Database Service (AWS RDS).

Overview

The purpose of this project is to analyze a dataset of book reviews from Amazon. We use PySpark to extract, transform, and load the data into a PostgreSQL database hosted on an Amazon Web Services Relational Database Service (AWS RDS) instance, which we access through pgAdmin. We then use PySpark to determine, based on the dataset, whether paid Amazon Vine program members leave more positive reviews than other reviewers.
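The extract, transform, and load flow described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's actual code: the function names, the `vine_table` table name, and the selected columns are hypothetical, the TSV file is assumed to have been downloaded locally, and the PostgreSQL JDBC driver is assumed to be available to Spark.

```python
def build_jdbc_url(host: str, port: int, database: str) -> str:
    """Build a PostgreSQL JDBC URL for an RDS endpoint (all values are placeholders)."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def run_etl(tsv_path: str, jdbc_url: str, user: str, password: str,
            table: str = "vine_table"):
    """Read the review TSV, keep the Vine-related columns, and append them to a table.

    `table` and the column selection are assumptions made for illustration.
    """
    # Imported here so the URL helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("AmazonVineETL").getOrCreate()

    # Extract: the dataset is tab-separated with a header row.
    reviews = spark.read.csv(tsv_path, sep="\t", header=True, inferSchema=True)

    # Transform: keep only the columns needed for the Vine comparison.
    vine_df = reviews.select(
        "review_id", "star_rating", "helpful_votes",
        "total_votes", "vine", "verified_purchase",
    )

    # Load: write to the RDS-hosted PostgreSQL database over JDBC.
    vine_df.write.jdbc(
        url=jdbc_url,
        table=table,
        mode="append",
        properties={
            "user": user,
            "password": password,
            "driver": "org.postgresql.Driver",
        },
    )
    return vine_df
```

Once the write completes, the table can be inspected from pgAdmin by connecting to the same RDS endpoint.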


Results

  • There were 5,012 Vine reviews;
  • There were 109,297 non-Vine reviews.

  • 2,031 Vine reviews were five stars;
  • 49,967 non-Vine reviews were five stars.

  • Approximately 40.52% of Vine reviews were five stars;
  • Approximately 45.72% of non-Vine reviews were five stars.
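One way to obtain counts and percentages like these in PySpark is a single grouped aggregation, sketched below. The function names are hypothetical, the DataFrame is assumed to have `vine` and `star_rating` columns, and any filtering the notebook applies before counting is not reproduced here.

```python
def five_star_percentage(five_star_count: int, total_count: int) -> float:
    """Share of five-star reviews as a percentage, rounded to two decimals."""
    return round(100 * five_star_count / total_count, 2)


def summarize_by_vine(reviews_df):
    """Return totals, five-star counts, and percentages keyed by the `vine` flag.

    `reviews_df` is assumed to be a PySpark DataFrame with `vine` and
    `star_rating` columns.
    """
    # Imported here so the arithmetic helper works without Spark installed.
    from pyspark.sql import functions as F

    rows = (
        reviews_df.groupBy("vine")
        .agg(
            F.count("*").alias("total"),
            # Count a review as 1 when it has five stars, otherwise 0.
            F.sum(F.when(F.col("star_rating") == 5, 1).otherwise(0)).alias("five_star"),
        )
        .collect()
    )
    return {
        row["vine"]: {
            "total": row["total"],
            "five_star": row["five_star"],
            "percent_five_star": five_star_percentage(row["five_star"], row["total"]),
        }
        for row in rows
    }


# Checking the arithmetic against the counts reported above:
vine_pct = five_star_percentage(2031, 5012)        # 40.52
non_vine_pct = five_star_percentage(49967, 109297)  # 45.72
```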

At-a-Glance

| | Vine Reviews | Non-Vine Reviews |
|---|---|---|
| Total Reviews | 5,012 | 109,297 |
| Number of Five Stars | 2,031 | 49,967 |
| Percentage of Five Stars | 40.52% | 45.72% |

Summary

Based on the calculations above, the data show no sign of positivity bias among members of the Vine program. Vine reviews were five stars somewhat less often than non-Vine reviews (40.52% versus 45.72%, a gap of about 5 percentage points). This comparison covers only five-star ratings; additional analysis could calculate the percentage of Vine and non-Vine reviews at each star rating to compare the full distributions.
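The per-rating breakdown suggested above can be sketched in plain Python. The `rating_distribution` function and the sample rows are made up for illustration; the real analysis would run the same grouping over the full dataset, for example with a PySpark `groupBy` on `vine` and `star_rating`.

```python
from collections import Counter


def rating_distribution(reviews):
    """Map each vine flag to {star_rating: percentage of that group's reviews}.

    `reviews` is an iterable of (vine_flag, star_rating) pairs.
    """
    counts = Counter(reviews)

    # Total number of reviews per vine flag.
    totals = Counter()
    for (flag, _stars), n in counts.items():
        totals[flag] += n

    distribution = {}
    for (flag, stars), n in sorted(counts.items()):
        distribution.setdefault(flag, {})[stars] = round(100 * n / totals[flag], 2)
    return distribution


# Made-up sample rows, not taken from the dataset.
sample = [
    ("Y", 5), ("Y", 4), ("Y", 5), ("Y", 3),
    ("N", 5), ("N", 1), ("N", 5), ("N", 5),
]
distribution = rating_distribution(sample)
```

Comparing the two resulting dictionaries rating by rating shows whether one group skews toward higher or lower ratings overall, which the five-star percentage alone cannot show.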


Resources

Data Source:

https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_00.tsv.gz

Software:

AWS RDS
Google Colaboratory Notebook
Apache Spark
PySpark
Python 
pgAdmin
Hadoop
MapReduce
mrjob

Contact

Email: show.wang94@gmail.com

LinkedIn: https://www.linkedin.com/in/s-k-wang

License: MIT License

