Niraj-Khatri / Video_Game_Reviews

This project showcased the ETL process for big data. Raw Amazon video game review data was collected from a site, loaded into an AWS database, and queried with PySpark and SQL to find out whether Amazon Vine reviews influenced customer feedback.

PySpark-AWS Project

The goal of this project was to extract Amazon product review data, clean it, and load it into a Postgres database hosted on AWS RDS. Afterwards, I analyzed the data to determine whether a certified Amazon Vine reviewer provided more helpful reviews than a non-Vine reviewer.

ETL

I extracted Amazon video game review data from the following site: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Video_Games_v1_00.tsv.gz
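As a rough illustration of the extract step, the sketch below parses a gzip-compressed TSV with the Python standard library. It is not the notebook's PySpark code. The column names are an assumption based on the public dataset's layout (only a subset is shown), the sample rows are invented, and `read_reviews` is a hypothetical helper.

```python
import csv
import gzip
import io

# Invented sample rows; column names are assumed, not taken from this README.
SAMPLE_TSV = (
    "marketplace\tcustomer_id\treview_id\tproduct_id\tproduct_title\t"
    "star_rating\thelpful_votes\ttotal_votes\tvine\treview_date\n"
    "US\t1001\tR1\tB001\tGame A\t5\t18\t20\tN\t2015-08-31\n"
    "US\t1002\tR2\tB002\tGame B\t3\t2\t30\tY\t2015-08-30\n"
)


def read_reviews(raw_bytes):
    """Decompress gzip bytes and parse the tab-separated rows into dicts."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps stray quote characters in review text from confusing the parser.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


# Stand-in for the downloaded .tsv.gz file: compress the sample in memory.
compressed = gzip.compress(SAMPLE_TSV.encode("utf-8"))
rows = read_reviews(compressed)
print(len(rows), rows[0]["review_id"], rows[1]["vine"])
```

Every value comes back as a string, so numeric columns such as `star_rating` still need casting during cleaning.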


I cleaned the data and created four tables for later analysis: customers, products, reviews, and vines.
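One way to picture the split into four tables is the plain-Python sketch below. The columns assigned to each table are hypothetical, since the README does not list them, and `build_tables` is an illustrative helper rather than the notebook's code.

```python
def build_tables(rows):
    """Split raw review rows into four table-shaped lists of dicts."""
    # customers: one row per customer, with the number of reviews they wrote.
    counts = {}
    for r in rows:
        counts[r["customer_id"]] = counts.get(r["customer_id"], 0) + 1
    customers = [{"customer_id": c, "customer_count": n} for c, n in counts.items()]

    # products: deduplicated on product_id (first title seen wins).
    titles = {}
    for r in rows:
        titles.setdefault(r["product_id"], r["product_title"])
    products = [{"product_id": p, "product_title": t} for p, t in titles.items()]

    # reviews: one row per review, linking customer and product.
    reviews = [
        {"review_id": r["review_id"], "customer_id": r["customer_id"],
         "product_id": r["product_id"], "review_date": r["review_date"]}
        for r in rows
    ]

    # vines: rating and vote columns, cast from strings to integers.
    vines = [
        {"review_id": r["review_id"], "star_rating": int(r["star_rating"]),
         "helpful_votes": int(r["helpful_votes"]),
         "total_votes": int(r["total_votes"]), "vine": r["vine"]}
        for r in rows
    ]
    return {"customers": customers, "products": products,
            "reviews": reviews, "vines": vines}


# Invented sample rows: one repeat customer and one repeat product.
raw_rows = [
    {"review_id": "R1", "customer_id": "1001", "product_id": "B001", "product_title": "Game A",
     "star_rating": "5", "helpful_votes": "18", "total_votes": "20", "vine": "N",
     "review_date": "2015-08-31"},
    {"review_id": "R2", "customer_id": "1001", "product_id": "B002", "product_title": "Game B",
     "star_rating": "3", "helpful_votes": "2", "total_votes": "30", "vine": "Y",
     "review_date": "2015-08-30"},
    {"review_id": "R3", "customer_id": "1002", "product_id": "B001", "product_title": "Game A",
     "star_rating": "4", "helpful_votes": "9", "total_votes": "10", "vine": "N",
     "review_date": "2015-08-29"},
]
tables = build_tables(raw_rows)
print({name: len(table) for name, table in tables.items()})
```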



I created an AWS RDS instance and used an SQL script to create the four tables in Postgres.



With PySpark, I loaded the four tables into Postgres.
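A minimal sketch of what such a load could look like, assuming the write goes through Spark's JDBC writer (`DataFrame.write.jdbc`). The endpoint, database name, and credentials are placeholders, and `jdbc_settings` and `load_table` are hypothetical helpers, not names from the notebook.

```python
def jdbc_settings(host, port, database, user, password):
    """Build the JDBC URL and connection properties for a Postgres target."""
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",  # class name of the Postgres JDBC driver
    }
    return url, properties


def load_table(df, table_name, url, properties, mode="append"):
    """Write one Spark DataFrame to a Postgres table over JDBC."""
    # df is expected to be a pyspark.sql.DataFrame; nothing is written until this is called.
    df.write.jdbc(url=url, table=table_name, mode=mode, properties=properties)


# Placeholder values only; substitute the real RDS endpoint and credentials.
url, properties = jdbc_settings("<your-rds-endpoint>", 5432, "<database>", "<user>", "<password>")
print(url)
```

With a live Spark session, each cleaned DataFrame would be passed to `load_table` along with its table name. The Postgres JDBC driver also has to be available on the Spark classpath for the write to succeed.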


Data Analysis

I wanted to analyze the Amazon video game data to determine whether Amazon Vine reviewers provided more helpful reviews.


First, using the vines table, I kept only reviews with at least 20 total votes in which at least 50% of the votes were helpful.
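The filter can be sketched in plain Python as follows. The field names `helpful_votes` and `total_votes` are assumed column names, the sample rows are invented, and `keep_review` is a hypothetical helper. The notebook itself may express the same condition in PySpark or SQL.

```python
def keep_review(row, min_total_votes=20, min_helpful_ratio=0.5):
    """Return True if the review has enough votes and enough of them are helpful."""
    total = row["total_votes"]
    if total < min_total_votes:
        return False  # too few votes; this check also rules out division by zero
    return row["helpful_votes"] / total >= min_helpful_ratio


# Invented sample rows.
sample = [
    {"review_id": "R1", "helpful_votes": 18, "total_votes": 20},  # 90% helpful
    {"review_id": "R2", "helpful_votes": 2, "total_votes": 30},   # under 50% helpful
    {"review_id": "R3", "helpful_votes": 9, "total_votes": 10},   # fewer than 20 votes
    {"review_id": "R4", "helpful_votes": 15, "total_votes": 30},  # exactly 50% helpful
]
filtered = [r for r in sample if keep_review(r)]
print([r["review_id"] for r in filtered])  # → ['R1', 'R4']
```

Checking the vote count first means the ratio is only computed for reviews with at least 20 votes.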



Next, I counted the Vine and non-Vine reviews in the filtered data set.
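Counting the two groups amounts to a group-by on the Vine flag. The sketch below assumes the flag is stored as "Y" or "N" in a column named `vine`; the sample rows are invented.

```python
from collections import Counter

# Invented rows standing in for the filtered data set.
filtered_reviews = [
    {"review_id": "R1", "vine": "N"},
    {"review_id": "R4", "vine": "Y"},
    {"review_id": "R5", "vine": "N"},
    {"review_id": "R6", "vine": "N"},
]

# Label each review by group, then tally the labels.
vine_counts = Counter(
    "vine" if r["vine"] == "Y" else "non-vine" for r in filtered_reviews
)
print(vine_counts["vine"], vine_counts["non-vine"])  # → 1 3
```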



Finally, I wanted to look at top-rated products (5 stars). I narrowed the data set to 5-star reviews and calculated the percentage of 5-star reviews among Vine and non-Vine reviews.
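The percentage calculation can be sketched as below. The column names `vine` and `star_rating` are assumptions, `five_star_share` is a hypothetical helper, and the sample rows are invented, so the printed numbers say nothing about the project's actual results.

```python
def five_star_share(rows):
    """Return the percentage of 5-star reviews for the Vine and non-Vine groups."""
    totals = {"vine": 0, "non-vine": 0}
    five_star = {"vine": 0, "non-vine": 0}
    for r in rows:
        group = "vine" if r["vine"] == "Y" else "non-vine"
        totals[group] += 1
        if r["star_rating"] == 5:
            five_star[group] += 1
    # Guard against an empty group so the division never fails.
    return {
        g: (100.0 * five_star[g] / totals[g] if totals[g] else 0.0)
        for g in totals
    }


# Invented rows standing in for the filtered data set.
helpful_reviews = [
    {"vine": "Y", "star_rating": 5},
    {"vine": "Y", "star_rating": 3},
    {"vine": "N", "star_rating": 5},
    {"vine": "N", "star_rating": 4},
    {"vine": "N", "star_rating": 2},
    {"vine": "N", "star_rating": 1},
]
shares = five_star_share(helpful_reviews)
print(shares)  # → {'vine': 50.0, 'non-vine': 25.0}
```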


Conclusion: Among Vine reviews that were found helpful, half gave the product 5 stars. This is 10% more than among non-Vine reviews. This may suggest that 5-star reviews are found more helpful overall, since they give the reader confidence to buy the product.



Languages

Language:Jupyter Notebook 100.0%