diannejardinez / big-data-challenge

Big Data ETL with PySpark and Amazon's RDS

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

ETL on Big Data - PySpark & AWS's Relational Databases - Analysis

Level 1

Objective: Perform ETL process completely in the cloud and upload a DataFrame to an RDS instance by creating 4 DataFrames to match production-ready tables from two big Amazon customer review datasets

ETL: Outdoors Dataset

  • Extracted: 2,302,173 number of records
  • Transformed
  • Load

ETL: Digital Video Games Dataset

  • Extracted: 145,431 number of records
  • Transformed
  • Load

Level 2

Objective: Analyze whether reviews from Amazon's Vine program are trustworthy.

Outdoors Dataset

Overall Analysis: Amazon's Vine program reviews should be met with caution. Though customers in this program may be small, their reviews are mirrored by Non-Vine customers(when looking at extreme rating scores of 5 and 1.) In-depth Analysis only on Outdoors dataset was performed due to Digital Video Games not having any Vine customers documented.

Summary Analysis:

Vine Customer Non-Vine Customer
Number 3137 2299036
Percentage 0.14% 99.86%
Average star rating 4.37 4.24
Top Product of VC* --- Same score as VC
Low Product of VC* --- Same score as VC
Top Product of NV* Not purchased ---
Low Product of NV* Not rated/purchased ---
Reviews created* 1-2 ~200+

*VC: Vine Customer, NV: Non-Vine Customer. Customer reviews from the top three helpful voting scores from 1 and 5 star ratings

Additional:

  • Vine customers with the top helpful votes provided 1-2 reviews with their first as the highest, and second as lowest. Non-Vine customers made more reviews. These results match the intention of the Vine program where there is a limited number of members and their rank/ability to stay as a Vine member is by their helpful votes score

  • Products from top and low star ratings from Non-Vine customers were not found in product reviews from Vine customers

  • Data for star ratings and Vine/Non-Vine customers only looked into three products with the highest helpful votes for each type of customer. This was executed due to helpful votes being a quantifiable metric where customers can gauged as to whether a review seems to be an accurate description of the product and the product experience

  • Future research would be to examine "Amazon's Choice" and "Best Seller" products and whether this statement actually influences ratings or product purchase

About

Big Data ETL with PySpark and Amazon's RDS

License:MIT License


Languages

Language:Jupyter Notebook 100.0%