tmvien / Analysis_Yelp_Business_Public_Dataset

Analysis_on_Yelp_Business_Public_Dataset

Technology: AWS, S3, EMR
Language: PySpark
Code and Analysis

Project Description

In this project, we use PySpark on AWS EMR to analyze the Yelp public datasets from Kaggle, which have been uploaded to an AWS S3 bucket.

The datasets total almost 10 GB and can be loaded as follows:

business = spark.read.json("s3://cis9760-project-ii-mv/*business.json")
reviews = spark.read.json("s3://cis9760-project-ii-mv/*review.json")
user = spark.read.json("s3://cis9760-project-ii-mv/*user.json")
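The three reads above share one pattern, so they could also be written as a small helper. This is only a sketch: the function names `s3_paths` and `load_datasets` are hypothetical and not part of the notebook, and `spark` is assumed to be the SparkSession that an EMR notebook normally provides.

```python
# Hypothetical helpers (not from the notebook): build the S3 glob paths
# used above and read each one as a JSON DataFrame.

def s3_paths(bucket, names=("business", "review", "user")):
    """Return a dict mapping dataset name to its S3 glob path."""
    return {name: f"s3://{bucket}/*{name}.json" for name in names}


def load_datasets(spark, bucket):
    """Read every dataset with spark.read.json and return them by name.

    `spark` is assumed to be an existing SparkSession.
    """
    return {name: spark.read.json(path) for name, path in s3_paths(bucket).items()}
```

In a notebook this might be used as `dfs = load_datasets(spark, "cis9760-project-ii-mv")`, after which `dfs["business"].printSchema()` is a quick way to check that the files were parsed as expected.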

The notebook contains four parts plus one additional analysis:

Part I: Installation and Initial Setup
Part II: Analyzing Categories
Part III: Do Yelp Reviews Skew Negative?
Part IV: Should the Elite be Trusted?
Additional Analysis: The Percentage of Elite Reviews for Each Category
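As a rough illustration of the kind of step the category analysis typically needs, the sketch below splits a comma-separated category string into one row per category. It is not taken from the notebook: it assumes the business data has a `business_id` column and a `categories` column holding a comma-separated string, and the function names are hypothetical.

```python
# Hypothetical sketch (not the notebook's code). Assumes `categories`
# is a comma-separated string such as "Restaurants, Pizza".

def split_categories(categories):
    """Pure-Python view of the split, handy for checking one value."""
    if not categories:
        return []
    return [c.strip() for c in categories.split(",") if c.strip()]


def explode_categories(business_df):
    """Return one (business_id, category) row per category.

    PySpark is imported here so the helper above works without it.
    """
    from pyspark.sql import functions as F

    return business_df.select(
        "business_id",
        # Split on a comma plus optional whitespace, then emit one row per item.
        F.explode(F.split(F.col("categories"), r",\s*")).alias("category"),
    )
```

With the exploded frame, a call such as `explode_categories(business).groupBy("category").count()` would give the number of businesses per category.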

Cluster Configuration

Notebook Configuration
