tmvien / Analysis_Yelp_Business_Public_Dataset

Analysis_on_Yelp_Business_Public_Dataset

Project Description

In this project, we use PySpark on AWS EMR to analyze the Yelp public datasets from Kaggle, which have been uploaded to an AWS S3 bucket.

The datasets total almost 10 GB and can be loaded as follows:

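# Each wildcard matches the JSON file whose name ends with the given suffix.
# `spark` is the SparkSession available in the notebook.
business = spark.read.json("s3://cis9760-project-ii-mv/*business.json")
reviews = spark.read.json("s3://cis9760-project-ii-mv/*review.json")
user = spark.read.json("s3://cis9760-project-ii-mv/*user.json")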

The notebook contains four parts plus one additional analysis:

  • Part I: Installation and Initial Setup
  • Part II: Analyzing Categories
  • Part III: Do Yelp Reviews Skew Negative?
  • Part IV: Should the Elite be Trusted?
  • Additional Analysis: The Percentage of Elite Reviews for Each Category
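As a rough illustration of the kind of counting Part II involves, here is a minimal sketch in plain Python that runs without a cluster. It assumes the `categories` field of `business.json` is a single comma-separated string (for example `"Restaurants, Mexican"`) and that some businesses have no categories at all. The sample records and the `count_categories` helper are hypothetical and are not taken from the notebook.

```python
from collections import Counter

# Made-up records shaped like rows of business.json (illustrative only).
sample_businesses = [
    {"business_id": "b1", "categories": "Restaurants, Mexican"},
    {"business_id": "b2", "categories": "Restaurants, Bars, Nightlife"},
    {"business_id": "b3", "categories": None},  # assumed: categories can be missing
]


def count_categories(businesses):
    """Count how many businesses carry each individual category label."""
    counts = Counter()
    for row in businesses:
        raw = row.get("categories")
        if not raw:
            continue  # skip businesses without any category string
        # Split the comma-separated string and trim whitespace; using a set
        # means a label repeated within one business is counted only once.
        labels = {label.strip() for label in raw.split(",") if label.strip()}
        counts.update(labels)
    return counts


category_counts = count_categories(sample_businesses)
print(category_counts.most_common(1))  # [('Restaurants', 2)]
```

On the full dataset the same split-then-count idea would normally be expressed with PySpark DataFrame operations rather than a Python loop, since the data is too large to process comfortably on one machine.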

Cluster Configuration

[Screenshot: cluster configuration]

Notebook Configuration

[Screenshot: notebook configuration]
