shouvik19 / Analysis-and-Recommendations-on-YELP-Dataset

Analysis and Recommendations on YELP Dataset

ARYD - Big Data Programming Project

Analysis and Recommendation on YELP dataset

Objective:

To provide businesses with useful insights from the YELP dataset through big data analytics, identifying strengths and weaknesses so that existing and future business owners can make decisions on new businesses or business expansion. The project also provides recommendations to both business owners and users through extensive analysis of the data.

Project Overview:

The project involves analysis of the dataset, visualizations based on that analysis, and recommendations. The major modules of the project are:

  1. Validation of reviews on businesses based on user information.
  2. Classification of positive and negative reviews using Machine Learning techniques.
  3. Recommending location-based “buzzwords” to future business owners by analyzing positive and negative reviews for businesses in a state.
  4. User-specific recommendations using the user’s history of availed services. Recommendations are based on the categories of the services, the location of the business, user reviews and user ratings.

Analysis was done on the dataset to understand correlations between different metrics, such as the location of a business and its success. Business trends were analyzed by location, ratings, category and attributes of the business. Trends of closed businesses were observed using user reviews and ratings.

A few visualizations for the project were done using Python libraries and are stored in the visualization folder. The remaining visualizations were done in Tableau and can be viewed here. View in full screen for a better experience.

Steps for execution:

The dataset for the project should be downloaded from the Yelp dataset challenge and stored in the yelp-dataset folder. The scripts should be executed in the order specified in the order_of_exec file.

Files:

-- business location - outliers removed using Euclidean distance from the average location of businesses in the state (Data Cleaning)
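
The cleaning step above can be illustrated with a minimal sketch. It is not the project's actual script: the function name, the `max_distance` threshold (in degrees) and the use of plain Python dicts are assumptions made for illustration, while the field names follow the Yelp business records.

```python
from collections import defaultdict
from math import hypot


def remove_location_outliers(businesses, max_distance=5.0):
    """Drop businesses lying far from the average location of their state.

    `businesses` is a list of dicts with 'state', 'latitude' and 'longitude'.
    `max_distance` is a hypothetical threshold, measured in degrees.
    """
    # Accumulate latitude sum, longitude sum and count per state.
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for b in businesses:
        acc = sums[b["state"]]
        acc[0] += b["latitude"]
        acc[1] += b["longitude"]
        acc[2] += 1
    centers = {s: (lat / n, lon / n) for s, (lat, lon, n) in sums.items()}

    kept = []
    for b in businesses:
        c_lat, c_lon = centers[b["state"]]
        # Plain Euclidean distance in degree space, not a geodesic distance.
        if hypot(b["latitude"] - c_lat, b["longitude"] - c_lon) <= max_distance:
            kept.append(b)
    return kept
```

Because the average itself is pulled toward outliers, the threshold needs to be chosen with the spread of each state in mind; a record tagged with the wrong state typically sits tens of degrees away and is easy to separate.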

-- user's location -- user validation score

-- classification of reviews (Machine Learning)

-- joined the predicted classes to the reviews and dropped less useful columns

-- location based recommendations -- category based recommendations -- overall recommendations

-- most availed category of business by a user -- average stars given by a user for each category -- number of positive and negative reviews given by a user
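
As a rough illustration of these three per-user metrics, the sketch below works on an in-memory list of one user's reviews. The function name and the record layout (a `categories` list, a `stars` number and a `label` of `'positive'` or `'negative'` taken from the classification step) are assumptions, not the project's actual schema.

```python
from collections import Counter, defaultdict


def summarize_user(reviews):
    """Summarize one user's reviews (hypothetical record layout)."""
    category_counts = Counter()
    stars_by_category = defaultdict(list)
    label_counts = Counter()
    for r in reviews:
        label_counts[r["label"]] += 1
        # A business can belong to several categories; count each of them.
        for cat in r["categories"]:
            category_counts[cat] += 1
            stars_by_category[cat].append(r["stars"])

    most_availed = category_counts.most_common(1)[0][0] if category_counts else None
    avg_stars = {c: sum(v) / len(v) for c, v in stars_by_category.items()}
    return {
        "most_availed_category": most_availed,
        "avg_stars_by_category": avg_stars,
        "positive_reviews": label_counts["positive"],
        "negative_reviews": label_counts["negative"],
    }
```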

-- chose top 10 positive and top 10 negative reviews based on validation score for business with maximum reviews
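
One way to read this selection is sketched below. It assumes each review already carries a `label` from the classification step and a numeric `validation_score`, and that "top" means the highest score within each class; these field names and the function name are illustrative assumptions.

```python
from collections import Counter
from operator import itemgetter


def top_reviews_for_busiest_business(reviews, k=10):
    """Return (business_id, top-k positive, top-k negative) reviews."""
    counts = Counter(r["business_id"] for r in reviews)
    if not counts:
        return None, [], []
    # The business with the maximum number of reviews.
    busiest, _ = counts.most_common(1)[0]
    own = [r for r in reviews if r["business_id"] == busiest]

    score = itemgetter("validation_score")
    positive = sorted(
        (r for r in own if r["label"] == "positive"), key=score, reverse=True
    )[:k]
    negative = sorted(
        (r for r in own if r["label"] == "negative"), key=score, reverse=True
    )[:k]
    return busiest, positive, negative
```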

-- average review count and stars by city and category -- average review count and stars by state and category -- business attribute based analysis -- average stars for open and closed businesses -- top 15 business categories -- top 15 business categories - city-wise -- cities with most businesses -- businesses with the most 5-star ratings
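
The first of these aggregations, average review count and stars by city and category, can be sketched as a simple group-by. The function name is hypothetical and `categories` is assumed to be already split into a list; the same pattern applies to the state-level variant by swapping `city` for `state`.

```python
from collections import defaultdict


def average_by_city_and_category(businesses):
    """Average review_count and stars for every (city, category) pair."""
    # Per key: [sum of review_count, sum of stars, number of businesses].
    totals = defaultdict(lambda: [0, 0.0, 0])
    for b in businesses:
        for cat in b["categories"]:
            t = totals[(b["city"], cat)]
            t[0] += b["review_count"]
            t[1] += b["stars"]
            t[2] += 1
    return {
        key: {"avg_review_count": rc / n, "avg_stars": st / n}
        for key, (rc, st, n) in totals.items()
    }
```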

-- top 20 restaurants on yelp (viz) -- restaurants with most funny, cool, useful reviews (viz)

-- topic modeling using positive reviews for businesses in Pennsylvania

-- topic modeling using negative reviews for businesses in Ontario

-- extracted terms and topics from the model saved from topic modeling

-- most frequent words from tips and review for Earl (viz) -- most frequent words from tips and review for Ontario (viz) -- most frequent words from tips and review for top 20 restaurants (viz) -- most frequent words from tips and review for bottom 20 restaurants (viz)

-- wordcloud NGrams from tips review -- wordcloud NGrams from tips review for Arizona
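
The counting that feeds an n-gram word cloud can be sketched as follows. This is a minimal illustration with a simple regex tokenizer and no stop-word removal, both of which are assumptions; the resulting frequencies could then be handed to whichever word cloud renderer is used.

```python
import re
from collections import Counter


def ngram_counts(texts, n=2):
    """Count word n-grams across a collection of tip or review texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts
```

For example, `ngram_counts(texts, n=2).most_common(50)` gives the 50 most frequent bigrams in `texts`.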

-- converting the Parquet files produced by the ETL steps to JSON format for visualization purposes

Folders:

-- outputs of the review classification and of the ETL steps on the datasets will be stored here

-- outputs of all the visualizations will be stored here -- the Tableau workbook with visualizations of the analysis is stored here

-- all results of topic modeling will be saved here

-- all results of analysis will be stored here


Languages

Language:Python 100.0%