ARYD - Big Data Programming Project

Analysis and Recommendation on YELP dataset

Objective:

To provide useful insights for businesses from the YELP dataset through big data analytics, identifying strengths and weaknesses so that existing and future business owners can make decisions on new businesses or business expansion. Also, to provide recommendations to both business owners and users through extensive analysis of the data.

Project Overview:

The project involves analysis of the dataset, visualization based on that analysis, and recommendations. The major modules of the project are:

Validation of reviews on businesses based on user information.
Classification of positive and negative reviews using Machine Learning techniques.
Recommending location-based “buzzwords” to future business owners by analyzing positive and negative reviews for businesses in a state.
User-specific recommendations using the user’s history of availed services. Recommendations are based on the categories of the services, the location of the business, user reviews, and user ratings.

Analysis was done on the dataset to understand the correlation between different metrics, such as the location of a business and its success. Business trends were analyzed based on location, ratings, category, and attributes of the business. Trends among closed businesses were observed using user reviews and ratings.

A few visualizations for the project were done using Python libraries and are stored in the visualization folder. The remaining visualizations were done using Tableau and can be viewed here. View in full screen for a better experience.

Steps for execution:

The dataset for the project should be downloaded from the Yelp dataset challenge and stored in the yelp-dataset folder. The scripts should be executed in the order specified in the order_of_exec file.

Files:

business_etl.py

-- business location: outliers removed using Euclidean distance from the average location of businesses in the state (Data Cleaning)
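
The outlier removal described above could look roughly like the sketch below. It is not taken from business_etl.py: it uses plain Python instead of the project's ETL tooling, the function name remove_location_outliers and the max_distance threshold (in degrees) are hypothetical, and the field names are assumed to follow the Yelp business records.

```python
from math import hypot
from statistics import mean


def remove_location_outliers(businesses, max_distance):
    """Keep businesses within max_distance (degrees) of their state's average location.

    Hypothetical helper: the threshold and the record layout are assumptions.
    """
    # Group the business records by state.
    by_state = {}
    for business in businesses:
        by_state.setdefault(business["state"], []).append(business)

    kept = []
    for rows in by_state.values():
        # Average location of all businesses in this state.
        avg_lat = mean(row["latitude"] for row in rows)
        avg_lon = mean(row["longitude"] for row in rows)
        for row in rows:
            # Euclidean distance between the business and the state average.
            distance = hypot(row["latitude"] - avg_lat, row["longitude"] - avg_lon)
            if distance <= max_distance:
                kept.append(row)
    return kept
```

Treating latitude and longitude as flat coordinates keeps the sketch simple; a suitable threshold depends on how spread out each state's businesses are.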

user_etl.py

-- user's location
-- user validation score

review_classification.py

-- classification of reviews (Machine Learning)

review_etl.py

-- joined classes to reviews and dropped less useful columns

user_recomm.py

-- location based recommendations
-- category based recommendations
-- overall recommendations

user_analysis.py

-- most availed category of business by a user
-- average stars given by a user for each category
-- number of positive and negative reviews given by a user

top_reviews.py

-- chose the top 10 positive and top 10 negative reviews, based on validation score, for the business with the most reviews
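
As a rough illustration of this selection step, the sketch below picks the business with the most reviews and ranks its reviews by validation score. It is not the code in top_reviews.py: the function name and the label and validation_score fields are hypothetical stand-ins for the outputs of the classification and user validation steps.

```python
from collections import Counter


def top_reviews(reviews, n=10):
    """Return the busiest business and its top-n positive and negative reviews.

    Assumes each review is a dict with business_id, label ("positive" or
    "negative") and validation_score; these field names are hypothetical.
    """
    # Find the business with the largest number of reviews.
    counts = Counter(review["business_id"] for review in reviews)
    busiest, _ = counts.most_common(1)[0]

    # Rank that business's reviews from highest to lowest validation score.
    selected = [r for r in reviews if r["business_id"] == busiest]
    ranked = sorted(selected, key=lambda r: r["validation_score"], reverse=True)

    positive = [r for r in ranked if r["label"] == "positive"][:n]
    negative = [r for r in ranked if r["label"] == "negative"][:n]
    return busiest, positive, negative
```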

business_analysis.py

-- average review count and stars by city and category
-- average review count and stars by state and category
-- business attribute based analysis
-- average stars for open and closed businesses
-- top 15 business categories
-- top 15 business categories - city-wise
-- cities with most businesses
-- businesses with more 5-star ratings
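
The city-and-category averages can be sketched as a simple group-by. This is an illustration in plain Python, not the code in business_analysis.py; the function name is hypothetical, and it assumes each business record carries city, stars, review_count, and a list of category names.

```python
from collections import defaultdict


def average_by_city_and_category(businesses):
    """Map (city, category) to (average review count, average stars).

    Assumes "categories" is a list of strings per business (an assumption
    about the cleaned data, not something the README states).
    """
    # Per group: [sum of review counts, sum of stars, number of businesses].
    totals = defaultdict(lambda: [0, 0.0, 0])
    for business in businesses:
        for category in business["categories"]:
            group = totals[(business["city"], category)]
            group[0] += business["review_count"]
            group[1] += business["stars"]
            group[2] += 1

    return {
        key: (review_sum / count, star_sum / count)
        for key, (review_sum, star_sum, count) in totals.items()
    }
```

Swapping the city field for state in the grouping key would give the state-level variant.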

restaurant_analysis.py

-- top 20 restaurants on yelp (viz)
-- restaurants with most funny, cool, useful reviews (viz)

topic_mod_pos.py

-- topic modeling using positive reviews for businesses in Pennsylvania

topic_mod_neg.py

-- topic modeling using negative reviews for businesses in Ontario

topics.py

-- extracted terms and topics from the model saved from topic modeling

word_cloud.py

-- most frequent words from tips and reviews for Earl (viz)
-- most frequent words from tips and reviews for Ontario (viz)
-- most frequent words from tips and reviews for top 20 restaurants (viz)
-- most frequent words from tips and reviews for bottom 20 restaurants (viz)

ngram_word_cloud.py

-- wordcloud NGrams from tips review
-- wordcloud NGrams from tips review for Arizona
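
The n-gram frequencies behind such a word cloud can be computed as in the sketch below. It is only an illustration and not the code in ngram_word_cloud.py: the function name and the simple lowercase tokenizer are assumptions, and the rendering step is left out.

```python
import re
from collections import Counter


def ngram_counts(texts, n=2):
    """Count word n-grams across a collection of tip or review texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        # Slide a window of n tokens over the text; n-grams never span texts.
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts
```

The resulting counts can be handed to a word-cloud renderer that accepts a phrase-to-frequency mapping.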

converttojson.py

-- converting parquet ETLed files to JSON format for visualization purposes

shouvik19 / Analysis-and-Recommendations-on-YELP-Dataset