machine-learning machinelearning-python machine-learning-algorithms tf-idf text-processing text-vectorization data-preprocessing data-cleaning feature-extraction knn-classification svm neural-network tensorflow sklearn regression classification multinomial-naive-bayes logistic-regression linear-regression parameter-tuning

Yelp Business Stars’ Rating Prediction

https://colab.research.google.com/drive/1q5rvPOO8DvD8DV5DNLMVc8UDY7ntWHah

Traditional (Standard) AI Models : KNN | SVM | Logistic Regression | Multinomial Naive Bayes | Linear Regression

Deep Learning Models : Neural Network ( Regression & Classification )

Problem statement

Predict a business's star rating, on the 1-5 scale, from the text of the reviews users have written for it.

Machine Learning project aims

learn text vectorization (TF-IDF)
handle big data and preprocess it
merge two big datasets
treat the problem as both regression and classification, and observe the difference
apply traditional AI models and compare them with a deep learning neural network

Tools and Libraries used

sklearn
TensorFlow
Numpy
Pandas

Dataset

https://www.yelp.com/dataset/download

Load dataset

The dataset's JSON files were converted to a format that loads into a pandas data frame. The business.json and review.json files were used to understand the dataset. The reviews were grouped on business_id so that all the reviews users gave a business are combined into one text.
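A minimal sketch of the grouping step, assuming review records with `business_id` and `text` fields. The small in-memory frame stands in for review.json, and the output column name `all_reviews` is a hypothetical choice, not taken from the original notebook.

```python
import pandas as pd

# Toy stand-in for review.json. The real file would be loaded separately,
# for example with pd.read_json(path, lines=True) if it is line-delimited.
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "text": ["Great tacos.", "Slow service.", "Lovely coffee."],
})

# Concatenate every review of a business into one document.
reviews_by_business = (
    reviews.groupby("business_id")["text"]
    .agg(" ".join)
    .reset_index(name="all_reviews")
)
print(reviews_by_business)
```

Each business now contributes one row, which is the unit the later vectorization step works on.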

Merged the two datasets on business_id to get the final dataset.
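The merge could look roughly like the following. The toy frames and the choice of an inner join are assumptions made for illustration; the original notebook may have used different columns or a different join type.

```python
import pandas as pd

# Toy stand-ins for the business table and the grouped reviews.
business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "stars": [4.0, 3.5, 5.0],
    "review_count": [2, 1, 0],
})
reviews_by_business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "all_reviews": ["Great tacos. Slow service.", "Lovely coffee."],
})

# An inner join keeps only businesses that have review text.
merged = business.merge(reviews_by_business, on="business_id", how="inner")
print(merged.shape)  # (2, 4)
```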

Data Pre-Processing/ Cleaning

Dropped the rows whose categories value is null
Filtered the data frame further by removing businesses whose review count is below a certain threshold
Cleaned the review text by removing stop words, punctuation and extra white space
Used TF-IDF vectorization for feature extraction and set its parameters (max_features, stop_words, ngram_range, min_df, max_df)
Performed label encoding on the “stars” column (the output feature)
Normalized the “Review_count” column with min-max normalization to make it comparable with the other features
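The cleaning steps listed above can be sketched as follows. The threshold `MIN_REVIEWS`, the helper `clean_text`, the lowercase column names and the `stars_label` column are hypothetical; the original threshold value is not stated, and scikit-learn's built-in English stop-word list is assumed.

```python
import string

import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.DataFrame({
    "categories": ["Mexican", None, "Cafes", "Bars"],
    "review_count": [120, 45, 8, 60],
    "stars": [4.0, 3.5, 5.0, 2.5],
    "all_reviews": ["The tacos were GREAT!!", "ok",
                    "  Lovely   coffee. ", "Too loud, too pricey."],
})

MIN_REVIEWS = 20  # hypothetical threshold

# Drop rows without categories, then businesses with too few reviews.
df = df.dropna(subset=["categories"])
df = df[df["review_count"] >= MIN_REVIEWS].copy()


def clean_text(text):
    """Lowercase, strip punctuation, drop stop words, collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in ENGLISH_STOP_WORDS]
    return " ".join(words)


df["all_reviews"] = df["all_reviews"].apply(clean_text)

# Encode the star values as integer class labels.
df["stars_label"] = LabelEncoder().fit_transform(df["stars"].astype(str))

# Rescale review_count to the [0, 1] range.
df["review_count"] = MinMaxScaler().fit_transform(df[["review_count"]]).ravel()
print(df)
```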

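# TF-IDF vectorization (feature extraction)
import sklearn.feature_extraction.text as sk_text

# Keep at most 1000 unigram features; ignore English stop words and terms that
# appear in fewer than 5% or more than 85% of the documents.
Tfidf_vectorizer = sk_text.TfidfVectorizer(
    max_features=1000, lowercase=True, analyzer='word', stop_words='english',
    ngram_range=(1, 1), min_df=0.05, max_df=0.85)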
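As a usage sketch, a vectorizer with the same settings can be fitted on a small corpus to see the shape of the feature matrix. The four toy documents are made up for illustration and are not from the Yelp data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the grouped review texts.
corpus = [
    "great tacos friendly staff",
    "slow service cold food",
    "great coffee friendly staff",
    "cold coffee slow service",
]

vectorizer = TfidfVectorizer(
    max_features=1000, lowercase=True, analyzer="word", stop_words="english",
    ngram_range=(1, 1), min_df=0.05, max_df=0.85)

# Rows are documents, columns are vocabulary terms, values are TF-IDF weights.
X = vectorizer.fit_transform(corpus)
print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

The result is a sparse matrix, so it can be passed directly to most scikit-learn estimators without converting it to a dense array.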

Splitting the data

Split the data into 80% training and 20% test sets
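A sketch of the 80/20 split using scikit-learn's `train_test_split`. The toy arrays and the fixed `random_state` are assumptions added so the example is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels standing in for the TF-IDF features and stars.
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 5

# test_size=0.2 holds out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (16, 2) (4, 2)
```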

Regression Model

Linear Regression
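A minimal sketch of the regression baseline, assuming RMSE as the evaluation metric (the original metric is not stated here). The synthetic features stand in for the TF-IDF matrix, and the target is built to fall roughly in the 1-5 star range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends linearly on the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = 1.0 + 4.0 * X[:, 0] + rng.normal(0.0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Root mean squared error, in the same units as the star rating.
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(rmse)
```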

Neural Network Using Tensorflow

Used early stopping to prevent overfitting and a checkpointer to save the best model. Training was run in a loop several times so that a run stuck in a poor local minimum could be replaced by a better one.

Applied parameter tuning by changing the following:

Activation function : relu, sigmoid, tanh
Number of Dense layers
Number of neurons in each layer
Learning rate of the optimizer
Optimizer : SGD, Adamax, Adam, Adagrad
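The training loop described above might be sketched as follows with `tf.keras`. The `build_model` helper, its default arguments, the synthetic data, the patience value and the checkpoint file names are all assumptions for illustration; the original architecture and settings are not given here.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Synthetic regression data standing in for the TF-IDF features and stars.
rng = np.random.default_rng(0)
X = rng.random((200, 20)).astype("float32")
y = (1.0 + 4.0 * X[:, 0]).astype("float32")
X_train, X_val = X[:160], X[160:]
y_train, y_val = y[:160], y[160:]


def build_model(activation="relu", hidden_units=(32, 16), learning_rate=0.001):
    """Hypothetical helper exposing the tuning knobs listed above."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(X.shape[1],)))
    for units in hidden_units:
        model.add(tf.keras.layers.Dense(units, activation=activation))
    model.add(tf.keras.layers.Dense(1))  # single linear output for regression
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="mse")
    return model


checkpoint_dir = tempfile.mkdtemp()
best_val_loss, best_path = float("inf"), None

# Several runs from different random initializations; keep the best one.
for run in range(3):
    path = os.path.join(checkpoint_dir, f"run{run}.weights.h5")
    callbacks = [
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
        tf.keras.callbacks.ModelCheckpoint(
            path, monitor="val_loss", save_best_only=True,
            save_weights_only=True),
    ]
    model = build_model()
    history = model.fit(
        X_train, y_train, validation_data=(X_val, y_val),
        epochs=20, batch_size=32, callbacks=callbacks, verbose=0)
    run_best = min(history.history["val_loss"])
    if run_best < best_val_loss:
        best_val_loss, best_path = run_best, path

# Reload the weights of the best run into a fresh model.
best_model = build_model()
best_model.load_weights(best_path)
print(best_val_loss)
```

Saving one checkpoint per run and comparing the runs afterwards keeps the bookkeeping explicit, at the cost of a few extra files.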

Comparison

Classification Model

Logistic Regression

SVM

KNN

MNB
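The four classifiers named above could be trained and scored side by side as in this sketch. The synthetic two-class corpus, the default hyperparameters, the linear SVM kernel and the weighted F1 score are assumptions; the original project worked on the star labels and its settings are not given here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic reviews: label 1 draws from "good" words, label 0 from "bad" ones.
rng = np.random.default_rng(0)
good = ["great", "tasty", "friendly", "clean", "fresh"]
bad = ["slow", "rude", "cold", "dirty", "bland"]
texts, labels = [], []
for i in range(100):
    label = i % 2
    words = rng.choice(good if label else bad, size=4)
    texts.append(" ".join(words))
    labels.append(label)

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "MNB": MultinomialNB(),
}

# Fit each model and record its weighted F1 score on the test set.
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = f1_score(y_test, clf.predict(X_test), average="weighted")
print(scores)
```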

Boosting Performance

The output feature (the review rating) was grouped into three categories: low, medium and high. This significantly boosted the performance of the models applied above.
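Grouping the ratings into three classes can be sketched with `pandas.cut`. The bin edges below are hypothetical; the original cut-offs between low, medium and high are not stated.

```python
import pandas as pd

stars = pd.Series([1.0, 2.5, 3.0, 3.5, 4.0, 5.0])

# Hypothetical cut-offs: up to 2.5 is low, up to 3.5 is medium, above is high.
rating_class = pd.cut(
    stars, bins=[0, 2.5, 3.5, 5], labels=["low", "medium", "high"])
print(list(rating_class))
# ['low', 'low', 'medium', 'medium', 'high', 'high']
```

With fewer and coarser classes each class has more examples, which is one plausible reason the classifiers score higher on this target.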

KNN

Logistic Regression

SVM

Neural Network Using Tensorflow

Used early stopping to prevent overfitting and a checkpointer to save the best model. Training was run in a loop several times so that a run stuck in a poor local minimum could be replaced by a better one.

Applied parameter tuning by changing the following:

Activation function : relu, sigmoid, tanh
Number of Dense layers
Number of neurons in each layer
Learning rate of the optimizer
Optimizer : SGD, Adamax, Adam, Adagrad

Boosting Performance

The output feature (the review rating) was grouped into three categories: low, medium and high. This significantly boosted the performance of the model applied above.

Also applied grid search, using the Keras wrappers library, to pick the best optimizer from a given list for the best-performing model so far, judged by accuracy. Together these steps boost the performance enough to beat the standard AI classification models.
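The optimizer search can be illustrated with a plain loop over the candidate optimizers. This is a swapped-in simplification, not the Keras wrapper plus grid search used in the project, since the wrapper's import path depends on the installed Keras version. The synthetic three-class data, the small network and the epoch count are assumptions.

```python
import numpy as np
import tensorflow as tf

# Synthetic 3-class data standing in for the low/medium/high rating classes.
rng = np.random.default_rng(0)
X = rng.random((300, 10)).astype("float32")
y = np.digitize(X[:, 0], [1 / 3, 2 / 3])  # class ids 0, 1, 2
X_train, X_val = X[:240], X[240:]
y_train, y_val = y[:240], y[240:]

optimizers = {
    "SGD": tf.keras.optimizers.SGD,
    "Adamax": tf.keras.optimizers.Adamax,
    "Adam": tf.keras.optimizers.Adam,
    "Adagrad": tf.keras.optimizers.Adagrad,
}

# Train the same architecture once per optimizer and compare accuracy.
results = {}
for name, optimizer_cls in optimizers.items():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(X.shape[1],)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(
        optimizer=optimizer_cls(),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)
    _, accuracy = model.evaluate(X_val, y_val, verbose=0)
    results[name] = accuracy

best_optimizer = max(results, key=results.get)
print(results, best_optimizer)
```

A single validation split is used here for brevity; a cross-validated grid search would average over several splits instead.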

Comparison

Comparing the NN with the previously best-performing model, Logistic Regression

Comparing all classification models

Looking at the F1 scores, the NN clearly performs better than all the other models: Logistic Regression, SVM, KNN and MNB.

Mini Project 1 & 2

Mansi Patel

February 13, 2019

Prof : H. Chen

Class : CSC 215-01

About

The project covers text vectorization, handling big data by merging datasets, cleaning the text and selecting the required columns, and boosting performance through feature extraction and NN parameter tuning. It compares the performance of different models, treating the problem as both classification and regression.


