gopi3e / Amazon-Food-Reviews-Analysis-and-Modelling

[Machine Learning | Data Analysis] Data Analysis on Amazon Fine Food Reviews dataset.

Amazon-Food-Reviews-Analysis-and-Modelling

Performed Exploratory Data Analysis, Data Cleaning, Data Visualization and Text Featurization (BOW, tf-idf, Word2Vec). Built several ML models such as KNN, Naive Bayes, Logistic Regression, SVM, Random Forest, etc.

Objective:

Given a text review, determine the sentiment of the review, i.e. whether it is positive or negative.

Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

Notebooks are available in a more readable form on Jupyter Nbviewer: https://nbviewer.jupyter.org/github/cyanamous/Amazon-Food-Reviews-Analysis-and-Modelling/tree/master/

About Dataset

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10

Attribute Information:

  1. Id
  2. ProductId - unique identifier for the product
  3. UserId - unique identifier for the user
  4. ProfileName
  5. HelpfulnessNumerator - number of users who found the review helpful
  6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
  7. Score - rating between 1 and 5
  8. Time - timestamp for the review
  9. Summary - brief summary of the review
  10. Text - text of the review
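
As a rough illustration of the objective stated above, the snippet below loads the Kaggle CSV and derives a binary sentiment label from the Score column. The file name, the choice to drop 3-star reviews and the >3 cutoff are assumptions made for this sketch, not necessarily what the notebooks do.

```python
import pandas as pd

# Assumed local path to the Kaggle CSV; adjust to wherever the dataset is stored.
reviews = pd.read_csv("Reviews.csv")

# One common way to turn the 1-5 star Score into a binary sentiment label:
# treat 4-5 stars as positive, 1-2 as negative, and drop the neutral 3-star reviews.
reviews = reviews[reviews["Score"] != 3]
reviews["Sentiment"] = (reviews["Score"] > 3).astype(int)  # 1 = positive, 0 = negative

print(reviews["Sentiment"].value_counts())
```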

1 Amazon Food Reviews EDA, NLP and Text Preprocessing

  1. Defined Problem Statement
  2. Performed Exploratory Data Analysis (EDA) on the Amazon Fine Food Reviews dataset and plotted Word Clouds, Distplots, Histograms, etc.
  3. Performed Data Cleaning & Data Preprocessing by removing unnecessary and duplicate rows; for the text reviews, removed HTML tags, punctuation and stopwords, and stemmed the words using the Porter Stemmer (see the sketch after this list)
  4. Documented the concepts clearly
  5. Plotted t-SNE plots for different featurizations of the data, viz. BOW (uni-gram, bi-gram), tf-idf, Avg-Word2Vec (using a Word2Vec model pretrained on Google News) and tf-idf-weighted Word2Vec
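
A minimal sketch of the cleaning and featurization steps above, assuming the `reviews` DataFrame from the earlier snippet; the exact regexes, `min_df` and n-gram settings used in the notebooks may differ.

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords            # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_review(text):
    """Strip HTML tags and punctuation, drop stopwords, and stem the remaining words."""
    text = BeautifulSoup(text, "html.parser").get_text()
    text = re.sub(r"[^a-zA-Z]+", " ", text).lower()
    return " ".join(stemmer.stem(w) for w in text.split() if w not in stop_words)

cleaned = [clean_review(t) for t in reviews["Text"]]
y = reviews["Sentiment"].values

# Bag-of-Words with uni-grams and bi-grams, plus a tf-idf variant of the same text.
bow_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=10)
X_bow = bow_vectorizer.fit_transform(cleaned)
X_tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=10).fit_transform(cleaned)
```

Avg-Word2Vec and tf-idf-weighted Word2Vec would be built on the same cleaned text by averaging the pretrained Google News vectors of each review's words, optionally weighting each word by its tf-idf value.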

2 KNN

  1. Applied K-Nearest Neighbours on different featurizations of the data, viz. BOW (uni-gram, bi-gram), tf-idf, Avg-Word2Vec (using a Word2Vec model pretrained on Google News) and tf-idf-weighted Word2Vec
  2. Used both the brute-force & kd-tree implementations of KNN (see the sketch at the end of this section)
  3. Evaluated the test data on various performance metrics like accuracy, F1-score, precision, recall, etc., and plotted the confusion matrix using seaborn
Performance Table

Conclusions:
  1. The best accuracy of 85.107% is achieved with the Avg-Word2Vec featurization
  2. The kd-tree and brute-force implementations of KNN give very similar results
  3. KNN is a very slow algorithm compared to the others and takes a lot of time
  4. KNN did not fare well in terms of precision and F1-score; overall, KNN was not a good fit for this dataset
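
A minimal sketch of the KNN step, assuming a dense feature matrix `X` (e.g. Avg-Word2Vec vectors) and labels `y`; scikit-learn's kd-tree does not accept sparse input, so for BOW/tf-idf only the brute-force search applies. The neighbour grid and the random train/test split are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Tune k and compare the brute-force and kd-tree implementations.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 11, 21, 51], "algorithm": ["brute", "kd_tree"]},
    scoring="f1", cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```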

3 Naive Bayes

  1. Applied Naive Bayes, using both Bernoulli NB and Multinomial NB, on different featurizations of the data, viz. BOW (uni-gram, bi-gram), tf-idf, Avg-Word2Vec (using a Word2Vec model pretrained on Google News) and tf-idf-weighted Word2Vec
  2. Evaluated the test data on various performance metrics like accuracy, F1-score, precision, recall, etc., and plotted the confusion matrix using seaborn
  3. Printed the top 25 most important features for both negative and positive reviews (see the sketch at the end of this section)
Performance Table

Conclusions:
  1. The best thing about Naive Bayes is that it is much quicker than the other algorithms, with amazingly fast training times
  2. The best model is the bi-gram featurization, with an accuracy of 89.53% and a precision of 0.594
  3. Multinomial Naive Bayes does not work with negative feature values
  4. Naive Bayes fails miserably with the Word2Vec and tf-idf-weighted Word2Vec featurizations, as Word2Vec features are strongly dependent on one another while Naive Bayes assumes feature independence
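
A minimal sketch of Multinomial NB on the BOW features and the "top 25 features per class" step, assuming `X_bow`, `bow_vectorizer` and `y` from the earlier snippets; the smoothing parameter alpha would normally be tuned by cross-validation.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=1.0)   # alpha is illustrative; tune it via cross-validation
nb.fit(X_bow, y)

# feature_log_prob_ has one row per class; the largest values correspond to the
# n-grams most indicative of that class.
feature_names = np.array(bow_vectorizer.get_feature_names_out())
for cls_index, label in enumerate(["negative", "positive"]):
    top = np.argsort(nb.feature_log_prob_[cls_index])[-25:][::-1]
    print(label, feature_names[top])
```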

4 Logistic Regression

  1. Applied Logistic Regression on different featurizations of the data, viz. BOW (uni-gram, bi-gram), tf-idf, Avg-Word2Vec (using a Word2Vec model pretrained on Google News) and tf-idf-weighted Word2Vec
  2. Used both Grid Search & Randomized Search Cross Validation
  3. Evaluated the test data on various performance metrics like accuracy, F1-score, precision, recall, etc., and plotted the confusion matrix using seaborn
  4. Showed how sparsity increases as we increase lambda (or decrease C) when the L1 regularizer is used, for each featurization (see the sketch at the end of this section)
  5. Did a perturbation test to check whether the features are multicollinear
Performance Table

Conclusions:
  1. The features are multicollinear, i.e. they are correlated with one another
  2. The bi-gram featurization performs best, with an accuracy of 93.704% and an F1-score of 0.808
  3. Sparsity increases as we increase lambda (or decrease C) when the L1 regularizer is used
  4. Algorithms like SVM & Logistic Regression performed best on this data
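
A minimal sketch of the L1-regularized Logistic Regression and the sparsity check mentioned above, assuming `X_bow` and `y` from the earlier snippets; the C grid is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tune C with grid-search cross-validation on the L1-regularized model.
grid = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    scoring="f1", cv=5, n_jobs=-1)
grid.fit(X_bow, y)
print("best C:", grid.best_params_)

# Sparsity check: with L1, a smaller C (i.e. a larger lambda) drives more weights to zero.
for C in [0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X_bow, y)
    print(f"C={C}: {np.count_nonzero(clf.coef_)} non-zero weights out of {clf.coef_.size}")
```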

5 SVM

  1. Applied SVM with the RBF (radial basis function) kernel on different featurizations of the data, viz. BOW (uni-gram, bi-gram), tf-idf, Avg-Word2Vec (using a Word2Vec model pretrained on Google News) and tf-idf-weighted Word2Vec
  2. Used both Grid Search & Randomized Search Cross Validation
  3. Evaluated the test data on various performance metrics like accuracy, F1-score, precision, recall, etc., and plotted the confusion matrix using seaborn
  4. Evaluated SGDClassifier on the best-performing featurization (see the sketch at the end of this section)
Performance Table

Conclusions:
  1. Support Vector Machine (SVM) gave one of the best results, better than the other algorithms and close to Logistic Regression
  2. The tf-idf featurization (C=1000, gamma=0.005) gave the best results, with an accuracy of 91.667% and an F1-score of 0.733
  3. With the RBF kernel, the separating plane exists in another space, a result of the kernel transformation of the original space; its coefficients are not directly related to the input features, so we can't get feature importances
  4. Also tried SGDClassifier on the best featurization (tf-idf); it was very quick and gave about the same score in just seconds, with an accuracy of 91.04% and an F1-score of 0.734 (alpha=1e-05, penalty='l1')
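
A minimal sketch of the RBF-kernel SVM search and the SGDClassifier comparison, assuming tf-idf train/test splits `X_tfidf_train`, `X_tfidf_test`, `y_train`, `y_test`; the C/gamma ranges are illustrative, not the exact grids used in the notebook.

```python
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import f1_score

# RBF-kernel SVM tuned with randomized-search cross-validation.
svm_search = RandomizedSearchCV(
    SVC(kernel="rbf"),
    param_distributions={"C": [1, 10, 100, 1000],
                         "gamma": [0.0001, 0.001, 0.005, 0.01]},
    n_iter=8, scoring="f1", cv=3, n_jobs=-1, random_state=42)
svm_search.fit(X_tfidf_train, y_train)
print("SVM test F1:", f1_score(y_test, svm_search.predict(X_tfidf_test)))

# A linear SVM trained with SGD (hinge loss) as a much faster alternative.
sgd = SGDClassifier(loss="hinge", penalty="l1", alpha=1e-5, random_state=42)
sgd.fit(X_tfidf_train, y_train)
print("SGD test F1:", f1_score(y_test, sgd.predict(X_tfidf_test)))
```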

6 Decision Trees

  1. Applied Decision Trees on different featurizations of the data, viz. BOW (uni-gram, bi-gram), tf-idf, Avg-Word2Vec (using a Word2Vec model pretrained on Google News) and tf-idf-weighted Word2Vec
  2. Used Grid Search over 30 random points to find the best max_depth
  3. Evaluated the test data on various performance metrics like accuracy, F1-score, precision, recall, etc., and plotted the confusion matrix using seaborn
  4. Plotted the feature importances obtained from the decision tree classifier (see the sketch at the end of this section)
Performance Table

Conclusions:
  1. Decision Trees on the uni-gram, bi-gram and tf-idf featurizations would have taken forever if all dimensions had been used, as these featurizations are very high-dimensional; hence the search was capped at a max_depth of 300
  2. The bi-gram featurization (max_depth=73) gave the best results, with an accuracy of 85.11% and an F1-score of 0.513
  3. Plotted feature importances for uni-gram, bi-gram and tf-idf, but not for Avg-Word2Vec and tf-idf-weighted Word2Vec, as Word2Vec features are highly correlated and their feature importances cannot be interpreted directly
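
A minimal sketch of the max_depth search and the feature-importance plot, assuming `X_bow`, `y` and `feature_names` from the earlier snippets; the depth grid is illustrative rather than the 30 random points actually used.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [10, 30, 73, 150, 300]},   # illustrative depth grid
    scoring="f1", cv=5, n_jobs=-1)
grid.fit(X_bow, y)
best_tree = grid.best_estimator_

# Plot the top 20 features by importance from the fitted tree.
top = np.argsort(best_tree.feature_importances_)[-20:]
plt.barh(feature_names[top], best_tree.feature_importances_[top])
plt.xlabel("feature importance")
plt.tight_layout()
plt.show()
```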
