Recommendation Systems

Note: The recommendation system (RS) uses data from yelp.com.

train_review.json – the main file, containing the review data; the RS works primarily with this file.
test_review.json – contains only the target user and business pairs for the prediction tasks.
test_review_ratings.json – contains the ground-truth ratings for the test pairs.
stopwords – contains common stopwords that are used when calculating TF-IDF scores.
The file is first preprocessed using Apache Spark.

Collaborative Filtering: A collaborative filtering (CF) recommendation system with two cases: item-based CF and user-based CF.
1. Item-based CF: the RS computes the Pearson correlation for business pairs with at least three co-rated users and uses the 3 or 5 neighbors most similar to the target business.
2. User-based CF: MinHash and LSH are used first to identify similar users, which reduces the number of pairs for which the Pearson correlation must be computed. After identifying similar users by their Jaccard similarity, the RS computes the Pearson correlation for all candidate user pairs and makes the prediction.
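The item-based case above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the choice to compute means over co-rated users only, and the decision to keep only positively correlated neighbors are all assumptions.

```python
from math import sqrt


def pearson(ratings_a, ratings_b, min_co_rated=3):
    """Pearson correlation between two businesses over their co-rated users.

    ratings_a and ratings_b map user_id -> stars. Returns None when fewer
    than min_co_rated users rated both, or when the correlation is undefined.
    """
    common = set(ratings_a) & set(ratings_b)
    if len(common) < min_co_rated:
        return None
    # Assumption: means are taken over the co-rated users only.
    mean_a = sum(ratings_a[u] for u in common) / len(common)
    mean_b = sum(ratings_b[u] for u in common) / len(common)
    num = sum((ratings_a[u] - mean_a) * (ratings_b[u] - mean_b) for u in common)
    norm_a = sqrt(sum((ratings_a[u] - mean_a) ** 2 for u in common))
    norm_b = sqrt(sum((ratings_b[u] - mean_b) ** 2 for u in common))
    den = norm_a * norm_b
    return num / den if den else None


def predict_item_based(user_ratings, target, item_ratings, n_neighbors=3):
    """Predict the stars one user would give the business `target`.

    user_ratings maps business_id -> stars for that user; item_ratings maps
    business_id -> {user_id: stars} for every business.
    """
    weighted = []
    for business, stars in user_ratings.items():
        if business == target:
            continue
        w = pearson(item_ratings[target], item_ratings[business])
        # Assumption: only positively correlated neighbors contribute.
        if w is not None and w > 0:
            weighted.append((w, stars))
    top = sorted(weighted, reverse=True)[:n_neighbors]
    if not top:
        return None  # no usable neighbor; the caller needs a fallback
    return sum(w * s for w, s in top) / sum(w for w, _ in top)
```

The prediction is a similarity-weighted average of the user's own ratings on the `n_neighbors` most similar businesses; setting `n_neighbors` to 3 or 5 matches the neighbor counts mentioned above.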
Content-Based Recommendation Sys: A content-based RS that generates profiles for users and businesses from the review texts in the train_review.json file. Algorithms used: TF-IDF scoring and cosine similarity.
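A minimal sketch of TF-IDF profiles and cosine similarity in plain Python. The function names are hypothetical, and the weighting (term frequency normalized by the most frequent term, IDF as log2(N / df)) is one common textbook variant assumed here; the project may use a different one.

```python
import math
import re
from collections import Counter


def tokenize(text, stopwords):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]


def tfidf_profiles(docs, stopwords=frozenset()):
    """docs maps an id to its concatenated review text.

    Returns id -> {word: tf-idf score}.
    """
    tokens = {doc_id: tokenize(text, stopwords) for doc_id, text in docs.items()}
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokens.values() for w in set(toks))
    n_docs = len(docs)
    profiles = {}
    for doc_id, toks in tokens.items():
        counts = Counter(toks)
        max_count = max(counts.values(), default=1)
        profiles[doc_id] = {
            w: (c / max_count) * math.log2(n_docs / df[w])
            for w, c in counts.items()
        }
    return profiles


def cosine_similarity(p, q):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * q[w] for w, v in p.items() if w in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0
```

Storing each profile as a sparse dict keeps the memory footprint proportional to the number of distinct words in a profile rather than to the whole vocabulary.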
Finding Similar Items: Finds similar business pairs in the train_review.json file. Algorithms used: MinHash, Locality Sensitive Hashing (LSH), and Jaccard similarity.
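A rough sketch of the MinHash + LSH pipeline, assuming each business is represented by the set of users who reviewed it. The hash family, the band and row counts, and the similarity threshold are illustrative assumptions, not the values used in this project.

```python
import random
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Exact Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def minhash_signature(index_set, hash_params, prime, n_rows):
    """One value per hash function: the minimum hashed index in the set."""
    return [
        min(((a * x + b) % prime) % n_rows for x in index_set)
        for a, b in hash_params
    ]


def lsh_candidates(signatures, bands, rows):
    """Pairs of keys whose signatures are identical in at least one band."""
    candidates = set()
    for i in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            buckets[tuple(sig[i * rows:(i + 1) * rows])].append(key)
        for keys in buckets.values():
            candidates.update(combinations(sorted(keys), 2))
    return candidates


def similar_pairs(item_sets, bands=30, rows=1, threshold=0.05, seed=0):
    """item_sets maps business_id -> non-empty set of user_ids."""
    rng = random.Random(seed)
    universe = sorted(set().union(*item_sets.values()))
    index = {u: i for i, u in enumerate(universe)}
    prime = 2_147_483_647  # a large prime for the hash family
    params = [
        (rng.randrange(1, prime), rng.randrange(0, prime))
        for _ in range(bands * rows)
    ]
    signatures = {
        key: minhash_signature({index[u] for u in users}, params, prime, len(universe))
        for key, users in item_sets.items()
    }
    result = []
    for k1, k2 in sorted(lsh_candidates(signatures, bands, rows)):
        # Candidates are verified with the exact Jaccard similarity.
        sim = jaccard(item_sets[k1], item_sets[k2])
        if sim >= threshold:
            result.append({"b1": k1, "b2": k2, "sim": sim})
    return result
```

Because every candidate is checked against the exact Jaccard similarity, the sketch reports no false positives; LSH can only miss true pairs, so the banding parameters mainly affect recall.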
Hybrid Recommendation Sys: A hybrid recommendation system that combines several different models to jointly produce the best result. This project ranked third in the USC Data Mining (Recommendation System) Competition 2021, with a final score of 2709 and an RMSE of 1.1498.

Similar Items:
- b1 and b2 are the business ids
- sim is the jaccard similarity of b1 and b2
Content-based RS:
- a user_id and business_id pair means 'this user would prefer to review this business'
- sim is the calculated (predicted) cosine similarity between the profile vectors
User-based CF Pearson Correlation Model:
- u1 and u2 are the user ids
- sim is the Pearson Correlation between these two users
Item-based CF Pearson Correlation Model:
- b1 and b2 are the business ids
- sim is the Pearson Correlation between these two businesses
CF prediction result:
- user_id and business_id stand for 'this user will likely rate this business with this many stars'
- stars is simply the predicted rating
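As an illustration of the fields described above, here is a small sketch that writes one record per line as JSON. The one-object-per-line layout is an assumption, and every id and value below is a made-up placeholder, not project output.

```python
import json

# Hypothetical placeholder records, one per output type described above.
example_records = {
    "similar_items": {"b1": "business_A", "b2": "business_B", "sim": 0.25},
    "content_based": {"user_id": "user_X", "business_id": "business_A", "sim": 0.12},
    "user_based_model": {"u1": "user_X", "u2": "user_Y", "sim": 0.8},
    "item_based_model": {"b1": "business_A", "b2": "business_B", "sim": 0.6},
    "cf_prediction": {"user_id": "user_X", "business_id": "business_A", "stars": 4.2},
}


def to_json_lines(records):
    """Serialize an iterable of dicts as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


print(to_json_lines(example_records.values()))
```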

Similar business pairs
1. precision: 1.0
2. recall: 0.9582400942205771
Content-based RS
1. precision (test set): 1.0
2. recall (test set): 0.999469477863536
CF model
1. item-based CF model
  1. precision: 0.9641450981844213
  2. recall: 0.9805068470797926
2. user-based CF model
  1. precision: 0.9573746593617223
  2. recall: 0.8276633759390503
CF prediction
1. item-based RMSE (test set): 0.9023539405054186
2. user-based RMSE (test set): 0.9901023647008427
Hybrid Recommendation System:
1. Blind test set RMSE: 1.1498
2. Test set RMSE: 1.14166
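For reference, metrics of this kind can be computed along the following lines. This is a sketch under assumptions: predicted and ground-truth pairs are compared as sets of already canonicalized tuples, and RMSE is taken over the pairs present in both inputs. How precision and recall are defined for each task in this project is not stated here.

```python
from math import sqrt


def precision_recall(predicted, truth):
    """Precision and recall of a predicted collection of pairs against the truth."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def rmse(predictions, ground_truth):
    """Root mean squared error over (user_id, business_id) pairs in both dicts."""
    keys = [k for k in ground_truth if k in predictions]
    if not keys:
        raise ValueError("no overlapping pairs to score")
    squared_error = sum((predictions[k] - ground_truth[k]) ** 2 for k in keys)
    return sqrt(squared_error / len(keys))
```

For example, `rmse({("u1", "b1"): 4.0}, {("u1", "b1"): 5.0})` evaluates to 1.0, since the single prediction is off by one star.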

About

Hybrid RecSys, CF-based RecSys, Model-based RecSys, Content-based RecSys, Finding similar items using Jaccard similarity
