Implementing a Recommender System Using Matrix Factorization Collaborative Filtering
In this project, the goal is to recommend the top 5 movies to a user with matrix factorization, using the MovieLens 20M dataset. You can download the dataset from Kaggle. The project consists of four steps; run the corresponding .py files in this order:
- Preprocess the data (preprocess.py)
- Data analysis (analyzie.py)
- Create model (learning.py)
- Predict user rating (predict.py)
Since processing 20 million ratings takes a long time, we use a subset of the dataset. The first step is to shrink the data to a manageable size by keeping only the users and movies with the most ratings. The identifiers are then remapped so that they run contiguously from 0 to N-1. Finally, we shuffle the data and split it into training and test sets. The result is shown below:
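The contents of preprocess.py are not reproduced here, so the following is only a rough sketch of the shrink, re-index, shuffle, and split steps. It assumes NumPy and a ratings array with columns (user id, movie id, rating); the function name `preprocess` and the parameters `n_users`, `n_movies`, and `test_frac` are hypothetical.

```python
import numpy as np

def preprocess(ratings, n_users, n_movies, test_frac=0.2, seed=0):
    """Sketch only. ratings: array of shape (n, 3) = (user_id, movie_id, rating)."""
    ratings = np.asarray(ratings, dtype=float)
    users = ratings[:, 0].astype(int)
    movies = ratings[:, 1].astype(int)

    # 1) Shrink: keep the users and movies that have the most ratings.
    u_ids, u_counts = np.unique(users, return_counts=True)
    m_ids, m_counts = np.unique(movies, return_counts=True)
    top_users = u_ids[np.argsort(-u_counts, kind="stable")[:n_users]]
    top_movies = m_ids[np.argsort(-m_counts, kind="stable")[:n_movies]]
    keep = np.isin(users, top_users) & np.isin(movies, top_movies)
    subset = ratings[keep]

    # 2) Id correction: map the surviving ids onto 0..N-1.
    _, new_users = np.unique(subset[:, 0], return_inverse=True)
    _, new_movies = np.unique(subset[:, 1], return_inverse=True)
    data = np.column_stack([new_users, new_movies, subset[:, 2]])

    # 3) Shuffle the rows, then split into training and test sets.
    rng = np.random.default_rng(seed)
    rng.shuffle(data)
    cut = int(len(data) * (1 - test_frac))
    return data[:cut], data[cut:]
```

The 80/20 default split is an assumption; the actual ratio used in the project may differ.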
The distributions of key attributes, such as ratings, movie genres, and movie release years, are plotted to give a better understanding of the data.
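As an illustration, the counts behind such plots could be computed as below. This is a sketch using only the standard library. It assumes the MovieLens convention that a movie title ends with its release year in parentheses and that genres are separated by `|`; the helper name `summarize` is hypothetical, and plotting is left out.

```python
import re
from collections import Counter

# Matches a four-digit year in parentheses at the end of a title.
YEAR_RE = re.compile(r"\((\d{4})\)\s*$")

def summarize(movies, ratings):
    """movies: iterable of (title, genres) pairs; ratings: iterable of numbers."""
    rating_counts = Counter(ratings)
    genre_counts = Counter()
    year_counts = Counter()
    for title, genres in movies:
        # A movie with several genres is counted once per genre.
        genre_counts.update(genres.split("|"))
        match = YEAR_RE.search(title)
        if match:  # titles without a year are skipped
            year_counts[int(match.group(1))] += 1
    return rating_counts, genre_counts, year_counts
```

Each returned `Counter` can be passed to a plotting library as a bar chart of keys against counts.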
In this section, the model is created and trained for 25 epochs. Afterwards, we plot the loss function, which is Mean Squared Error (MSE) in this project.
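The code in learning.py is not shown here, so the following is a minimal sketch of one common way to train a matrix factorization model, not necessarily the one used in this project. It assumes NumPy, plain stochastic gradient descent on the squared error, and a global-mean offset; the function name `train_mf` and the hyperparameters `k`, `lr`, and `reg` are hypothetical.

```python
import numpy as np

def train_mf(train, n_users, n_movies, k=10, epochs=25, lr=0.01, reg=0.02, seed=0):
    """Sketch only. train: array of shape (n, 3) = (user_id, movie_id, rating)."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, k))   # user factors
    Q = rng.normal(scale=0.1, size=(n_movies, k))  # movie factors
    u = train[:, 0].astype(int)
    m = train[:, 1].astype(int)
    r = train[:, 2]
    mu = r.mean()  # global mean rating (an assumption, not from the project)

    history = []
    for _ in range(epochs):
        for idx in rng.permutation(len(train)):
            ui, mi = u[idx], m[idx]
            err = r[idx] - (mu + P[ui] @ Q[mi])
            p_old = P[ui].copy()
            # Gradient step on the squared error with L2 regularization.
            P[ui] += lr * (err * Q[mi] - reg * P[ui])
            Q[mi] += lr * (err * p_old - reg * Q[mi])
        # Record the training MSE at the end of each epoch.
        pred = mu + np.sum(P[u] * Q[m], axis=1)
        history.append(float(np.mean((r - pred) ** 2)))
    return P, Q, mu, history
```

The returned `history` holds one MSE value per epoch, which is the quantity a loss plot would show.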
After the 25th epoch:
For a specific user, predicted ratings are generated for the movies the user has not seen. We then recommend the top 5 movies that the user might like.
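The recommendation step can be sketched as follows, assuming NumPy and factor matrices `P` (users) and `Q` (movies) whose dot product gives the predicted rating. The function name `recommend_top_n` and the `seen_movies` argument are hypothetical and do not come from predict.py.

```python
import numpy as np

def recommend_top_n(user_id, P, Q, seen_movies, n=5):
    """Return up to n (movie_id, predicted_score) pairs for unseen movies."""
    # Predicted score of every movie for this user.
    scores = (Q @ P[user_id]).astype(float)
    # Exclude movies the user has already rated.
    seen = np.array(list(seen_movies), dtype=int)
    scores[seen] = -np.inf
    # Sort by descending score and keep the best n unseen movies.
    top = np.argsort(-scores)[:n]
    return [(int(mv), float(scores[mv])) for mv in top if np.isfinite(scores[mv])]
```

For example, `recommend_top_n(0, P, Q, seen_movies={3})` ranks every movie except movie 3 for user 0 and returns the five with the highest predicted scores.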