- Introduction
- Recommendation System Pipeline
- Data Collection
- Data Pre-Processing
- Recommeder Model
- Result
- Summary
- References
You might have heard the term “Recommendation System (RS)” when YouTubers are discussing the latest tactics to get more views or when you or your friends compare the “Recommended for you” list on Netflix. In a nutshell, recommendation systems recommend things that the people might like based on your own watch history or you and friends watch history as a collective. From a non-Data Science practitioner’s perspective, that’s pretty much all you need to know about RS… Oh, but you ARE a Data Science practitioner. If that’s the case, we need to dig a little bit deeper into the ideas and code behind a recommendation system.
Before we deep dive into the RS and the code, let’s take a step back and clarify some questions:
What is a recommendation system? As I have mentioned before, the goal of a recommendation algorithm is to recommend or predict items a user might like based on their data or based on the entire user database. We will talk about precisely how to do this and the different types of RS later on, but for now, here’s a conceptual pipeline to show the process of recommending a song.
Why should we use a recommendation for song prediction tasks? The other question is why recommendation systems are used in a song prediction task. As you read in Part II, it is possible to use a cluster-based algorithm to predict the songs, however, it lacks the flexibility to add other features to the system, such as a classification predictor. In other words, a clustered-based algorithm is one type of recommendation system. However, compared to the two other types of RS introduced below, a cluster-based algorithm lacks flexibility. In fact, Both content-based filtering and collaborative filtering can include the clustering outcome into the models, creating a hybrid RS.
Spotify keeps a lot of data on its songs internally, that we can access through the Spotify API. The Spotify API is a great public tool, allowing the use of Spotify’s wealth of data on music to build many kinds of systems. I used this API through Python’s Spotipy package to extract data from unique song identifiers.
Data Cleaning
I used content-based filtering algorithm to make recommendations
Two steps were implemented in the algorithm. These two steps are needed every time some enters a new playlist query:
- Playlist Summarization:
Added all the feature values of each song in the playlist together to create a summarization vector
- Similarity and Recommendation:
After retrieving the playlist summarized vector and the non-playlist songs, we can find the similarity between each individual song in the database and the playlist. The similarity metric chosen is cosine similarity.
Cosine similarity is a mathematical value that measures the similarities between vectors. Imagining our songs vectors as only two-dimensional, the visual representation would look similar to the figure below.
I used the cosine_similarity() function from scikit learn to measure the similarity between each song and the summarized playlist vector.
One of the biggest problems with this model is that there are almost no metrics to evaluate whether the recommendation is good or bad. Since in this project, a clustering-based approach was conducted in Part II. We’ve decided to compare the two recommendations to see if there is any consensus or not.
Comparison
Top 20 songs in Mom’s playlist (left) and the recommended result by two recommendation algorithms (right). Here, the two methods are the content-based filtering approach and the clustering method using K-nearest neighbors.
In this project, I first learned the basics of recommendation systems, involving the two general approaches of content-based filtering and collaborative filtering methods. Then, built a content-based filtering recommendation system using Spotify playlists data from scratch. This involves the procedure of feature preprocessing, feature generation, and recommendation modeling. Lastly, I compared the recommendation with other song recommendation systems, specifically, the clustering methodology, and see the difficulties in measuring success for recommendation systems in an open-source environment.