Welcome to the Movie Recommender System project! This project utilizes a content-based recommender system to suggest movies based on user preferences. The backend is hosted on PythonAnywhere, and the frontend is hosted on Netlify.
- What is a Recommender System?
- Types of Recommender Systems
- Project Flow
- Data Preprocessing
- Recommender System Logic
- TF-IDF and Cosine Similarity
- Folder Structure
- Running the Project
A recommender system predicts the preference of a user for a particular item. Popular platforms like Spotify, YouTube, Facebook, Instagram, and Netflix use recommender systems to enhance user experience by suggesting relevant content.
Content-based recommender systems suggest items similar to those a user has liked in the past, based on item features. Tags and keywords are used to capture the content similarity.
Collaborative filtering systems recommend items based on the preferences of similar users. If users A and B have similar tastes, a movie liked by A is likely to be recommended to B.
Hybrid recommender systems combine content-based and collaborative filtering approaches to leverage the strengths of both methods.
- Data Preprocessing: Clean and prepare the data.
- Model Building: Develop a model based on the preprocessed data.
- Frontend Development: Create a user-friendly interface.
- Backend Integration: Seamlessly integrate the model with the web application.
- Deployment: Host the application online.
The preprocessing steps are implemented in preprocessor.py and include:
- Merging two datasets (tmdb_5000_movies.csv and tmdb_5000_credits.csv) accessed from this Kaggle dataset.
- Removing unwanted columns and renaming columns for clarity.
- Converting genres, keywords, and cast to lists of strings.
- Removing stopwords from the overview.
- Creating a combined tags column from genres, keywords, cast, and overview.
- Converting all tags to lowercase and combining them into a single string.
- Saving the processed data to processed_movie_data.csv.
# Example preprocessing function
def preProcessData(movie_data, creds_data):
    # ... (code from preprocessor.py)
    return movie_data
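The list-conversion, stop-word, and tag-building steps above can be sketched on a single row using only the standard library. This is an illustration under stated assumptions, not the code from preprocessor.py: it assumes the raw genres, keywords, and cast columns hold JSON-encoded lists of objects with a name field, and the helper names (names_from_json, remove_stopwords, build_tags), the cast_limit parameter, the tiny STOPWORDS set, and the sample row are all hypothetical.

```python
import json

# Hypothetical, deliberately tiny stop-word list; the project's actual list is not shown.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "is", "his", "her"}


def names_from_json(raw, limit=None):
    """Turn a JSON string like '[{"id": 28, "name": "Action"}]' into ['Action']."""
    names = [item["name"] for item in json.loads(raw)]
    return names[:limit] if limit is not None else names


def remove_stopwords(text):
    """Drop stop words from a free-text overview (case-insensitive match)."""
    return [word for word in text.split() if word.lower() not in STOPWORDS]


def build_tags(row, cast_limit=3):
    """Combine genres, keywords, cast, and overview into one lowercase string."""
    parts = (
        names_from_json(row["genres"])
        + names_from_json(row["keywords"])
        + names_from_json(row["cast"], limit=cast_limit)
        + remove_stopwords(row["overview"])
    )
    return " ".join(parts).lower()


# Illustrative row, not copied from the dataset.
row = {
    "genres": '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]',
    "keywords": '[{"id": 1, "name": "space war"}]',
    "cast": '[{"name": "Sam Worthington"}, {"name": "Zoe Saldana"}]',
    "overview": "A marine is sent to the moon Pandora",
}
tags = build_tags(row)
print(tags)
```

In a DataFrame-based pipeline the same function could be applied row by row to produce the tags column; whether preprocessor.py limits the cast list at all is not stated in this README.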
The logic of the recommender system is implemented in recommender.py. Key functions include:
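Converting strings to lowercase and removing hyphens, apostrophes, periods, commas, and whitespace so that titles compare consistently.
import re

def normalize_string(s):
    # Strip the listed punctuation and all whitespace, then lowercase.
    return re.sub(r"[-'.,\s\t]", "", s).lower()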
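A quick check of what the normalization does and does not strip may help here. The function is repeated so the snippet runs on its own; the sample titles are arbitrary.

```python
import re


def normalize_string(s):
    # Same character set as the function above.
    return re.sub(r"[-'.,\s\t]", "", s).lower()


for title in ["Spider-Man 2", "Ocean's Eleven", "  the dark knight ", "Star Wars: Episode I"]:
    print(repr(title), "->", repr(normalize_string(title)))
```

Characters outside the pattern, such as the colon in the last title, are kept, so a query has to match the stored title on those characters.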
Fetching movie posters using The Movie Database (TMDb) API.
def fetch_poster(movie_id):
    api_key = 'YOUR_API_KEY'
    url = f'https://api.themoviedb.org/3/movie/{movie_id}?api_key={api_key}&language=en-US'
    # ... (API request and error handling)
    return poster_url
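The request and error handling are elided above. One way that step might look is sketched below, assuming the movie-details response carries a poster_path field and that images are served from image.tmdb.org, as in TMDb's public API. It uses urllib from the standard library as a stand-in for whatever HTTP client recommender.py actually uses, and it takes api_key as an argument instead of hard-coding it. The names poster_url_from_payload, IMAGE_BASE_URL, and timeout are hypothetical.

```python
import json
import urllib.request

# TMDb serves images from a separate host; "w500" is one of its width presets.
IMAGE_BASE_URL = "https://image.tmdb.org/t/p/w500"


def poster_url_from_payload(payload):
    """Build a full poster URL from a movie-details response, or None if absent."""
    poster_path = payload.get("poster_path")
    return f"{IMAGE_BASE_URL}{poster_path}" if poster_path else None


def fetch_poster(movie_id, api_key, timeout=5):
    """Return the poster URL for movie_id, or None if the request or parsing fails."""
    url = (
        f"https://api.themoviedb.org/3/movie/{movie_id}"
        f"?api_key={api_key}&language=en-US"
    )
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers network and HTTP errors; ValueError covers malformed JSON.
        return None
    return poster_url_from_payload(payload)
```

Splitting the URL construction from the network call keeps the parsing logic testable without an API key.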
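Finding the index of a movie in the DataFrame based on its title.
def find_movie_index(title, data):
    normalized_title = normalize_string(title)
    # iterrows() yields index labels; these match row positions in the
    # similarity matrix only if the DataFrame has a default RangeIndex.
    for idx, row in data.iterrows():
        if normalize_string(row['title']) == normalized_title:
            return idx
    return None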
Getting top N similar movies based on cosine similarity.
def get_similar_movies(title, data, similarity_matrix, top_n=10):
    movie_idx = find_movie_index(title, data)
    if movie_idx is None:
        return None
    # Pair each movie's position with its similarity to the query movie.
    sim_scores = list(enumerate(similarity_matrix[movie_idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry, which is normally the query movie itself (similarity 1.0).
    top_similar_movies = sim_scores[1:top_n + 1]
    # ... (code to fetch movie details and posters)
    return similar_movies
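The ranking step can be tried on a small hand-written matrix. The sketch below contrasts the slice-based approach used above with a variant that filters out the query movie by index; top_similar_indices is a hypothetical helper and the matrix values are made up. Rows 0 and 3 stand for two movies with identical tags.

```python
def top_similar_indices(similarity_matrix, movie_idx, top_n=10):
    """Return (index, score) pairs for the top_n most similar rows, excluding movie_idx."""
    scores = [
        (idx, score)
        for idx, score in enumerate(similarity_matrix[movie_idx])
        if idx != movie_idx  # drop the query movie explicitly
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


similarity_matrix = [
    [1.0, 0.8, 0.1, 1.0],
    [0.8, 1.0, 0.3, 0.8],
    [0.1, 0.3, 1.0, 0.1],
    [1.0, 0.8, 0.1, 1.0],
]

# Slice-based ranking: sort everything, then skip the first entry.
sliced = sorted(enumerate(similarity_matrix[3]), key=lambda x: x[1], reverse=True)[1:3]
# Index-based ranking: remove the query movie before sorting.
filtered = top_similar_indices(similarity_matrix, movie_idx=3, top_n=2)

print("sliced:  ", sliced)
print("filtered:", filtered)
```

When another movie ties with the query at similarity 1.0, the stable sort can place that movie first, so the slice drops it and keeps the query movie instead. Filtering by index avoids that edge case at no extra cost.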
Recommending top N movies based on the query movie title, falling back to the top N movies by score when the title is not found.
def recommend_movies(query, data, similarity_matrix, top_n=10):
    movie_idx = find_movie_index(query, data)
    if movie_idx is not None:
        recommendations = get_similar_movies(query, data, similarity_matrix, top_n)
        return {'recommendations': recommendations, 'movie_found': True}
    else:
        top_movies = data.nlargest(top_n, 'score')[['title', 'movie_id', 'genres', 'score']]
        # ... (code to fetch movie details and posters)
        return {'recommendations': recommendations, 'movie_found': False}
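The found-or-fallback shape of the result can be illustrated without pandas. This is a simplified sketch, not the project's function: recommend, similar_by_title, and the sample data are hypothetical, plain dictionaries stand in for the DataFrame, and heapq.nlargest stands in for DataFrame.nlargest.

```python
import heapq


def recommend(query, movies, similar_by_title, top_n=10):
    """Return similar titles when the query is known, else the top-scored titles."""
    if query in similar_by_title:
        return {"recommendations": similar_by_title[query][:top_n], "movie_found": True}
    fallback = heapq.nlargest(top_n, movies, key=lambda movie: movie["score"])
    return {
        "recommendations": [movie["title"] for movie in fallback],
        "movie_found": False,
    }


movies = [
    {"title": "Movie A", "score": 7.1},
    {"title": "Movie B", "score": 8.4},
    {"title": "Movie C", "score": 6.0},
]
similar_by_title = {"Movie A": ["Movie C", "Movie B"]}

found = recommend("Movie A", movies, similar_by_title, top_n=1)
missing = recommend("Unknown", movies, similar_by_title, top_n=2)
print(found)
print(missing)
```

The movie_found flag lets a caller tell genuine recommendations apart from the generic top-scored fallback.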
Building the TF-IDF model and computing cosine similarity.
def build_recommendation_model(movie_data):
    tfidf = TfidfVectorizer(max_features=5000)
    tfidf_matrix = tfidf.fit_transform(movie_data['tags']).toarray()
    cosine_sim = cosine_similarity(tfidf_matrix)
    return cosine_sim
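TF-IDF (Term Frequency-Inverse Document Frequency) converts each movie's tags into a numerical vector in which a word's weight grows with how often it appears in that movie's tags and shrinks with how many other movies also use it. Cosine similarity is the cosine of the angle between two such vectors; because TF-IDF weights are non-negative, it ranges from 0 (no shared terms) to 1 (same direction), indicating how similar two movies are based on their tags.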
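from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_recommendation_model(movie_data):
    # Keep only the 5000 most frequent terms across all tags.
    tfidf = TfidfVectorizer(max_features=5000)
    # cosine_similarity also accepts the sparse matrix directly, so .toarray()
    # is optional here and mainly increases memory use.
    tfidf_matrix = tfidf.fit_transform(movie_data['tags']).toarray()
    # Square matrix: entry [i][j] is the similarity between movies i and j.
    cosine_sim = cosine_similarity(tfidf_matrix)
    return cosine_sim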
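To make the two ideas concrete, here is a small pure-Python sketch that computes TF-IDF vectors and cosine similarity by hand for three made-up tag strings. It uses the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which matches TfidfVectorizer's defaults, but tokenization is simplified to a whitespace split, so results may differ from scikit-learn on real text. The function names tfidf_vectors and cosine are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return L2-normalized TF-IDF vectors (as dicts) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: rare terms get larger weights than common ones.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        # Fall back to 1.0 so an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def cosine(u, v):
    """For unit-length sparse vectors, cosine similarity is their dot product."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


tags = [
    "action space alien marine",
    "action space robot war",
    "romance paris love",
]
vectors = tfidf_vectors(tags)
sim_01 = cosine(vectors[0], vectors[1])
sim_02 = cosine(vectors[0], vectors[2])
print(round(sim_01, 3), round(sim_02, 3))
```

The first two tag strings share "action" and "space", so their similarity is above zero; the third shares no terms with the first, so that pair scores exactly 0.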
.
├── app.py
├── recommender.py
├── processed_movie_data.csv
├── client
│   ├── index.html
│   ├── styles.css
│   └── script.js
└── requirements.txt
- Clone the Repository: Clone the repository to your local machine.
    git clone <repository-url>
    cd <repository-directory>
- Install Dependencies: Install required Python packages.
    pip install -r requirements.txt
- Download the Data: Download the tmdb_5000_movies.csv and tmdb_5000_credits.csv files from this Kaggle dataset and place them in the project directory.
- Preprocess Data: Run the preprocessing script to generate processed_movie_data.csv.
    python preprocessor.py
- Run the Flask App: Start the Flask server.
    python app.py
- Access the Application: Open your browser and go to http://localhost:6543.