This project analyzes movie data using Apache Spark on Google Cloud Dataproc with Hadoop. It includes various tasks such as calculating aggregated ratings, identifying movie characteristics, and exploring user behavior.
- Google Cloud Platform (GCP) account
- Google Cloud Dataproc
- Google Cloud Storage (GCS) bucket
- JupyterLab on Google Cloud Dataproc (optional)
- Python (3.x recommended)
- Apache Spark
- PySpark
- Hadoop
- YARN
- Set up your Google Cloud environment with Dataproc and create a GCS bucket to store data and output files.
- Configure your Dataproc cluster with the necessary components, including Hadoop and YARN.
- Upload the provided datasets (`movies.csv`, `ratings.csv`, `tags.csv`) to your GCS bucket.
- Launch a PySpark notebook in JupyterLab on Google Cloud Dataproc.
- Copy the content of the `movie_data_analysis.py` script into your PySpark notebook.
- Ensure that your PySpark notebook is connected to your Dataproc cluster.
- Run the Spark jobs provided in the notebook to solve the problem statements.
- The output of each task will be stored as a CSV file in your GCS bucket.
- datasets/: Contains the input data files (`movies.csv`, `ratings.csv`, `tags.csv`).
- movie_data_analysis.py: Contains the PySpark code for the data analysis tasks.
- output/: Stores the output CSV files generated by the Spark jobs.
- Aggregated number of ratings per year
- Average monthly number of ratings
- Rating levels distribution
- Movies that are tagged but not rated
- Movies that have a rating but no tag
- Top 10 rated but untagged movies, ranked by average rating and number of ratings
- Average number of tags per movie and per user
- Users that tagged movies without rating them
- Average number of ratings per user and per movie
- Predominant genre per rating level
- Predominant tag per genre and most tagged genres
- Most predominant (popularity-based) movies
- Top 10 movies by average rating (among movies rated by more than 30 users)
This project is licensed under the MIT License.