This project analyzes movie data using Apache Spark on Google Cloud Dataproc with Hadoop. It includes various tasks such as calculating aggregated ratings, identifying movie characteristics, and exploring user behavior.
- Google Cloud Platform (GCP) account
- Google Cloud Dataproc
- Google Cloud Storage (GCS) bucket
- JupyterLab on Google Cloud Dataproc (optional)
- Python (3.x recommended)
- Apache Spark
- PySpark
- Hadoop
- YARN
- Set up your Google Cloud environment with Dataproc and create a GCS bucket to store data and output files.
- Configure your Dataproc cluster with the necessary components, including Hadoop and YARN.
- Upload the provided datasets (`movies.csv`, `ratings.csv`, `tags.csv`) to your GCS bucket.
- Launch a PySpark notebook in JupyterLab on Google Cloud Dataproc.
- Copy the content of the `movie_data_analysis.py` script into your PySpark notebook.
- Ensure that your PySpark notebook is connected to your Dataproc cluster.
- Run the Spark jobs provided in the notebook to solve the problem statements.
- The output of each task will be stored as a CSV file in your GCS bucket.
- datasets/: Contains the input data files (`movies.csv`, `ratings.csv`, `tags.csv`).
- movie_data_analysis.py: Contains the PySpark code for the data analysis tasks.
- output/: Stores the output CSV files generated by the Spark jobs.
- Aggregated number of ratings per year
- Average monthly number of ratings
- Rating levels distribution
- Movies that are tagged but not rated
- Movies that have a rating but no tag
- Top 10 rated but untagged movies, ranked by average rating and number of ratings
- Average number of tags per movie and per user
- Users that tagged movies without rating them
- Average number of ratings per user and per movie
- Predominant genre per rating level
- Predominant tag per genre and most tagged genres
- Most predominant (popularity-based) movies
- Top 10 movies by average rating (among movies rated by more than 30 users)
This project is licensed under the MIT License.