Aakaaaassh / PySpark-GCP-Movies-Data-Analysis

Repository from GitHub: https://github.com/Aakaaaassh/PySpark-GCP-Movies-Data-Analysis

Movie Data Analysis with PySpark on Google Cloud Dataproc with Hadoop

This project analyzes movie data (movies, ratings, and tags) using Apache Spark on Google Cloud Dataproc with Hadoop. It covers the 13 tasks listed under Problem Statements, such as calculating aggregated ratings, identifying movie characteristics, and exploring user behavior.

Requirements

  • Google Cloud Platform (GCP) account
  • Google Cloud Dataproc
  • Google Cloud Storage (GCS) bucket
  • JupyterLab on Google Cloud Dataproc (optional)
  • Python (3.x recommended)
  • Apache Spark
  • PySpark
  • Hadoop
  • YARN

Installation

  1. Set up your Google Cloud environment with Dataproc and create a GCS bucket to store data and output files.
  2. Configure your Dataproc cluster with the necessary components, including Hadoop and YARN.
  3. Upload the provided datasets (movies.csv, ratings.csv, tags.csv) to your GCS bucket.
  4. Launch a PySpark notebook on Google Cloud Dataproc JupyterLab.
  5. Copy the content of the movie_data_analysis.py script into your PySpark notebook.
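
Once the datasets are in the bucket, loading them from the notebook could look like the sketch below. It is not taken from the repository: the bucket name `my-bucket`, the `datasets/` prefix inside the bucket, and the helper names `dataset_paths` and `load_datasets` are assumptions made for illustration. It relies on the `spark` session that a Dataproc PySpark notebook normally provides.

```python
# Hypothetical helpers: the bucket layout is an assumption, not the repository's.
DATASET_FILES = ("movies.csv", "ratings.csv", "tags.csv")


def dataset_paths(bucket, prefix="datasets"):
    """Map each dataset name (file name without extension) to its gs:// URI."""
    return {
        name.rsplit(".", 1)[0]: f"gs://{bucket}/{prefix}/{name}"
        for name in DATASET_FILES
    }


def load_datasets(spark, bucket, prefix="datasets"):
    """Read the three CSV files into DataFrames keyed by dataset name.

    `spark` is an existing SparkSession; the files are assumed to have a header row.
    """
    return {
        name: spark.read.csv(path, header=True, inferSchema=True)
        for name, path in dataset_paths(bucket, prefix).items()
    }
```

In a notebook cell this would be called as `dfs = load_datasets(spark, "my-bucket")`, after which `dfs["ratings"]` holds the ratings DataFrame.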

Usage

  1. Ensure that your PySpark notebook is connected to your Dataproc cluster.
  2. Run the Spark jobs provided in the notebook to solve the problem statements.
  3. The output of each task will be stored as a CSV file in your GCS bucket.
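
Writing a task's result to the bucket might look like the following sketch. The helper names, the `output/` prefix in the bucket, and the task label are assumptions for illustration. Spark writes a directory of part files at the given path rather than a single CSV file, so `coalesce(1)` is used here to produce one part file, which is only reasonable for small results.

```python
def output_path(bucket, task_name, prefix="output"):
    """Build the gs:// directory for one task's output (hypothetical layout)."""
    return f"gs://{bucket}/{prefix}/{task_name}"


def save_task_output(df, bucket, task_name):
    """Write a result DataFrame as CSV with a header row and return the path used."""
    path = output_path(bucket, task_name)
    (
        df.coalesce(1)            # single part file; assumes the result is small
        .write.mode("overwrite")  # replace output from a previous run
        .option("header", True)
        .csv(path)
    )
    return path
```

For example, `save_task_output(result_df, "my-bucket", "ratings_per_year")` would write under `gs://my-bucket/output/ratings_per_year/`.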

Project Structure

  • datasets/: Contains the input data files (movies.csv, ratings.csv, tags.csv).
  • movie_data_analysis.py: Contains the PySpark code for data analysis tasks.
  • output/: Stores the output CSV files generated by the Spark jobs.

Problem Statements

  1. Aggregated number of ratings per year
  2. Average monthly number of ratings
  3. Rating levels distribution
  4. Movies that are tagged but not rated
  5. Movies that have a rating but no tag
  6. Top 10 rated but untagged movies, ranked by average rating and number of ratings
  7. Average number of tags per movie and per user
  8. Users that tagged movies without rating them
  9. Average number of ratings per user and per movie
  10. Predominant genre per rating level
  11. Predominant tag per genre and most tagged genres
  12. Most predominant (popularity-based) movies
  13. Top 10 movies in terms of average rating (provided more than 30 users reviewed them)
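
As an illustration, statements 1 and 13 could be approached as sketched below. This is not the repository's code. It assumes MovieLens-style columns, which the file names suggest but this README does not confirm: `ratings` with `userId`, `movieId`, `rating`, and a `timestamp` in Unix seconds, and `movies` with `movieId` and `title`. The function names are hypothetical.

```python
def ratings_per_year(ratings):
    """Statement 1: number of ratings per calendar year.

    The year is derived in the Spark session time zone.
    """
    return (
        ratings
        .selectExpr("year(from_unixtime(`timestamp`)) AS rating_year")
        .groupBy("rating_year")
        .count()
        .withColumnRenamed("count", "num_ratings")
        .orderBy("rating_year")
    )


def top_rated_movies(ratings, movies, min_users=30, limit=10):
    """Statement 13: top movies by average rating, rated by more than `min_users` users."""
    # Imported here so the module still loads where PySpark is not installed.
    from pyspark.sql import functions as F

    stats = ratings.groupBy("movieId").agg(
        F.avg("rating").alias("avg_rating"),
        F.countDistinct("userId").alias("num_users"),
    )
    return (
        stats.filter(F.col("num_users") > min_users)
        .join(movies.select("movieId", "title"), on="movieId")
        .orderBy(F.desc("avg_rating"), F.desc("num_users"))
        .limit(limit)
    )
```

Both functions take and return DataFrames, so each result can be inspected with `.show()` or written to the bucket as CSV.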

License

This project is licensed under the MIT License.
