Movie-Dataset-Analysis-using-PySpark

Project Overview

This project performs exploratory data analysis (EDA) and visualization of a movie dataset using PySpark. It works through the questions listed below, which cover the structure of the dataset, its ratings, and the categories assigned to each movie.

Questions Addressed:

Column Composition:
- What columns are present in the loaded datasets?
Number of Movies:
- How many movies are included in the provided dataset?
Number of Users:
- How many users have provided ratings in the dataset?
Missing Data:
- Are there any missing values in the dataset?
Movies without Ratings:
- How many movies lack ratings, and which ones are they?
Best-Rated Movie:
- Which movie has the highest average rating? In case of ties, consider the one with the most votes.
Percentage of Top-Rated Movies:
- What percentage of movies received only the maximum rating?
Movie with Highest Minimum Rating:
- Which movie has the highest minimum rating? In case of ties, consider the one with the most votes.
Distribution of Ratings:
- What is the distribution of ratings?
Documentary Films:
- How many movies are classified as 'documentary'?
Best-Rated Documentary with 10+ Votes:
- Which documentary movie with at least 10 votes has the highest average rating?
Yearly Movie Count Differences:
- What are the differences in the number of movies each year? Assume the timestamp represents seconds since 1960.
Average Categories per Movie:
- What is the average number of categories assigned to a movie? Which movie has the most categories, and what are they?

Feel free to explore the code, adapt it to other datasets, and enhance the analysis as needed. Happy exploring!

GNU General Public License v3.0
