Welcome to the Movie Insights Project! In this analysis, we will explore a dataset of movies to gain insights into various aspects such as popular genres, movie properties associated with high revenues, average ratings for each director, and the top-rated movie titles and their directors. By examining these factors, we aim to provide valuable insights into the movie industry and inform decision-making processes in the production and distribution of movies.
To perform the analysis, we will utilize several libraries for data manipulation, analysis, and visualization.
We will start by importing these libraries and loading the movie dataset (tmdbmovies.csv
). We will then explore the data to understand its structure, contents, and any potential data quality issues.
-
Handled missing values and outliers.
-
Cleaned and transformed data for analysis.
-
Check for duplicate data
-
Getting overall statistics about the dataframe
To determine the most popular genres from year to year, we will analyze the frequency of each genre in the dataset. By grouping the data by release year and genre and counting the occurrences of each genre in each year, we can identify the genres with the highest occurrence in each year. These genres represent the most popular ones during those periods.
To identify the properties associated with movies that have high revenues, we will analyze the relationship between revenue and various movie properties. Using statistical methods we can determine which properties have a significant impact on revenue. Some key properties to consider include budget, popularity, runtime, vote average, and vote count.
To determine the average rating for each director, we will group the movie data by director and calculate the mean rating for each group. This will allow us to assess the performance of directors based on the average rating of their movies. By identifying directors with consistently high ratings, we can gain insights into the quality of their work.
To find the top 10 highest-rated movie titles and their directors, we will sort the movie data by rating in descending order. This will enable us to identify the movies with the highest ratings and their corresponding directors. By analyzing these top-rated movies, we can uncover patterns and trends associated with critically acclaimed films.
Through this analysis, I aim to provide valuable insights into the movie industry by exploring popular genres, movie properties associated with high revenues, average ratings for each director, and the top-rated movie titles and their directors.
-
We perform linear regression on the movie database based on various features.
-
After training the model, we calculate the mean squared error (MSE) and R-squared to evaluate its performance.
- Pandas
- NumPy
- Matplotlib
- Seaborn
- Scikit-learn
- Install the required dependencies using
pip install
. - Run on either Pycharm or Anaconda Jupyter notebook.
- Hiro Worku Gomoro
- Id No: DBUR/1400/12