Project: Investigating TMDB Movie Dataset

Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions

Introduction

Overview

Using the TMDB movie dataset, which contains information on about 10,000 movies, including user ratings and revenue, we investigate the data to answer a few questions and draw some conclusions.

Questions

- Which movies made the maximum and minimum profits?
- Which director directed the most movies?
- Which year had the highest profit?
- Which genres are most common?
- How has profit changed over the years?

# Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df = pd.read_csv('tmdb-movies.csv')

Data Wrangling

General Properties

# The first row of df
df.head(1)

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Twenty-two years after the events of Jurassic ...	124	Action|Adventure|Science Fiction|Thriller	Universal Studios|Amblin Entertainment|Legenda...	6/9/15	5562	6.5	2015	1.379999e+08	1.392446e+09

1 rows × 21 columns

df.describe()

	id	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	10866.000000	10866.000000	1.086600e+04	1.086600e+04	10866.000000	10866.000000	10866.000000	10866.000000	1.086600e+04	1.086600e+04
mean	66064.177434	0.646441	1.462570e+07	3.982332e+07	102.070863	217.389748	5.974922	2001.322658	1.755104e+07	5.136436e+07
std	92130.136561	1.000185	3.091321e+07	1.170035e+08	31.381405	575.619058	0.935142	12.812941	3.430616e+07	1.446325e+08
min	5.000000	0.000065	0.000000e+00	0.000000e+00	0.000000	10.000000	1.500000	1960.000000	0.000000e+00	0.000000e+00
25%	10596.250000	0.207583	0.000000e+00	0.000000e+00	90.000000	17.000000	5.400000	1995.000000	0.000000e+00	0.000000e+00
50%	20669.000000	0.383856	0.000000e+00	0.000000e+00	99.000000	38.000000	6.000000	2006.000000	0.000000e+00	0.000000e+00
75%	75610.000000	0.713817	1.500000e+07	2.400000e+07	111.000000	145.750000	6.600000	2011.000000	2.085325e+07	3.369710e+07
max	417859.000000	32.985763	4.250000e+08	2.781506e+09	900.000000	9767.000000	9.200000	2015.000000	4.250000e+08	2.827124e+09

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

Data Cleaning

Drop Extraneous Columns

extraneous_columns = ['id', 'imdb_id', 'homepage', 'tagline', 'keywords', 'overview', 'budget_adj',
       'revenue_adj', 'vote_count', 'vote_average', 'production_companies', 'cast']
df.drop(extraneous_columns, axis=1, inplace=True)

df.head(1)

	popularity	budget	revenue	original_title	director	runtime	genres	release_date	release_year
0	32.985763	150000000	1513528810	Jurassic World	Colin Trevorrow	124	Action|Adventure|Science Fiction|Thriller	6/9/15	2015

# Dataframe for dates
dates = df[['release_year', 'release_date']].copy()
dates.head(1)

	release_year	release_date
0	2015	6/9/15

Date Format

# Extract month, day and two-digit year from the release date
dates[['month','day','bad_year']] = dates.release_date.str.split("/",expand=True) 
dates.head(1)

	release_year	release_date	month	day	bad_year
0	2015	6/9/15	6	9	15

dates.dtypes

release_year     int64
release_date    object
month           object
day             object
bad_year        object
dtype: object

dates['release_year'] = dates['release_year'].astype(str)
dates.dtypes

release_year    object
release_date    object
month           object
day             object
bad_year        object
dtype: object

dates['date'] = dates['release_year'] + '-' + dates['month'] + '-' + dates['day']
dates['date'] = pd.to_datetime(dates['date'])
dates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   release_year  10866 non-null  object        
 1   release_date  10866 non-null  object        
 2   month         10866 non-null  object        
 3   day           10866 non-null  object        
 4   bad_year      10866 non-null  object        
 5   date          10866 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 509.5+ KB

df['release_date'] = dates['date']

df.head(1)

	popularity	budget	revenue	original_title	director	runtime	genres	release_date	release_year
0	32.985763	150000000	1513528810	Jurassic World	Colin Trevorrow	124	Action|Adventure|Science Fiction|Thriller	2015-06-09	2015

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   popularity      10866 non-null  float64       
 1   budget          10866 non-null  int64         
 2   revenue         10866 non-null  int64         
 3   original_title  10866 non-null  object        
 4   director        10822 non-null  object        
 5   runtime         10866 non-null  int64         
 6   genres          10843 non-null  object        
 7   release_date    10866 non-null  datetime64[ns]
 8   release_year    10866 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(4), object(3)
memory usage: 764.1+ KB
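
A likely reason for taking the year from release_year rather than from the date string is that two-digit years are ambiguous. The sketch below uses two made-up rows (not real records) to show how `%y` would place a 1960 release in 2060, and how combining release_year with the parsed month and day avoids that. The names `sample`, `naive_year`, `parts` and `fixed` are illustrative only.

```python
from datetime import datetime
import pandas as pd

# Hypothetical sample rows in the dataset's m/d/yy layout (not real records)
sample = pd.DataFrame({
    'release_year': [2015, 1960],
    'release_date': ['6/9/15', '12/25/60'],
})

# %y maps 00-68 to 2000-2068, so a 1960 release would land in 2060
naive_year = datetime.strptime('12/25/60', '%m/%d/%y').year

# Take month and day from the string, and the year from release_year
parts = sample['release_date'].str.split('/', expand=True)
fixed = pd.to_datetime(pd.DataFrame({
    'year': sample['release_year'],
    'month': parts[0].astype(int),
    'day': parts[1].astype(int),
}))
print(naive_year)
print(fixed)
```

Passing a frame with year, month and day columns to pd.to_datetime avoids building an intermediate string, but the string concatenation used in the notebook gives the same dates.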

Dropping Null Values

df.dropna(inplace=True)
df.isnull().sum()

popularity        0
budget            0
revenue           0
original_title    0
director          0
runtime           0
genres            0
release_date      0
release_year      0
dtype: int64

df.head(1)

	popularity	budget	revenue	original_title	director	runtime	genres	release_date	release_year
0	32.985763	150000000	1513528810	Jurassic World	Colin Trevorrow	124	Action|Adventure|Science Fiction|Thriller	2015-06-09	2015

Exploratory Data Analysis

Which movies made maximum and minimum profits?

# Maximum Profit
df['profit'] = df['revenue'] - df['budget']
df_max = df[df['profit'] == df['profit'].max()]
print(df_max)

      popularity     budget     revenue original_title       director  \
1386    9.432768  237000000  2781505847         Avatar  James Cameron   

      runtime                                    genres release_date  \
1386      162  Action|Adventure|Fantasy|Science Fiction   2009-12-10   

      release_year      profit  
1386          2009  2544505847

# Minimum Profit
df_min = df[df['profit'] == df['profit'].min()]
print(df_min)

      popularity     budget   revenue     original_title    director  runtime  \
2244     0.25054  425000000  11087569  The Warrior's Way  Sngmoo Lee      100   

                                         genres release_date  release_year  \
2244  Adventure|Fantasy|Action|Western|Thriller   2010-12-02          2010   

         profit  
2244 -413912431

The highest profit was made by "Avatar" (2,544,505,847), and the lowest by "The Warrior's Way" (a loss of 413,912,431).
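
As an alternative to filtering on the maximum and minimum, the same rows can be looked up by index label. This is a minimal sketch on a made-up three-row frame; `movies`, `best` and `worst` are hypothetical names and the numbers are not from the dataset.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset
movies = pd.DataFrame({
    'original_title': ['A', 'B', 'C'],
    'budget': [100, 500, 50],
    'revenue': [900, 20, 60],
})
movies['profit'] = movies['revenue'] - movies['budget']

# idxmax/idxmin return the index label of the first extreme value
best = movies.loc[movies['profit'].idxmax()]
worst = movies.loc[movies['profit'].idxmin()]
print(best['original_title'], best['profit'])
print(worst['original_title'], worst['profit'])
```

One difference worth knowing: idxmax and idxmin return only the first match, while the boolean filter used above returns every row that ties for the extreme value.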

Which director directed the most movies?

df.director.mode()

0    Woody Allen
dtype: object

Woody Allen directed the most movies in the dataset.
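
mode() gives the name but not the count. A small sketch with value_counts, on made-up director names, shows how to get both; `directors`, `counts`, `top_director` and `top_count` are hypothetical names.

```python
import pandas as pd

# Hypothetical director names, not taken from the dataset
directors = pd.Series(['Dir X', 'Dir Y', 'Dir X', 'Dir Z', 'Dir X', 'Dir Y'])

# value_counts sorts by frequency, most frequent first
counts = directors.value_counts()
top_director = counts.index[0]
top_count = counts.iloc[0]
print(top_director, top_count)
```

On the real data, df['director'].value_counts().head() would presumably list the runners-up as well, which makes it easier to judge how large the lead is.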

Which year had the highest profit?

# Select the numeric column before averaging so text columns are never aggregated
df.groupby('release_year')['profit'].mean().plot(kind='line', figsize=(10, 10), color='orange', legend=True)
plt.ylabel('profit')
plt.title('profits Vs release year')

Text(0.5, 1.0, 'profits Vs release year')

df.groupby('release_year')['profit'].mean().sort_values(ascending=False)

release_year
1995    3.615205e+07
1977    3.542111e+07
1992    3.486006e+07
2002    3.289090e+07
2001    3.223294e+07
2003    3.166685e+07
2004    3.134685e+07
1997    3.091145e+07
2015    3.071459e+07
1990    3.049428e+07
1989    3.003873e+07
1993    2.924024e+07
2012    2.821746e+07
2011    2.723087e+07
2007    2.707149e+07
1994    2.644686e+07
2010    2.619791e+07
2009    2.573738e+07
2005    2.527149e+07
1979    2.508738e+07
1999    2.495749e+07
1982    2.494628e+07
1991    2.436366e+07
2008    2.385986e+07
1998    2.377864e+07
2013    2.373097e+07
2014    2.364166e+07
2000    2.312390e+07
1996    2.278054e+07
1983    2.235527e+07
1987    2.202119e+07
2006    2.198420e+07
1973    2.106891e+07
1975    2.048207e+07
1985    1.969492e+07
1988    1.954308e+07
1986    1.899376e+07
1984    1.815536e+07
1980    1.802772e+07
1978    1.785819e+07
1981    1.708352e+07
1967    1.633801e+07
1974    1.599065e+07
1976    1.444374e+07
1972    1.146127e+07
1965    1.108219e+07
1970    1.083150e+07
1961    9.405909e+06
1964    7.178539e+06
1969    6.510580e+06
1971    5.980247e+06
1962    5.026804e+06
1968    4.943435e+06
1960    3.842127e+06
1963    3.355103e+06
1966    5.909106e+05
Name: profit, dtype: float64

The highest average profit per movie was in 1995.
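
The ranking above uses the mean profit per movie, which is not the same as the total profit of a year. The sketch below, on made-up numbers, shows how the two can disagree when one year has more releases; `toy`, `per_year`, `best_by_mean` and `best_by_total` are hypothetical names.

```python
import pandas as pd

# Hypothetical profits: three releases in 2000, one in 2001
toy = pd.DataFrame({
    'release_year': [2000, 2000, 2000, 2001],
    'profit': [10, 20, 30, 50],
})

# Mean, total and number of movies per year in one pass
per_year = toy.groupby('release_year')['profit'].agg(['mean', 'sum', 'count'])
best_by_mean = per_year['mean'].idxmax()
best_by_total = per_year['sum'].idxmax()
print(per_year)
print(best_by_mean, best_by_total)
```

Which of the two answers the question depends on whether "most profit" means the typical movie or the whole year.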

Which genres are most common?

def split_compound_columns(column):
    """Split a column of df whose values look like 'a|b|c'.

    Argument: name of the column whose values are separated by '|'
    Returns: a Series with one separated value per row:
    a
    b
    c
    """

    # Join every row into one long '|'-separated string, then split it
    joined = df[column].str.cat(sep='|')
    return pd.Series(joined.split('|'))

genres = split_compound_columns('genres')
genres.value_counts().plot.pie(figsize=(20, 20), legend=True, autopct='%.1f%%')
plt.title('Movies Genres')

Text(0.5, 1.0, 'Movies Genres')

->> The most common movie genres are drama, comedy, thriller and action.
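
The same counts can be produced without a helper function by splitting each row into a list and exploding it. This is a sketch on three made-up rows; `toy_genres` and `genre_counts` are hypothetical names.

```python
import pandas as pd

# Hypothetical genre strings in the dataset's 'a|b|c' layout
toy_genres = pd.Series(['Action|Drama', 'Drama', 'Comedy|Drama|Action'])

# Split each row into a list, give every list item its own row, then count
genre_counts = toy_genres.str.split('|').explode().value_counts()
print(genre_counts)
```

Because explode keeps the original row index, this form also makes it possible to relate each genre back to its movie, which the concatenate-and-split approach does not.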

# Plot a histogram of movie runtimes

sns.set_style('darkgrid')

plt.figure(figsize=(10,7), dpi = 100)

plt.xlabel('Runtime (minutes)')
plt.ylabel('Number of Movies')
plt.title('Distribution of Movie Runtimes')

plt.hist(df['runtime'], rwidth = 1, bins =30)
plt.show()

Most movies run between roughly 90 and 120 minutes.
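
A reading taken from a histogram depends on the bin width, so it can help to check it against quantiles. The sketch below uses made-up runtimes; `runtimes`, `q1`, `median`, `q3` and `share_in_iqr` are hypothetical names.

```python
import pandas as pd

# Hypothetical runtimes in minutes, including one long outlier
runtimes = pd.Series([80, 90, 95, 100, 105, 110, 120, 200])

# Quartiles describe the typical range without depending on bin edges
q1, median, q3 = runtimes.quantile([0.25, 0.5, 0.75])

# Share of movies whose runtime falls inside the interquartile range
share_in_iqr = runtimes.between(q1, q3).mean()
print(q1, median, q3, share_in_iqr)
```

On the real data the same call would be df['runtime'].quantile([0.25, 0.5, 0.75]).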

plt.figure(figsize=(10,7), dpi = 100)

# Pass the column as a keyword argument, as newer seaborn versions require
sns.boxplot(x=df['runtime'])

plt.show()

How has profit changed over the years?

df.plot(x='release_year', y='profit', kind='scatter', color='orange', figsize=(10, 10))
plt.title('Relation between release year and profit')

Text(0.5, 1.0, 'Relation between release year and profit')

->> The correlation between release year and profit is positive but very weak (about 0.03 in the correlation matrix below).

# Restrict the correlation to numeric columns (text and date columns are skipped)
df.corr(numeric_only=True)

	popularity	budget	revenue	runtime	release_year	profit
popularity	1.000000	0.544858	0.663094	0.140527	0.091347	0.628833
budget	0.544858	1.000000	0.734685	0.193883	0.117470	0.569941
revenue	0.663094	0.734685	1.000000	0.165239	0.058068	0.976165
runtime	0.140527	0.193883	0.165239	1.000000	-0.117172	0.138113
release_year	0.091347	0.117470	0.058068	-0.117172	1.000000	0.032752
profit	0.628833	0.569941	0.976165	0.138113	0.032752	1.000000
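
To read a single coefficient from this table, it helps to square it: r squared is the share of variance in one variable that a linear fit on the other would explain. For release_year and profit, 0.032752 squared is about 0.001. The sketch below computes r for one pair directly on made-up numbers; `toy`, `r` and `r_squared` are hypothetical names.

```python
import pandas as pd

# Hypothetical yearly profits, not taken from the dataset
toy = pd.DataFrame({
    'release_year': [2000, 2001, 2002, 2003],
    'profit': [10, 30, 20, 40],
})

# Pearson correlation between two columns, without building the full matrix
r = toy['release_year'].corr(toy['profit'])
r_squared = r ** 2
print(round(r, 3), round(r_squared, 3))
```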

fig, ax = plt.subplots(figsize=(15,10))         
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt="f",ax=ax)
plt.title('correlation matrix for DataFrame')

Text(0.5, 1.0, 'correlation matrix for DataFrame')

Conclusions

Results

1. A high budget does not always lead to a high profit.
2. The most common genres are drama, comedy, thriller and action.
3. The least common genres are TV movie, western, foreign and war.
4. Popular actors and a strong director may help a movie succeed, but the cast column was dropped during cleaning, so this analysis does not test that.

Limitations

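1. Some profits are negative, and at least half of the budget and revenue values are 0 (see the medians in df.describe()), which probably indicates missing data.
2. The raw data is inconsistent, so it has to be cleaned before any analysis.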
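
One possible way to handle the zero budgets and revenues seen in df.describe() is to treat them as missing before computing profit. This is only a sketch under the assumption that 0 means "unknown" rather than a real value, which the dataset does not confirm; `toy`, `clean` and `known` are hypothetical names and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second has no budget, the third has no revenue
toy = pd.DataFrame({
    'budget': [100, 0, 50],
    'revenue': [300, 40, 0],
})

# Assumption: a value of 0 marks missing data, so replace it with NaN
clean = toy[['budget', 'revenue']].replace(0, np.nan)
clean['profit'] = clean['revenue'] - clean['budget']

# Keep only the rows where both inputs were known
known = clean.dropna(subset=['profit'])
print(known)
```

Applied to the real data, this would shrink the sample considerably, so the results above would need to be recomputed rather than assumed to hold.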

AhmedElmehalawi / Investigating-TMDB-Movie-Dataset