This analysis uses the TMDB movie dataset, which contains information on about 10,000 movies, including user ratings and revenue. We investigate the dataset to answer the following questions and draw some conclusions:
- Which movies made the maximum and minimum profit?
- Which director directed the most movies?
- In which year was profit highest?
- Which genres are most common?
- How does profit relate to release year?
# Dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('tmdb-movies.csv')
# The first row of df
df.head(1)
| | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action\|Adventure\|Science Fiction\|Thriller | Universal Studios\|Amblin Entertainment\|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
1 rows × 21 columns
df.describe()
| | id | popularity | budget | revenue | runtime | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10866.000000 | 10866.000000 | 1.086600e+04 | 1.086600e+04 | 10866.000000 | 10866.000000 | 10866.000000 | 10866.000000 | 1.086600e+04 | 1.086600e+04 |
| mean | 66064.177434 | 0.646441 | 1.462570e+07 | 3.982332e+07 | 102.070863 | 217.389748 | 5.974922 | 2001.322658 | 1.755104e+07 | 5.136436e+07 |
| std | 92130.136561 | 1.000185 | 3.091321e+07 | 1.170035e+08 | 31.381405 | 575.619058 | 0.935142 | 12.812941 | 3.430616e+07 | 1.446325e+08 |
| min | 5.000000 | 0.000065 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 10.000000 | 1.500000 | 1960.000000 | 0.000000e+00 | 0.000000e+00 |
| 25% | 10596.250000 | 0.207583 | 0.000000e+00 | 0.000000e+00 | 90.000000 | 17.000000 | 5.400000 | 1995.000000 | 0.000000e+00 | 0.000000e+00 |
| 50% | 20669.000000 | 0.383856 | 0.000000e+00 | 0.000000e+00 | 99.000000 | 38.000000 | 6.000000 | 2006.000000 | 0.000000e+00 | 0.000000e+00 |
| 75% | 75610.000000 | 0.713817 | 1.500000e+07 | 2.400000e+07 | 111.000000 | 145.750000 | 6.600000 | 2011.000000 | 2.085325e+07 | 3.369710e+07 |
| max | 417859.000000 | 32.985763 | 4.250000e+08 | 2.781506e+09 | 900.000000 | 9767.000000 | 9.200000 | 2015.000000 | 4.250000e+08 | 2.827124e+09 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 10866 non-null int64
1 imdb_id 10856 non-null object
2 popularity 10866 non-null float64
3 budget 10866 non-null int64
4 revenue 10866 non-null int64
5 original_title 10866 non-null object
6 cast 10790 non-null object
7 homepage 2936 non-null object
8 director 10822 non-null object
9 tagline 8042 non-null object
10 keywords 9373 non-null object
11 overview 10862 non-null object
12 runtime 10866 non-null int64
13 genres 10843 non-null object
14 production_companies 9836 non-null object
15 release_date 10866 non-null object
16 vote_count 10866 non-null int64
17 vote_average 10866 non-null float64
18 release_year 10866 non-null int64
19 budget_adj 10866 non-null float64
20 revenue_adj 10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
extraneous_columns = ['id', 'imdb_id', 'homepage', 'tagline', 'keywords', 'overview', 'budget_adj',
'revenue_adj', 'vote_count', 'vote_average', 'production_companies', 'cast']
df.drop(extraneous_columns, axis=1, inplace=True)
df.head(1)
| | popularity | budget | revenue | original_title | director | runtime | genres | release_date | release_year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Colin Trevorrow | 124 | Action\|Adventure\|Science Fiction\|Thriller | 6/9/15 | 2015 |
# Dataframe for dates
dates = df[['release_year', 'release_date']].copy()
dates.head(1)
| | release_year | release_date |
|---|---|---|
| 0 | 2015 | 6/9/15 |
# Extract month and day from the release date
dates[['month','day','bad_year']] = dates.release_date.str.split("/",expand=True)
dates.head(1)
| | release_year | release_date | month | day | bad_year |
|---|---|---|---|---|---|
| 0 | 2015 | 6/9/15 | 6 | 9 | 15 |
dates.dtypes
release_year int64
release_date object
month object
day object
bad_year object
dtype: object
dates['release_year'] = dates['release_year'].astype(str)
dates.dtypes
release_year object
release_date object
month object
day object
bad_year object
dtype: object
dates['date'] = dates['release_year'] + '-' + dates['month'] + '-' + dates['day']
dates['date'] = pd.to_datetime(dates['date'])
dates.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 release_year 10866 non-null object
1 release_date 10866 non-null object
2 month 10866 non-null object
3 day 10866 non-null object
4 bad_year 10866 non-null object
5 date 10866 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 509.5+ KB
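The notebook does not say why the date is rebuilt from `release_year` instead of parsing `release_date` directly. One plausible reason is that a two-digit year is ambiguous. The sketch below illustrates that on two made-up rows; the `sample` frame and its values are assumptions for illustration, not rows taken from the file.

```python
import pandas as pd

# Hypothetical rows in the same shape as the dataset's two date columns.
sample = pd.DataFrame({
    "release_year": [2015, 1966],
    "release_date": ["6/9/15", "6/21/66"],
})

# Parsing the two-digit year directly: %y maps 00-68 to 2000-2068,
# so a 1966 release is read as 2066.
naive = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# Combining month and day with the four-digit release_year avoids the guess.
parts = sample["release_date"].str.split("/", expand=True)
fixed = pd.to_datetime(pd.DataFrame({
    "year": sample["release_year"],
    "month": parts[0].astype(int),
    "day": parts[1].astype(int),
}))

print(naive.dt.year.tolist())  # [2015, 2066]
print(fixed.dt.year.tolist())  # [2015, 1966]
```

Passing a frame with `year`, `month` and `day` columns to `pd.to_datetime` gives the same result as concatenating strings, without the intermediate `astype(str)` step.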
df['release_date'] = dates['date']
df.head(1)
| | popularity | budget | revenue | original_title | director | runtime | genres | release_date | release_year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Colin Trevorrow | 124 | Action\|Adventure\|Science Fiction\|Thriller | 2015-06-09 | 2015 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 popularity 10866 non-null float64
1 budget 10866 non-null int64
2 revenue 10866 non-null int64
3 original_title 10866 non-null object
4 director 10822 non-null object
5 runtime 10866 non-null int64
6 genres 10843 non-null object
7 release_date 10866 non-null datetime64[ns]
8 release_year 10866 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(4), object(3)
memory usage: 764.1+ KB
df.dropna(inplace=True)
df.isnull().sum()
popularity 0
budget 0
revenue 0
original_title 0
director 0
runtime 0
genres 0
release_date 0
release_year 0
dtype: int64
df.head(1)
| | popularity | budget | revenue | original_title | director | runtime | genres | release_date | release_year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Colin Trevorrow | 124 | Action\|Adventure\|Science Fiction\|Thriller | 2015-06-09 | 2015 |
# Maximum Profit
df['profit'] = df['revenue'] - df['budget']
df_max = df[df['profit'] == df['profit'].max()]
print(df_max)
popularity budget revenue original_title director \
1386 9.432768 237000000 2781505847 Avatar James Cameron
runtime genres release_date \
1386 162 Action|Adventure|Fantasy|Science Fiction 2009-12-10
release_year profit
1386 2009 2544505847
# Minimum Profit
df_min = df[df['profit'] == df['profit'].min()]
print(df_min)
popularity budget revenue original_title director runtime \
2244 0.25054 425000000 11087569 The Warrior's Way Sngmoo Lee 100
genres release_date release_year \
2244 Adventure|Fantasy|Action|Western|Thriller 2010-12-02 2010
profit
2244 -413912431
The maximum profit was made by "Avatar", and the minimum profit (the largest loss) by "The Warrior's Way".
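As an alternative to filtering with a boolean mask, `idxmax` and `idxmin` return the index label of the extreme row directly. This is a minimal sketch on made-up numbers; the `movies` frame and its titles are placeholders, not values from the dataset.

```python
import pandas as pd

# Made-up rows; titles and figures are placeholders for illustration.
movies = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "budget": [100, 50, 200],
    "revenue": [400, 20, 250],
})
movies["profit"] = movies["revenue"] - movies["budget"]

# idxmax/idxmin give the index label of the first extreme value.
best = movies.loc[movies["profit"].idxmax()]
worst = movies.loc[movies["profit"].idxmin()]

print(best["original_title"], best["profit"])    # A 300
print(worst["original_title"], worst["profit"])  # B -30
```

One difference worth knowing: on ties, `idxmax` returns only the first matching row, while the boolean-mask approach returns every tied row.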
df.director.mode()
0 Woody Allen
dtype: object
The director with the most movies in the dataset is Woody Allen.
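`mode()` returns only the name. If the number of movies per director is also of interest, `value_counts()` gives both. A small sketch on a made-up Series (the names `X`, `Y`, `Z` are placeholders):

```python
import pandas as pd

# Placeholder director names, one entry per movie.
directors = pd.Series(["X", "Y", "X", "Z", "X", "Y"], name="director")

# value_counts sorts by frequency, most frequent first.
counts = directors.value_counts()
top_name = counts.index[0]
top_count = int(counts.iloc[0])

print(top_name, top_count)  # X 3
```

On the real data the equivalent call would be `df['director'].value_counts().head()`, which also shows how close the runners-up are.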
df.groupby('release_year')['profit'].mean().plot(kind='line', figsize=(10, 10), color='orange', legend=True)
plt.ylabel('profit')
plt.title('Average profit vs release year')
Text(0.5, 1.0, 'Average profit vs release year')
df.groupby('release_year')['profit'].mean().sort_values(ascending=False)
release_year
1995 3.615205e+07
1977 3.542111e+07
1992 3.486006e+07
2002 3.289090e+07
2001 3.223294e+07
2003 3.166685e+07
2004 3.134685e+07
1997 3.091145e+07
2015 3.071459e+07
1990 3.049428e+07
1989 3.003873e+07
1993 2.924024e+07
2012 2.821746e+07
2011 2.723087e+07
2007 2.707149e+07
1994 2.644686e+07
2010 2.619791e+07
2009 2.573738e+07
2005 2.527149e+07
1979 2.508738e+07
1999 2.495749e+07
1982 2.494628e+07
1991 2.436366e+07
2008 2.385986e+07
1998 2.377864e+07
2013 2.373097e+07
2014 2.364166e+07
2000 2.312390e+07
1996 2.278054e+07
1983 2.235527e+07
1987 2.202119e+07
2006 2.198420e+07
1973 2.106891e+07
1975 2.048207e+07
1985 1.969492e+07
1988 1.954308e+07
1986 1.899376e+07
1984 1.815536e+07
1980 1.802772e+07
1978 1.785819e+07
1981 1.708352e+07
1967 1.633801e+07
1974 1.599065e+07
1976 1.444374e+07
1972 1.146127e+07
1965 1.108219e+07
1970 1.083150e+07
1961 9.405909e+06
1964 7.178539e+06
1969 6.510580e+06
1971 5.980247e+06
1962 5.026804e+06
1968 4.943435e+06
1960 3.842127e+06
1963 3.355103e+06
1966 5.909106e+05
Name: profit, dtype: float64
The highest average profit per movie was for movies released in 1995.
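The ranking above is by mean profit per movie, which can differ from the ranking by total profit when years contain different numbers of movies. The sketch below shows the difference on made-up numbers; the `toy` frame is an assumption for illustration only.

```python
import pandas as pd

# Made-up data: 2000 has three modest movies, 2001 has one larger one.
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2000, 2001],
    "profit": [10, 20, 30, 40],
})

# Compute mean, total and movie count per year in one pass.
by_year = toy.groupby("release_year")["profit"].agg(["mean", "sum", "count"])

best_mean_year = by_year["mean"].idxmax()
best_total_year = by_year["sum"].idxmax()

print(by_year)
print(best_mean_year, best_total_year)  # 2001 2000
```

Reporting the count next to the mean also helps judge how much weight a year's average deserves.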
def split_compound_columns(column):
    """Split a column whose values look like 'a|b|c' into individual items.

    Argument: name of a column in df whose values are separated by '|'.
    Returns: a Series with one separated value per row:
        a
        b
        c
    """
    # Join every row into one long '|'-separated string, then split it apart.
    joined = df[column].str.cat(sep='|')
    return pd.Series(joined.split('|'))
genres = split_compound_columns('genres')
genres.value_counts().plot.pie(figsize=(20, 20), legend=True, autopct='%.1f%%')
plt.title('Movies Genres')
Text(0.5, 1.0, 'Movies Genres')
->> The most common movie genres are drama, comedy, thriller and action.
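The same per-genre counts can be obtained without joining the whole column into one string, by splitting each row into a list and expanding it with `explode`. This is a sketch of that alternative on a made-up frame (`toy` is an assumption, not the dataset):

```python
import pandas as pd

# Made-up rows using the same '|'-separated layout as the genres column.
toy = pd.DataFrame({"genres": ["Action|Drama", "Drama", "Comedy|Drama|Action"]})

# Split each row into a list, then give every list element its own row.
genre_counts = toy["genres"].str.split("|").explode().value_counts()

print(genre_counts)
```

Because `explode` keeps the original row index, this form also makes it possible to join each genre back to its movie, which the string-concatenation approach cannot do.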
# Plot a histogram of movie runtimes
sns.set_style('darkgrid')
plt.figure(figsize=(10, 7), dpi=100)
plt.xlabel('Runtime (minutes)')
plt.ylabel('Number of movies')
plt.title('Distribution of movie runtimes')
plt.hist(df['runtime'], rwidth=1, bins=30)
plt.show()
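Most movies run for roughly 90 to 120 minutes; the `df.describe()` output on the raw data gives a median runtime of 99 minutes and an interquartile range of 90 to 111 minutes.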
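A histogram with wide bins gives only a rough range. Quantiles state it more precisely, and the usual 1.5 × IQR rule flags extreme runtimes such as the 0 and 900 minute values seen in the summary statistics. The sketch below uses a made-up Series; the numbers are assumptions chosen to include one extreme value.

```python
import pandas as pd

# Made-up runtimes in minutes, including one extreme value.
runtimes = pd.Series([80, 90, 95, 100, 105, 110, 120, 900], name="runtime")

# Quartiles describe where the bulk of the values lie.
q1, median, q3 = runtimes.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are the points a boxplot draws as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = runtimes[(runtimes < lower) | (runtimes > upper)]

print(q1, median, q3)     # 93.75 102.5 112.5
print(outliers.tolist())  # [900]
```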
plt.figure(figsize=(10, 7), dpi=100)
# Pass the data as the keyword argument x, as newer seaborn versions require
sns.boxplot(x=df['runtime'])
plt.show()
df.plot(x='release_year', y='profit', kind='scatter', color='orange', figsize=(10, 10))
plt.title('Relation between release year and profit')
Text(0.5, 1.0, 'Relation between release year and profit')
->> Release year and profit are only very weakly positively correlated (r ≈ 0.03 in the correlation matrix below).
df.select_dtypes('number').corr()
| | popularity | budget | revenue | runtime | release_year | profit |
|---|---|---|---|---|---|---|
| popularity | 1.000000 | 0.544858 | 0.663094 | 0.140527 | 0.091347 | 0.628833 |
| budget | 0.544858 | 1.000000 | 0.734685 | 0.193883 | 0.117470 | 0.569941 |
| revenue | 0.663094 | 0.734685 | 1.000000 | 0.165239 | 0.058068 | 0.976165 |
| runtime | 0.140527 | 0.193883 | 0.165239 | 1.000000 | -0.117172 | 0.138113 |
| release_year | 0.091347 | 0.117470 | 0.058068 | -0.117172 | 1.000000 | 0.032752 |
| profit | 0.628833 | 0.569941 | 0.976165 | 0.138113 | 0.032752 | 1.000000 |
fig, ax = plt.subplots(figsize=(15,10))
sns.heatmap(df.select_dtypes('number').corr(), annot=True, fmt="f", ax=ax)
plt.title('correlation matrix for DataFrame')
Text(0.5, 1.0, 'correlation matrix for DataFrame')
1. A high budget does not always lead to a high profit.
2. The most common genres are drama, comedy, thriller and action.
3. The least common genres are TV movie, western, foreign and war.
4. Casting popular actors alongside a great director is likely to help a movie succeed, although the `cast` column was dropped here, so this was not tested.
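The first point can be made concrete by measuring how many high-budget movies lose money. The sketch below does this on made-up numbers; the `toy` frame and the choice of the median as the "high budget" threshold are assumptions for illustration.

```python
import pandas as pd

# Made-up budgets and revenues, not values from the dataset.
toy = pd.DataFrame({
    "budget":  [10, 20, 150, 200, 300, 425],
    "revenue": [50, 15, 400, 120, 900, 11],
})
toy["profit"] = toy["revenue"] - toy["budget"]

# Treat movies at or above the median budget as "high budget" (an arbitrary cut-off).
threshold = toy["budget"].quantile(0.5)
high_budget = toy[toy["budget"] >= threshold]

# Share of high-budget movies that made a loss.
loss_share = (high_budget["profit"] < 0).mean()

print(threshold, len(high_budget), round(loss_share, 2))  # 175.0 3 0.67
```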
1. Some profits are negative.
2. The raw data is not consistent, so cleaning it before analysis is necessary.
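One inconsistency is visible in the `df.describe()` output: the median of both `budget` and `revenue` is 0, so at least half of the rows have a zero in each column. Assuming a zero means "not recorded" rather than a true value (the notebook does not confirm this), such rows distort the profit figures. A minimal sketch of excluding them, on a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up rows; zeros stand in for values assumed to be missing.
toy = pd.DataFrame({
    "budget":  [0, 100, 50, 0],
    "revenue": [0, 400, 0, 30],
})

# Share of rows with a zero budget.
zero_budget_share = (toy["budget"] == 0).mean()

# Treat zeros as missing, then keep only rows with both values present.
cleaned = toy.replace(0, np.nan).dropna(subset=["budget", "revenue"]).copy()
cleaned["profit"] = cleaned["revenue"] - cleaned["budget"]

print(zero_budget_share)  # 0.5
print(cleaned)
```

Whether to drop these rows or keep them depends on the question being asked; the sketch only shows the mechanics.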