Project: TMDB Movie Data Analysis

Introduction

What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over 100 million dollars to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests: can we predict which films will be highly rated, whether or not they are a commercial success? This data set contains information on about 5000 movies (4803 rows in this file) collected from The Movie Database (TMDb), including user ratings and revenue.

The dataset has the following features:

budget - The budget with which the movie was made.
genres - The genres of the movie: Action, Comedy, Thriller, etc.
homepage - A link to the homepage of the movie.
id - This is in fact the movie_id as in the first dataset.
keywords - The keywords or tags related to the movie.
original_language - The language in which the movie was made.
original_title - The title of the movie before translation or adaptation.
overview - A brief description of the movie.
popularity - A numeric quantity specifying the movie's popularity.
production_companies - The production houses of the movie.
production_countries - The countries in which it was produced.
release_date - The date on which it was released.
revenue - The worldwide revenue generated by the movie.
runtime - The running time of the movie in minutes.
spoken_languages - The languages spoken in the movie.
status - "Released" or "Rumored".
tagline - The movie's tagline.
title - The title of the movie.
vote_average - The average rating the movie received.
vote_count - The number of votes received.

In this data analysis project we will answer the following questions:

1: Do higher budget movies always generate big revenues?

2: Do higher budgets mean higher ratings?

3: Does the month of release affect a movie's revenue?

4: How are movie production revenues trending over the years?

So let's go!

Important note: GitHub's dark theme makes the plot text invisible, so enable the light theme for a better view.

Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Function Definitions

A function for labelling and showing a plot

def plot_data(title, xlabel, ylabel, grid):
    """Label the current matplotlib figure, optionally add a grid, and show it."""
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    if grid:
        plt.grid(True)
    plt.show()

Data Wrangling

General Properties

df = pd.read_csv('tmdb_5000_movies.csv')
df.head(3)

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.avatarmovie.com/	19995	[{"id": 1463, "name": "culture clash"}, {"id":...	en	Avatar	In the 22nd century, a paraplegic Marine is di...	150.437577	[{"name": "Ingenious Film Partners", "id": 289...	[{"iso_3166_1": "US", "name": "United States o...	2009-12-10	2787965087	162.0	[{"iso_639_1": "en", "name": "English"}, {"iso...	Released	Enter the World of Pandora.	Avatar	7.2	11800
1	300000000	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	http://disney.go.com/disneypictures/pirates/	285	[{"id": 270, "name": "ocean"}, {"id": 726, "na...	en	Pirates of the Caribbean: At World's End	Captain Barbossa, long believed to be dead, ha...	139.082615	[{"name": "Walt Disney Pictures", "id": 2}, {"...	[{"iso_3166_1": "US", "name": "United States o...	2007-05-19	961000000	169.0	[{"iso_639_1": "en", "name": "English"}]	Released	At the end of the world, the adventure begins.	Pirates of the Caribbean: At World's End	6.9	4500
2	245000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.sonypictures.com/movies/spectre/	206647	[{"id": 470, "name": "spy"}, {"id": 818, "name...	en	Spectre	A cryptic message from Bond’s past sends him o...	107.376788	[{"name": "Columbia Pictures", "id": 5}, {"nam...	[{"iso_3166_1": "GB", "name": "United Kingdom"...	2015-10-26	880674609	148.0	[{"iso_639_1": "fr", "name": "Fran\u00e7ais"},...	Released	A Plan No One Escapes	Spectre	6.3	4466

df.shape

(4803, 20)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB

df.describe()

	budget	id	popularity	revenue	runtime	vote_average	vote_count
count	4.803000e+03	4803.000000	4803.000000	4.803000e+03	4801.000000	4803.000000	4803.000000
mean	2.904504e+07	57165.484281	21.492301	8.226064e+07	106.875859	6.092172	690.217989
std	4.072239e+07	88694.614033	31.816650	1.628571e+08	22.611935	1.194612	1234.585891
min	0.000000e+00	5.000000	0.000000	0.000000e+00	0.000000	0.000000	0.000000
25%	7.900000e+05	9014.500000	4.668070	0.000000e+00	94.000000	5.600000	54.000000
50%	1.500000e+07	14629.000000	12.921594	1.917000e+07	103.000000	6.200000	235.000000
75%	4.000000e+07	58610.500000	28.313505	9.291719e+07	118.000000	6.800000	737.000000
max	3.800000e+08	459488.000000	875.581305	2.787965e+09	338.000000	10.000000	13752.000000

Data Cleaning

Dropping unwanted columns

df.drop(['id','homepage','overview','tagline','keywords','production_companies','production_countries','spoken_languages'], axis=1 , inplace=True)

Converting release_date datatype from object to datetime64

df = df.astype({"release_date":"datetime64[ns]"})
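
As an alternative sketch of the same conversion, `pd.to_datetime` can turn values it cannot parse into `NaT` instead of raising an error. The toy frame below uses made-up values for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values): one valid date, one missing, one malformed.
toy = pd.DataFrame({"release_date": ["2009-12-10", None, "not a date"]})

# errors="coerce" converts anything that cannot be parsed into NaT
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")

# Count how many values ended up missing after the conversion
missing_dates = toy["release_date"].isnull().sum()
print(missing_dates)
```

Coercing makes bad values show up in the null check that follows, rather than stopping the notebook at the conversion step.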

Checking null values

df[df.release_date.isnull()]

	budget	genres	original_language	original_title	popularity	release_date	revenue	runtime	status	title	vote_average	vote_count
4553	0	[]	en	America Is Still the Place	0.0	NaT	0	0.0	Released	America Is Still the Place	0.0	0

df[df.runtime.isnull()]

	budget	genres	original_language	original_title	popularity	release_date	revenue	runtime	status	title	vote_average	vote_count
2656	15000000	[{"id": 18, "name": "Drama"}]	it	Chiamatemi Francesco - Il Papa della gente	0.738646	2015-12-03	0	NaN	Released	Chiamatemi Francesco - Il Papa della gente	7.3	12
4140	2	[{"id": 99, "name": "Documentary"}]	en	To Be Frank, Sinatra at 100	0.050625	2015-12-12	0	NaN	Released	To Be Frank, Sinatra at 100	0.0	0

Dropping rows with null values, since these rows do not have enough data for analysis

df.dropna(inplace = True)
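
Before dropping, it can help to record how many nulls each column has and how many rows are lost. In the notebook the row count goes from 4803 to 4800. The sketch below shows the bookkeeping on a small made-up frame; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one null runtime and one null release date
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "runtime": [120.0, np.nan, 95.0, 101.0],
    "release_date": pd.to_datetime(["2015-01-01", "2015-06-01", None, "2016-03-01"]),
})

# Nulls per column, so we know which columns drive the row loss
null_counts = toy.isnull().sum()

# Drop any row that has at least one null and count what was removed
rows_before = len(toy)
cleaned = toy.dropna()
rows_dropped = rows_before - len(cleaned)

print(null_counts)
print(rows_dropped)
```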

Separating release date into year and month of release

df['year_of_release'] = df['release_date'].dt.to_period("y")
df['month_of_release'] = df['release_date'].dt.month

After cleaning, this is what the data looks like

df.head(3)

	budget	genres	original_language	original_title	popularity	release_date	revenue	runtime	status	title	vote_average	vote_count	year_of_release	month_of_release
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	en	Avatar	150.437577	2009-12-10	2787965087	162.0	Released	Avatar	7.2	11800	2009	12
1	300000000	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	en	Pirates of the Caribbean: At World's End	139.082615	2007-05-19	961000000	169.0	Released	Pirates of the Caribbean: At World's End	6.9	4500	2007	5
2	245000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	en	Spectre	107.376788	2015-10-26	880674609	148.0	Released	Spectre	6.3	4466	2015	10

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 14 columns):
budget               4800 non-null int64
genres               4800 non-null object
original_language    4800 non-null object
original_title       4800 non-null object
popularity           4800 non-null float64
release_date         4800 non-null datetime64[ns]
revenue              4800 non-null int64
runtime              4800 non-null float64
status               4800 non-null object
title                4800 non-null object
vote_average         4800 non-null float64
vote_count           4800 non-null int64
year_of_release      4800 non-null object
month_of_release     4800 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage: 562.5+ KB

df.hist(figsize=(15,8));

Exploratory Data Analysis

Research Question 1: Do higher budget movies always generate big revenues?

df.plot(x='budget',y='revenue',kind='scatter',figsize=(10,5));
plot_data('Relation Between Budget and Revenue', 'Budget in 100s of million ($)', 'Revenue in billions ($)', False)

From the previous plot

1: Do higher budget movies always generate big revenues?

Ignoring the outliers, the graph suggests the answer is no: a higher budget does not guarantee higher revenue.
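
The visual impression can be backed with a number. A minimal sketch, assuming the same `budget` and `revenue` columns: compute the Pearson correlation and count high-budget movies that earned less than they cost. The toy values are made up, and the 100 million threshold is an illustrative choice.

```python
import pandas as pd

# Toy data (hypothetical values), not rows from the dataset
toy = pd.DataFrame({
    "budget":  [1e6, 5e6, 2e7, 1e8, 2e8],
    "revenue": [3e6, 2e6, 9e7, 5e7, 6e8],
})

# Series.corr uses the Pearson coefficient by default
r = toy["budget"].corr(toy["revenue"])

# High-budget movies whose revenue did not cover the budget
flops = toy[(toy["budget"] >= 1e8) & (toy["revenue"] < toy["budget"])]

print(round(r, 2))
print(len(flops))
```

A positive correlation together with a non-empty `flops` frame would match the reading above: budgets and revenues tend to move together, but not always.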

Research Question 2: Do higher budgets mean higher ratings?

df.plot(x='budget',y='vote_average',kind='scatter',figsize=(10,6));
plot_data('Relation Between budget and rating', 'budget in 100 millions ($)', 'rating', True)

From the previous plot

2: Do higher budgets mean higher ratings?

Not exactly, but as the budget increases, the likelihood of a low rating decreases: when the budget is over 100 million dollars, most of the movies are rated above 6.
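
The "most movies over 100 million dollars are rated above 6" observation can be checked directly. The sketch below splits movies at the 100 million mark and computes the share rated above 6 in each group. The toy values and the `budget_band` and `above_6` column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), not rows from the dataset
toy = pd.DataFrame({
    "budget": [5e6, 2e7, 6e7, 1.2e8, 1.5e8, 2.5e8],
    "vote_average": [4.8, 7.1, 5.5, 6.4, 6.9, 5.9],
})

# Label each movie by whether its budget exceeds 100 million dollars
toy["budget_band"] = np.where(toy["budget"] > 100_000_000, "over 100M", "100M or less")

# 1 if the movie is rated above 6, else 0, so the group mean is a share
toy["above_6"] = (toy["vote_average"] > 6).astype(int)

share_above_6 = toy.groupby("budget_band")["above_6"].mean()
print(share_above_6)
```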

Research Question 3: Does the month of release affect a movie's revenue?

# Average revenue for each release month (1-12)
revenue_by_month = df.groupby('month_of_release')['revenue'].mean()
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

# New dataframe holding the month names and the average revenue for each month
new_df = pd.DataFrame()
new_df['months'] = months
new_df['revenues'] = list(revenue_by_month)

new_df.plot(x='months', y='revenues', kind='bar', figsize=(10,6));
plot_data('Relation Between Month of Release and Revenues', 'Months', 'Average revenue ($)', True)

Average revenue for each month

new_df

	months	revenues
0	Jan	2.863406e+07
1	Feb	5.613842e+07
2	Mar	7.310316e+07
3	Apr	7.392762e+07
4	May	1.339301e+08
5	Jun	1.522845e+08
6	Jul	1.115768e+08
7	Aug	5.580475e+07
8	Sep	3.999196e+07
9	Oct	5.289629e+07
10	Nov	1.288861e+08
11	Dec	1.118489e+08

From the previous plot

3: Is there a relation between month of release and revenues?

Yes. Movies released in some months have a higher average revenue: June, May, November, December, and July, in descending order.
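
The month ranking can also be produced in code. Pairing a fixed list of twelve month names with `list(revenue_by_month)` only works when every month appears in the data, so this sketch reindexes to months 1 to 12 first. The toy values are made up.

```python
import calendar
import pandas as pd

# Toy data (hypothetical values); several months have no releases at all
toy = pd.DataFrame({
    "month_of_release": [1, 1, 5, 6, 6, 12],
    "revenue": [10.0, 30.0, 200.0, 150.0, 350.0, 120.0],
})

by_month = (
    toy.groupby("month_of_release")["revenue"].mean()
       .reindex(range(1, 13))  # months with no releases become NaN
)

# Replace month numbers with abbreviated names (Jan, Feb, ...)
by_month.index = [calendar.month_abbr[m] for m in by_month.index]

# Rank the months that have data, highest average revenue first
ranked = by_month.dropna().sort_values(ascending=False)
print(ranked)
```

Reindexing keeps each month name aligned with its own average even if a month is missing.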

Research Question 4: How are movie production revenues trending over the years?

# Average revenue for each release year
revenue_by_year = df.groupby('year_of_release')['revenue'].mean()

# New dataframe holding the average revenue for each year
df_years = pd.DataFrame()
df_years['revenues'] = list(revenue_by_year)
df_years['year'] = list(revenue_by_year.keys())

df_years.plot(x='year', y='revenues', kind='bar', figsize=(18,6));
plot_data('Movie production revenues varying over the years', 'Years', 'Average revenue ($)', True)

From the previous plot

4: How are movie production revenues trending over the years?

Movie revenues are rising over the years, as we can see from the graph.
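
The plot above shows average revenue per year, which is a different quantity from the number of movies produced. A sketch that reports both side by side, using made-up values; the `movies`, `mean_revenue` and `total_revenue` column names are chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical values), not rows from the dataset
toy = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2000-03-01", "2000-07-15", "2010-05-20", "2010-06-11", "2010-12-17"]
    ),
    "revenue": [50.0, 150.0, 300.0, 100.0, 500.0],
})

# Group by calendar year and compute count, mean and sum in one pass
per_year = (
    toy.groupby(toy["release_date"].dt.year)["revenue"]
       .agg(movies="count", mean_revenue="mean", total_revenue="sum")
)
print(per_year)
```

Looking at `movies` and `mean_revenue` together helps tell apart a year with more releases from a year with higher revenue per release.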

Conclusions

In movie production, a higher budget doesn't always mean that the movie will earn high revenue or even get a high rating, but as the budget increases, the likelihood of a low rating decreases.
The release date of a movie affects its success: average revenue varies widely by release month.
Movie production is rising over the years, and revenues are increasing along with movie budgets.

Limitations

No normalization, exchange rate, or currency conversion is considered in this analysis; the analysis is limited to the raw numerical values of revenue.
Dropping missing or null values from the variables of interest might skew the analysis and could introduce unintentional bias into the relationships being analyzed.
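
A related check: the `describe()` output earlier shows a minimum budget of 0 and a 25th-percentile revenue of 0. One possible reading (an assumption, not something the dataset states here) is that zeros stand in for unknown values, which `dropna()` does not catch. The sketch below counts such zeros on made-up data.

```python
import pandas as pd

# Toy data (hypothetical values) where 0 may mean "unknown"
toy = pd.DataFrame({
    "budget":  [0, 15_000_000, 40_000_000, 0],
    "revenue": [0, 0, 92_000_000, 5_000_000],
})

# Number of zero entries in each money column
zero_counts = (toy[["budget", "revenue"]] == 0).sum()

# Rows where both budget and revenue are reported as positive
complete = toy[(toy["budget"] > 0) & (toy["revenue"] > 0)]

print(zero_counts)
print(len(complete))
```

Whether to exclude such rows is a judgment call. Reporting how many there are at least makes the size of the possible bias visible.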

from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])

Moha7000/TMDB-Movie-Data-Analysis
