Moha7000/TMDB-Movie-Data-Analysis

Project: TMDB Movie Data Analysis

Introduction

What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over 100 million dollars to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests: can we predict which films will be highly rated, whether or not they are a commercial success?

This data set contains information about roughly 5,000 movies (4,803 rows in the file used here) collected from The Movie Database (TMDb), including user ratings and revenue.

The dataset has the following features:

  • budget - The budget with which the movie was made.
  • genres - The genres of the movie: Action, Comedy, Thriller, etc.
  • homepage - A link to the homepage of the movie.
  • id - The movie's identifier; this is in fact the movie_id as in the first dataset.
  • keywords - The keywords or tags related to the movie.
  • original_language - The language in which the movie was made.
  • original_title - The title of the movie before translation or adaptation.
  • overview - A brief description of the movie.
  • popularity - A numeric quantity specifying the movie's popularity.
  • production_companies - The production houses of the movie.
  • production_countries - The countries in which it was produced.
  • release_date - The date on which it was released.
  • revenue - The worldwide revenue generated by the movie.
  • runtime - The running time of the movie in minutes.
  • spoken_languages - The languages spoken in the movie.
  • status - "Released" or "Rumored".
  • tagline - The movie's tagline.
  • title - The title of the movie.
  • vote_average - The average rating the movie received.
  • vote_count - The number of votes received.
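
Several of these columns (genres, keywords, production_companies and others) hold JSON-encoded lists rather than plain values, as the preview further down shows. The following is a minimal sketch of how such a column could be decoded into a list of names. The sample rows and the `genre_names` helper are made up for illustration and are not part of the original notebook.

```python
import json
import pandas as pd

# Illustrative rows only: the strings mimic the JSON layout of the genres column.
sample = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[{"id": 18, "name": "Drama"}]',
        '[]',
    ],
})

def genre_names(raw):
    """Return the list of genre names encoded in one JSON string (hypothetical helper)."""
    return [entry["name"] for entry in json.loads(raw)]

# Decode every row; an empty JSON list becomes an empty Python list.
sample["genre_names"] = sample["genres"].apply(genre_names)
print(sample[["title", "genre_names"]])
```

The analysis below does not rely on this decoding, since it only uses the numeric columns and the release date.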

In this data analysis project we will answer the following questions:

1: Do higher budget movies always generate big revenues?

2: Do higher budgets mean higher ratings?

3: Does the month in which a movie is released affect its revenue?

4: How are movie production revenues trending over the years?

So let's go!

Important note: GitHub's dark theme makes the plots' text invisible, so enable the light theme for a better view.

Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Functions Definitions

A function for labeling and showing a plot

def plot_data(title, xlabel, ylabel, grid):
    """Label the current figure, optionally add a grid, then show it."""
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    if grid:
        plt.grid(True)
    plt.show()

Data Wrangling

General Properties

df = pd.read_csv('tmdb_5000_movies.csv')
df.head(3)
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.avatarmovie.com/ 19995 [{"id": 1463, "name": "culture clash"}, {"id":... en Avatar In the 22nd century, a paraplegic Marine is di... 150.437577 [{"name": "Ingenious Film Partners", "id": 289... [{"iso_3166_1": "US", "name": "United States o... 2009-12-10 2787965087 162.0 [{"iso_639_1": "en", "name": "English"}, {"iso... Released Enter the World of Pandora. Avatar 7.2 11800
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... http://disney.go.com/disneypictures/pirates/ 285 [{"id": 270, "name": "ocean"}, {"id": 726, "na... en Pirates of the Caribbean: At World's End Captain Barbossa, long believed to be dead, ha... 139.082615 [{"name": "Walt Disney Pictures", "id": 2}, {"... [{"iso_3166_1": "US", "name": "United States o... 2007-05-19 961000000 169.0 [{"iso_639_1": "en", "name": "English"}] Released At the end of the world, the adventure begins. Pirates of the Caribbean: At World's End 6.9 4500
2 245000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... http://www.sonypictures.com/movies/spectre/ 206647 [{"id": 470, "name": "spy"}, {"id": 818, "name... en Spectre A cryptic message from Bond’s past sends him o... 107.376788 [{"name": "Columbia Pictures", "id": 5}, {"nam... [{"iso_3166_1": "GB", "name": "United Kingdom"... 2015-10-26 880674609 148.0 [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... Released A Plan No One Escapes Spectre 6.3 4466
df.shape
(4803, 20)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB
df.describe()
budget id popularity revenue runtime vote_average vote_count
count 4.803000e+03 4803.000000 4803.000000 4.803000e+03 4801.000000 4803.000000 4803.000000
mean 2.904504e+07 57165.484281 21.492301 8.226064e+07 106.875859 6.092172 690.217989
std 4.072239e+07 88694.614033 31.816650 1.628571e+08 22.611935 1.194612 1234.585891
min 0.000000e+00 5.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000
25% 7.900000e+05 9014.500000 4.668070 0.000000e+00 94.000000 5.600000 54.000000
50% 1.500000e+07 14629.000000 12.921594 1.917000e+07 103.000000 6.200000 235.000000
75% 4.000000e+07 58610.500000 28.313505 9.291719e+07 118.000000 6.800000 737.000000
max 3.800000e+08 459488.000000 875.581305 2.787965e+09 338.000000 10.000000 13752.000000

Data Cleaning

Dropping unwanted columns

df.drop(['id','homepage','overview','tagline','keywords','production_companies','production_countries','spoken_languages'], axis=1 , inplace=True)

Converting release_date datatype from object to datetime64

df = df.astype({"release_date":"datetime64[ns]"})

Checking null values

df[df.release_date.isnull()]
budget genres original_language original_title popularity release_date revenue runtime status title vote_average vote_count
4553 0 [] en America Is Still the Place 0.0 NaT 0 0.0 Released America Is Still the Place 0.0 0
df[df.runtime.isnull()]
budget genres original_language original_title popularity release_date revenue runtime status title vote_average vote_count
2656 15000000 [{"id": 18, "name": "Drama"}] it Chiamatemi Francesco - Il Papa della gente 0.738646 2015-12-03 0 NaN Released Chiamatemi Francesco - Il Papa della gente 7.3 12
4140 2 [{"id": 99, "name": "Documentary"}] en To Be Frank, Sinatra at 100 0.050625 2015-12-12 0 NaN Released To Be Frank, Sinatra at 100 0.0 0
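
The two checks above inspect one column at a time. As a hedged alternative, the sketch below counts the missing values of every column in one pass and lists the rows that `dropna()` would discard. The sample frame is made up; only its column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; only the column names follow the dataset.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_date": pd.to_datetime(["2009-12-10", None, "2015-12-03", "2015-12-12"]),
    "runtime": [162.0, 0.0, np.nan, np.nan],
})

# Number of missing values in each column.
null_counts = sample.isnull().sum()
print(null_counts)

# Rows with at least one missing value, i.e. the rows dropna() would remove.
rows_with_nulls = sample[sample.isnull().any(axis=1)]
print(rows_with_nulls)
```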

Dropping rows with null values, since they don't hold enough data for analysis

df.dropna(inplace = True)

Separating release date into year and month of release

df['year_of_release'] = df['release_date'].dt.to_period("y")
df['month_of_release'] = df['release_date'].dt.month
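
As a side note, here is a minimal sketch of another way to do the date conversion and the year/month split, assuming the dates arrive as strings. It uses `pd.to_datetime` with `errors="coerce"`, which turns unparseable entries into `NaT` instead of raising, and `dt.year`, which yields plain integers rather than period objects. The sample values are made up, and this is not the approach the notebook itself takes.

```python
import pandas as pd

# Made-up date strings; the last one is deliberately invalid.
dates = pd.Series(["2009-12-10", "2007-05-19", "not a date"])

# Unparseable entries become NaT instead of raising an error.
parsed = pd.to_datetime(dates, errors="coerce")

# Drop the missing dates, then split into integer year and month.
valid = parsed.dropna()
years = valid.dt.year
months = valid.dt.month
print(pd.DataFrame({"year": years, "month": months}))
```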

After cleaning, this is what the data looks like

df.head(3)
budget genres original_language original_title popularity release_date revenue runtime status title vote_average vote_count year_of_release month_of_release
0 237000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... en Avatar 150.437577 2009-12-10 2787965087 162.0 Released Avatar 7.2 11800 2009 12
1 300000000 [{"id": 12, "name": "Adventure"}, {"id": 14, "... en Pirates of the Caribbean: At World's End 139.082615 2007-05-19 961000000 169.0 Released Pirates of the Caribbean: At World's End 6.9 4500 2007 5
2 245000000 [{"id": 28, "name": "Action"}, {"id": 12, "nam... en Spectre 107.376788 2015-10-26 880674609 148.0 Released Spectre 6.3 4466 2015 10
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 14 columns):
budget               4800 non-null int64
genres               4800 non-null object
original_language    4800 non-null object
original_title       4800 non-null object
popularity           4800 non-null float64
release_date         4800 non-null datetime64[ns]
revenue              4800 non-null int64
runtime              4800 non-null float64
status               4800 non-null object
title                4800 non-null object
vote_average         4800 non-null float64
vote_count           4800 non-null int64
year_of_release      4800 non-null object
month_of_release     4800 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage: 562.5+ KB
df.hist(figsize=(15,8));

[Figure: histograms of the numeric columns]

Exploratory Data Analysis

Research Question 1: Do higher budget movies always generate big revenues?

df.plot(x='budget',y='revenue',kind='scatter',figsize=(10,5));
plot_data('Relation Between Budget and Revenue', 'Budget in 100s of million ($)', 'Revenue in billions ($)', False)

[Figure: scatter plot, Relation Between Budget and Revenue]

From the previous plot

1: Do higher budget movies always generate big revenues?

Ignoring the outliers, the graph shows that the answer is no: a higher budget does not guarantee a big revenue.
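
A scatter plot gives a visual impression only. One way to put a number on it, sketched here under the assumption that a linear measure is adequate, is the Pearson correlation between the two columns. The values below are made up, so the resulting coefficient says nothing about the real dataset.

```python
import pandas as pd

# Made-up budget/revenue pairs, for illustration only.
sample = pd.DataFrame({
    "budget":  [10, 20, 30, 40, 50],
    "revenue": [15, 10, 60, 30, 90],
})

# Pearson correlation coefficient: +1 is a perfect increasing linear
# relation, 0 means no linear relation, -1 a perfect decreasing one.
r = sample["budget"].corr(sample["revenue"])
print(f"Pearson r between budget and revenue: {r:.2f}")
```

A coefficient clearly below 1 is consistent with the answer above: revenue can move with budget on average while individual high-budget movies still underperform.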

Research Question 2: Do higher budgets mean higher ratings?

df.plot(x='budget',y='vote_average',kind='scatter',figsize=(10,6));
plot_data('Relation Between budget and rating', 'budget in 100 millions ($)', 'rating', True)

[Figure: scatter plot, Relation Between budget and rating]

From the previous plot

2: Do higher budgets mean higher ratings?

Not exactly, but as the budget increases, the likelihood of a low rating decreases: when the budget is over 100 million dollars, most of the movies are rated above 6.
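
The claim that most movies above 100 million dollars are rated above 6 can be checked directly. The sketch below splits movies into two budget bands and computes the share rated above 6 in each. The sample values and the band labels are made up for illustration; running the same steps on `df` would give the real shares.

```python
import numpy as np
import pandas as pd

# Made-up budgets and ratings, for illustration only.
sample = pd.DataFrame({
    "budget": [5e6, 2e7, 8e7, 1.2e8, 1.5e8, 2.5e8],
    "vote_average": [4.8, 6.5, 5.9, 6.4, 7.1, 6.8],
})

# Label each movie by whether its budget exceeds 100 million dollars.
band = pd.Series(
    np.where(sample["budget"] > 100_000_000, "over 100M", "100M or less"),
    index=sample.index,
    name="budget_band",
)

# The mean of a boolean column is the fraction of True values,
# so this is the share of movies rated above 6 in each band.
share_above_6 = (sample["vote_average"] > 6).groupby(band).mean()
print(share_above_6)
```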

Research Question 3: Does the month in which a movie is released affect its revenue?

revenue_by_month = df.groupby('month_of_release')['revenue'].mean()
months=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
# Create a new DataFrame that holds the month names and the average revenue for each month
new_df = pd.DataFrame()

new_df['months'] = months
new_df['revenues'] = list(revenue_by_month)

new_df.plot(x='months',y='revenues',kind='bar',figsize=(10,6));

plot_data('Relation Between Month of Release and Revenues', 'Months', 'Average revenue ($)', True)

[Figure: bar chart, Relation Between Month of Release and Revenues]

Average revenue for each month

new_df
months revenues
0 Jan 2.863406e+07
1 Feb 5.613842e+07
2 Mar 7.310316e+07
3 Apr 7.392762e+07
4 May 1.339301e+08
5 Jun 1.522845e+08
6 Jul 1.115768e+08
7 Aug 5.580475e+07
8 Sep 3.999196e+07
9 Oct 5.289629e+07
10 Nov 1.288861e+08
11 Dec 1.118489e+08

From the previous plot

3: Is there a relation between month of release and revenues?

Yes, movies released in some months have a higher average revenue: in descending order, June, May, November, December, and July.
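
Pairing a hard-coded list of twelve month names with `list(revenue_by_month)` works only if every month appears in the data. The sketch below is one possible way to make the alignment explicit: it reindexes the grouped result to months 1 through 12 and takes the labels from the standard library's `calendar.month_abbr` (English abbreviations under the default locale). The sample values are made up.

```python
import calendar
import pandas as pd

# Made-up rows covering only three months, for illustration.
sample = pd.DataFrame({
    "month_of_release": [1, 1, 6, 6, 12],
    "revenue": [10.0, 30.0, 100.0, 200.0, 50.0],
})

# Average revenue per month; reindex so that months without any
# movie show up as NaN instead of silently shifting the labels.
monthly = (
    sample.groupby("month_of_release")["revenue"]
    .mean()
    .reindex(range(1, 13))
)

# Replace the month numbers with their abbreviated names.
monthly.index = [calendar.month_abbr[m] for m in monthly.index]
print(monthly)
```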

Research Question 4: How are movie production revenues trending over the years?

revenue_by_year = df.groupby('year_of_release')['revenue'].mean()

df_years = pd.DataFrame() # Creating a new DataFrame that holds the average revenue for each year

df_years['revenues'] = list(revenue_by_year)
df_years['year'] = list(revenue_by_year.keys())

df_years.plot(x='year',y='revenues',kind='bar',figsize=(18,6));
plot_data('Movie production revenues varying over the years', 'Years', 'Average revenue ($)', True)

[Figure: bar chart, Movie production revenues varying over the years]

From the previous plot

4: How are movie production revenues trending over the years?

The graph, which shows the average revenue per release year, indicates that movie production revenues are rising over the years.
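
The plot is built from the yearly mean only, so it cannot separate "more movies" from "more revenue per movie". A minimal sketch of how the two could be viewed side by side, using named aggregation on made-up data, follows. The column names `movies`, `mean_revenue` and `total_revenue` are chosen here for illustration.

```python
import pandas as pd

# Made-up release dates and revenues, for illustration only.
sample = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2000-01-15", "2000-07-01", "2001-03-10", "2001-05-05", "2001-11-20"]
    ),
    "revenue": [10.0, 30.0, 40.0, 50.0, 60.0],
})

# Per release year: number of movies, average revenue and total revenue.
yearly = (
    sample.groupby(sample["release_date"].dt.year)["revenue"]
    .agg(movies="count", mean_revenue="mean", total_revenue="sum")
)
print(yearly)
```

Years with very few movies can produce a misleading average, so the `movies` column is worth checking before reading a trend into `mean_revenue`.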

Conclusions

  • In movie production, a higher budget doesn't always mean higher revenue or even a higher rating, but as the budget increases, the likelihood of a low rating decreases.

  • The release month of a movie is associated with its revenue: movies released in May, June, July, November, and December earn more on average.

  • Movie production is rising over the years, and revenues are increasing along with, of course, the budgets of the movies.

Limitations

  • No normalization, exchange-rate adjustment, or currency conversion was applied in this analysis, so it is limited to the raw numerical values of revenue.

  • Dropping missing or null values from the variables of interest might skew the analysis and introduce unintended bias into the relationships being analyzed.

