What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over 100 million dollars to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests: can we predict which films will be highly rated, whether or not they are a commercial success?

This data set contains information about 5000 movies collected from The Movie Database (TMDb), including user ratings and revenue.
- budget - The budget with which the movie was made.
- genres - The genres of the movie: Action, Comedy, Thriller, etc.
- homepage - A link to the homepage of the movie.
- id - This is in fact the movie_id, as in the first dataset.
- keywords - The keywords or tags related to the movie.
- original_language - The language in which the movie was made.
- original_title - The title of the movie before translation or adaptation.
- overview - A brief description of the movie.
- popularity - A numeric quantity specifying the movie popularity.
- production_companies - The production house of the movie.
- production_countries - The country in which it was produced.
- release_date - The date on which it was released.
- revenue - The worldwide revenue generated by the movie.
- runtime - The running time of the movie in minutes.
- status - "Released" or "Rumored".
- tagline - Movie's tagline.
- title - Title of the movie.
- vote_average - The average rating the movie received.
- vote_count - The number of votes received.
Important note: GitHub's dark theme makes the plot text invisible, so enable the light theme for a better view.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Helper that labels the current plot, optionally adds a grid, and shows it
def plot_data(title, xlabel, ylabel, grid):
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    if grid:
        plt.grid(True)
    plt.show()

df = pd.read_csv('tmdb_5000_movies.csv')
df.head(3)
| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":... | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289... | [{"iso_3166_1": "US", "name": "United States o... | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso... | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |
| 1 | 300000000 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | http://disney.go.com/disneypictures/pirates/ | 285 | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"... | [{"iso_3166_1": "US", "name": "United States o... | 2007-05-19 | 961000000 | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 |
| 2 | 245000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | http://www.sonypictures.com/movies/spectre/ | 206647 | [{"id": 470, "name": "spy"}, {"id": 818, "name... | en | Spectre | A cryptic message from Bond’s past sends him o... | 107.376788 | [{"name": "Columbia Pictures", "id": 5}, {"nam... | [{"iso_3166_1": "GB", "name": "United Kingdom"... | 2015-10-26 | 880674609 | 148.0 | [{"iso_639_1": "fr", "name": "Fran\u00e7ais"},... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 |
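The genres column in the preview above is stored as a JSON-encoded string rather than a plain label. A minimal sketch of how such a value could be turned into a list of genre names, assuming the field is always valid JSON; the helper name `genre_names` is hypothetical and not part of the original notebook:

```python
import json

# Sample value in the shape of the genres column; the ids and names
# (28 = Action, 12 = Adventure) are taken from the preview rows.
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'


def genre_names(raw):
    """Return the genre names held in a JSON-encoded genres string."""
    return [entry["name"] for entry in json.loads(raw)]


names = genre_names(raw_genres)
empty = genre_names("[]")  # some rows hold an empty list
print(names)
```

On the full DataFrame, the same helper could be applied with `df['genres'].apply(genre_names)`, provided the column has not been dropped first.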
df.shape
(4803, 20)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget 4803 non-null int64
genres 4803 non-null object
homepage 1712 non-null object
id 4803 non-null int64
keywords 4803 non-null object
original_language 4803 non-null object
original_title 4803 non-null object
overview 4800 non-null object
popularity 4803 non-null float64
production_companies 4803 non-null object
production_countries 4803 non-null object
release_date 4802 non-null object
revenue 4803 non-null int64
runtime 4801 non-null float64
spoken_languages 4803 non-null object
status 4803 non-null object
tagline 3959 non-null object
title 4803 non-null object
vote_average 4803 non-null float64
vote_count 4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB
df.describe()
| | budget | id | popularity | revenue | runtime | vote_average | vote_count |
|---|---|---|---|---|---|---|---|
| count | 4.803000e+03 | 4803.000000 | 4803.000000 | 4.803000e+03 | 4801.000000 | 4803.000000 | 4803.000000 |
| mean | 2.904504e+07 | 57165.484281 | 21.492301 | 8.226064e+07 | 106.875859 | 6.092172 | 690.217989 |
| std | 4.072239e+07 | 88694.614033 | 31.816650 | 1.628571e+08 | 22.611935 | 1.194612 | 1234.585891 |
| min | 0.000000e+00 | 5.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 7.900000e+05 | 9014.500000 | 4.668070 | 0.000000e+00 | 94.000000 | 5.600000 | 54.000000 |
| 50% | 1.500000e+07 | 14629.000000 | 12.921594 | 1.917000e+07 | 103.000000 | 6.200000 | 235.000000 |
| 75% | 4.000000e+07 | 58610.500000 | 28.313505 | 9.291719e+07 | 118.000000 | 6.800000 | 737.000000 |
| max | 3.800000e+08 | 459488.000000 | 875.581305 | 2.787965e+09 | 338.000000 | 10.000000 | 13752.000000 |
df.drop(['id','homepage','overview','tagline','keywords','production_companies','production_countries','spoken_languages'], axis=1, inplace=True)
df = df.astype({"release_date":"datetime64[ns]"})
df[df.release_date.isnull()]
| | budget | genres | original_language | original_title | popularity | release_date | revenue | runtime | status | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4553 | 0 | [] | en | America Is Still the Place | 0.0 | NaT | 0 | 0.0 | Released | America Is Still the Place | 0.0 | 0 |
df[df.runtime.isnull()]
| | budget | genres | original_language | original_title | popularity | release_date | revenue | runtime | status | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2656 | 15000000 | [{"id": 18, "name": "Drama"}] | it | Chiamatemi Francesco - Il Papa della gente | 0.738646 | 2015-12-03 | 0 | NaN | Released | Chiamatemi Francesco - Il Papa della gente | 7.3 | 12 |
| 4140 | 2 | [{"id": 99, "name": "Documentary"}] | en | To Be Frank, Sinatra at 100 | 0.050625 | 2015-12-12 | 0 | NaN | Released | To Be Frank, Sinatra at 100 | 0.0 | 0 |
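The per-column null checks above can also be done in one pass. The sketch below uses a tiny stand-in frame (its values are illustrative, not taken from the data set) to count missing values per column, list the affected rows, and drop only the rows that lack a runtime or release date:

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; the values are illustrative assumptions.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "runtime": [120.0, np.nan, 95.0],
    "release_date": pd.to_datetime(["2015-12-03", "2015-12-12", None]),
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()

# Every row that has at least one missing value
rows_with_missing = sample[sample.isnull().any(axis=1)]

# Drop rows only when one of the named columns is missing
cleaned = sample.dropna(subset=["runtime", "release_date"])
print(missing_per_column)
```

Passing `subset` is a design choice: it keeps a row from being discarded because of a gap in a column that the analysis never uses.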
df.dropna(inplace=True)
df['year_of_release'] = df['release_date'].dt.to_period("y")
df['month_of_release'] = df['release_date'].dt.month
df.head(3)
| | budget | genres | original_language | original_title | popularity | release_date | revenue | runtime | status | title | vote_average | vote_count | year_of_release | month_of_release |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | en | Avatar | 150.437577 | 2009-12-10 | 2787965087 | 162.0 | Released | Avatar | 7.2 | 11800 | 2009 | 12 |
| 1 | 300000000 | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | en | Pirates of the Caribbean: At World's End | 139.082615 | 2007-05-19 | 961000000 | 169.0 | Released | Pirates of the Caribbean: At World's End | 6.9 | 4500 | 2007 | 5 |
| 2 | 245000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | en | Spectre | 107.376788 | 2015-10-26 | 880674609 | 148.0 | Released | Spectre | 6.3 | 4466 | 2015 | 10 |
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 14 columns):
budget 4800 non-null int64
genres 4800 non-null object
original_language 4800 non-null object
original_title 4800 non-null object
popularity 4800 non-null float64
release_date 4800 non-null datetime64[ns]
revenue 4800 non-null int64
runtime 4800 non-null float64
status 4800 non-null object
title 4800 non-null object
vote_average 4800 non-null float64
vote_count 4800 non-null int64
year_of_release 4800 non-null object
month_of_release 4800 non-null int64
dtypes: datetime64[ns](1), float64(3), int64(4), object(6)
memory usage: 562.5+ KB
df.hist(figsize=(15,8));
df.plot(x='budget',y='revenue',kind='scatter',figsize=(10,5));
plot_data('Relation Between Budget and Revenue', 'Budget in 100s of millions ($)', 'Revenue in billions ($)', False)

1: Do higher-budget movies always generate big revenues?
No. Ignoring the outliers, the scatter plot shows that a higher budget does not guarantee higher revenue.
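A number can complement the visual reading of the scatter plot. The `describe()` output earlier shows a minimum budget of 0 and a 25th-percentile revenue of 0, which suggests that zeros may stand for unreported figures; that interpretation is an assumption, not something the data set states. Under that assumption, a minimal sketch on made-up values filters out the zeros before computing the Pearson correlation:

```python
import pandas as pd

# Made-up values for illustration; zeros mimic unreported figures.
sample = pd.DataFrame({
    "budget":  [0, 50_000_000, 100_000_000, 200_000_000, 150_000_000],
    "revenue": [0, 20_000_000, 300_000_000, 150_000_000, 0],
})

# Keep only rows where both figures are reported (non-zero)
reported = sample[(sample["budget"] > 0) & (sample["revenue"] > 0)]

# Pearson correlation between budget and revenue on the filtered rows
correlation = reported["budget"].corr(reported["revenue"])
print(len(reported), correlation)
```

A coefficient well below 1 would be consistent with the answer above: budget and revenue move together only loosely.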
df.plot(x='budget',y='vote_average',kind='scatter',figsize=(10,6));
plot_data('Relation Between Budget and Rating', 'Budget in 100s of millions ($)', 'Rating', True)

2: Do higher budgets mean higher ratings?
Not exactly, but as the budget increases, the chance of a low rating decreases: once the budget exceeds 100 million dollars, most movies are rated above 6.
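The claim about budgets over 100 million dollars can be checked directly as a proportion. The following is a sketch on made-up values; the 100 million threshold and the rating cutoff of 6 come from the text above, everything else is illustrative:

```python
import pandas as pd

# Made-up budgets and ratings for illustration only.
sample = pd.DataFrame({
    "budget": [5_000_000, 40_000_000, 120_000_000, 150_000_000, 250_000_000],
    "vote_average": [4.8, 6.5, 6.2, 5.9, 7.1],
})

# Movies with a budget above 100 million dollars
high_budget = sample[sample["budget"] > 100_000_000]

# Fraction of those movies rated above 6 (mean of a boolean Series)
share_above_six = (high_budget["vote_average"] > 6).mean()
print(share_above_six)
```

Running the same two lines on `df` would give the actual share for the data set.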
revenue_by_month = df.groupby('month_of_release')['revenue'].mean()
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
# New DataFrame holding each month's name and the average revenue for that month
new_df = pd.DataFrame()
new_df['months'] = months
new_df['revenues'] = list(revenue_by_month)
new_df.plot(x='months',y='revenues',kind='bar',figsize=(10,6));
plot_data('Relation Between Month of Release and Revenues', 'Months', 'Average revenue ($)', True)
new_df
| | months | revenues |
|---|---|---|
| 0 | Jan | 2.863406e+07 |
| 1 | Feb | 5.613842e+07 |
| 2 | Mar | 7.310316e+07 |
| 3 | Apr | 7.392762e+07 |
| 4 | May | 1.339301e+08 |
| 5 | Jun | 1.522845e+08 |
| 6 | Jul | 1.115768e+08 |
| 7 | Aug | 5.580475e+07 |
| 8 | Sep | 3.999196e+07 |
| 9 | Oct | 5.289629e+07 |
| 10 | Nov | 1.288861e+08 |
| 11 | Dec | 1.118489e+08 |
3: Is there a relation between month of release and revenues?
Yes. Movies released in some months earn more on average: June, May, November, December, and July have the highest average revenues, in that order.
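The monthly table above is built by pairing a hard-coded list of month names with the grouped averages, which works only when every month from 1 to 12 appears in the data. A slightly more defensive sketch, using made-up revenues, reindexes the grouped result to all twelve months and takes the labels from the standard `calendar` module:

```python
import calendar

import pandas as pd

# Made-up data in which most months are absent.
sample = pd.DataFrame({
    "month_of_release": [1, 1, 6, 12],
    "revenue": [10.0, 30.0, 100.0, 50.0],
})

# Average revenue per month; months with no movies become NaN
monthly = (
    sample.groupby("month_of_release")["revenue"]
    .mean()
    .reindex(range(1, 13))
)

# Replace month numbers with abbreviated names (Jan, Feb, ...)
monthly.index = [calendar.month_abbr[m] for m in monthly.index]
print(monthly)
```

Because the labels are derived from the index, a missing month shows up as a gap instead of shifting every later value onto the wrong name.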
revenue_by_year = df.groupby('year_of_release')['revenue'].mean()
# New DataFrame holding the average revenue for each year
df_years = pd.DataFrame()
df_years['revenues'] = list(revenue_by_year)
df_years['year'] = list(revenue_by_year.keys())
df_years.plot(x='year',y='revenues',kind='bar',figsize=(18,6));
plot_data('Movie production revenues varying over the years', 'Years', 'Average revenue ($)', True)

4: How are movie production revenues trending over the years?
Movie production revenues have risen over the years, as the graph shows.
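A yearly average can be dominated by a single film when a year contains only a few movies, so it may help to look at the count next to the mean. This is a sketch on made-up values, not a result from the data set:

```python
import pandas as pd

# Made-up release dates and revenues for illustration only.
sample = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["1990-05-01", "2010-06-01", "2010-07-01", "2015-11-01"]
    ),
    "revenue": [10.0, 100.0, 300.0, 250.0],
})

# Mean revenue and number of movies for each release year
yearly = (
    sample.groupby(sample["release_date"].dt.year)["revenue"]
    .agg(["mean", "count"])
)
print(yearly)
```

Years with a very small `count` could then be read with more caution when judging the trend.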
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])
0




