box-office predicts the box office revenue of feature-length films to identify stock market opportunities in media properties. The tool draws on critic reviews, film characteristics, production budget, and the studio and players involved. Producing a movie is a highly risky endeavor, and studios rely on only a handful of extremely expensive movies each year to remain profitable. Box office hits and misses correspond to short-term changes in the stock prices of media properties.
The project uses web scraping, Natural Language Processing (NLP) for sentiment analysis, and feature selection to identify the factors that best predict box office success. The machine learning techniques include ensemble methods (Random Forest and Boosting), a Recurrent Neural Network for sentiment analysis, and clustering methods for binning individual features, alongside big data analytics.
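As a rough illustration of how an ensemble method can rank candidate predictors, the sketch below fits a scikit-learn `RandomForestRegressor` and sorts features by importance. It is not the project's actual pipeline: the data is synthetic, and the feature names (`budget`, `critic_score`, `noise`) and the helper `rank_features` are assumptions made up for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rank_features(X, y, names, seed=0):
    """Fit a random forest and return (name, importance) pairs, highest first."""
    model = RandomForestRegressor(n_estimators=200, random_state=seed)
    model.fit(X, y)
    pairs = zip(names, model.feature_importances_)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Synthetic stand-in data (NOT real film data), built so that revenue
# depends strongly on budget, weakly on critic score, and not on "noise".
rng = np.random.default_rng(0)
n_films = 500
budget = rng.uniform(1, 200, n_films)       # hypothetical budget, $ millions
critic_score = rng.uniform(0, 1, n_films)   # hypothetical share of positive reviews
noise = rng.normal(size=n_films)            # unrelated to revenue by construction
revenue = 2.0 * budget + 50.0 * critic_score + rng.normal(scale=5.0, size=n_films)

X = np.column_stack([budget, critic_score, noise])
ranking = rank_features(X, revenue, ["budget", "critic_score", "noise"])

for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances like these are only one way to select features; they can favor high-cardinality features, so a real analysis would likely cross-check them against another method.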
box-office is a revenue prediction tool for feature-length films. The movie industry generates approximately $40 billion in revenue annually worldwide. However, investing in the production of a feature-length film is a highly risky endeavor, and studios rely on only a handful of extremely expensive movies each year to remain profitable. Over the last decade, 80% of the industry’s profits were generated by just 6% of the films released, and 78% of movies lost money over the same period.
According to Jack Valenti, President and CEO of the Motion Picture Association of America (MPAA): “No one can tell you how a movie is going to do in the marketplace. Not until the film opens in a darkened theatre and sparks fly up between the screen and the audience.”
This project aims to identify the features that predict box office revenue. Knowing them helps studios and investors measure the risk of producing a given film, and helps stakeholders plan and execute movies that audiences will enjoy and that are financially profitable.
The aim of box-office is to:
- Predict the total revenue of feature-length films by investigating the extent to which the players involved and the movie's characteristics determine its overall market success
- Examine the impact of positive or negative critic reviews
- Examine the relationship between weekend box-office revenues and the stock prices of the media and entertainment companies involved
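One simple way to start on the last goal is to correlate weekend revenue with the stock return that follows it. The sketch below is only an assumption about how such a check could look, not the project's method: `pct_returns` and `pearson` are hypothetical helpers, and all figures are made up for illustration.

```python
from math import sqrt


def pct_returns(prices):
    """Convert a price series into simple period-over-period returns."""
    return [(later - earlier) / earlier for earlier, later in zip(prices, prices[1:])]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up numbers: weekend gross ($ millions) for five releases, and the
# studio's closing price before and after each of those weekends.
weekend_gross = [12.0, 85.0, 40.0, 5.0, 60.0]
prices = [100.0, 99.5, 102.0, 102.8, 101.9, 103.1]

returns = pct_returns(prices)           # one return per weekend
print(pearson(weekend_gross, returns))
```

A raw correlation like this ignores overall market movement; comparing against an index such as the S&P 500 would be a natural refinement.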
- Rotten Tomatoes (critic reviews): a film review aggregator that provides access to reviews from a variety of critics in the United States.
- IMDb (general movie information: ID, title, year, actors, producers, directors, writers, lifetime earnings): an online database of information related to films and television programs, including cast, production crew, fictional characters, biographies, plot summaries, trivia, and reviews.
- FRED, Federal Reserve Economic Data (macroeconomic indicators): a database maintained by the Research division of the Federal Reserve Bank of St. Louis.
- Box Office Mojo (daily box office revenues, production budget, premiere date, genre, production company/studio): tracks box office revenue in a systematic, algorithmic way.
- Yahoo Finance API (S&P 500 daily index, media company stock prices): provides stock data.
- IMDB movie reviews:
The data set used in this project consists of over 75,000 critic reviews from roughly one hundred publications, plus the full archives of the Internet Movie Database, which were loaded into 43 different tables, some with more than 50 million individual entries. A number of macroeconomic indicators were used as well.
Analysis began with the collection of movie data from Box Office Mojo.
- Python: the main coding language for this project.
- Beautiful Soup: a Python library designed for web scraping. It is particularly strong at parsing HTML.
- NLTK: Natural Language Toolkit, a Python library that supports Natural Language Processing, including stopword lists, stemmers, and lemmatizers.
- sklearn: scikit-learn, a Python library that provides a wide range of machine learning algorithms and utilities.
- Flask: a microframework for Python based on Werkzeug and Jinja2.
- d3.js: Data-Driven Documents, a JavaScript library that helps visualize data interactively and tell stories about it.
- nvd3: a JavaScript library of reusable chart components built on d3.js.
- word2vec: used for learning vector representations of words, called "word embeddings". These representations can be subsequently used in many natural language processing applications and for further research.
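To give a feel for how Beautiful Soup fits into the scraping step, here is a minimal sketch that pulls daily gross figures out of an HTML table. The markup in `SAMPLE_HTML`, the table id `daily`, and the helper `parse_daily_gross` are all hypothetical; the real Box Office Mojo pages are structured differently, and the project's own scraper is not shown in this README.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a scraped page (not a real site layout).
SAMPLE_HTML = """
<table id="daily">
  <tr><th>Date</th><th>Gross</th></tr>
  <tr><td>2014-06-06</td><td>$1,234,567</td></tr>
  <tr><td>2014-06-07</td><td>$2,000,000</td></tr>
</table>
"""


def parse_daily_gross(html):
    """Return a list of (date, gross_in_dollars) tuples from the sample table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="daily")
    rows = []
    if table is None:
        return rows
    for tr in table.find_all("tr"):
        # Header rows use <th>, so they yield no <td> cells and are skipped.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 2:
            date, gross = cells
            rows.append((date, int(gross.replace("$", "").replace(",", ""))))
    return rows


for date, gross in parse_daily_gross(SAMPLE_HTML):
    print(date, gross)
```

The built-in `html.parser` backend is used here so the sketch needs no extra parser package; a real scraper would also fetch pages over HTTP and handle missing or malformed cells.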
A special thank you to:
- rottentomatoes.com for providing the critic reviews
- imdb.com for providing the majority of movie data
- Fellow students and instructors at Galvanize gSchool / Zipfian Academy for providing the tools and background necessary to complete this project.