This dataset is a collection of movie data that contains variables such as budget, genres, homepage, original language, popularity, production countries, release date and spoken languages.
All the required packages, along with their versions, are listed in requirements.txt. They can be installed with the following command:
pip3 install -r requirements.txt
Since the genre column contains a string representation of a list, regex has been used to extract the genres from the text and put the results into a list. I chose lists because a movie can have multiple genres. In addition, a scikit-learn compatible custom transformer has been developed for this and for the other variables that follow. CountVectorizer was then used to convert the movie genres to one-hot encoded values. Result of the genre cleaner:
array([list(['Comedy']), list(['Comedy', 'Drama', 'Family', 'Romance']),
list(['Drama']), ...,
list(['Crime', 'Action', 'Mystery', 'Thriller']),
list(['Comedy', 'Romance']),
list(['Thriller', 'Action', 'Mystery'])], dtype=object)
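The genre step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the class name `GenreCleaner`, the regex, and the raw string layout (`[{'id': ..., 'name': '...'}]`) are assumptions made for the example.

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer

# Assumed raw format: "[{'id': 35, 'name': 'Comedy'}, ...]"
NAME_PATTERN = re.compile(r"'name':\s*'([^']*)'")


class GenreCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: turns each raw genre string into a list of names."""

    def fit(self, X, y=None):
        # Nothing to learn; present only for scikit-learn compatibility.
        return self

    def transform(self, X):
        # Missing values (non-strings) become an empty genre list.
        return [NAME_PATTERN.findall(s) if isinstance(s, str) else [] for s in X]


def identity(tokens):
    # Each document is already a list of genre names, so no tokenizing is needed.
    return tokens


raw = [
    "[{'id': 35, 'name': 'Comedy'}]",
    "[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]",
    None,
]

genres = GenreCleaner().fit_transform(raw)

# binary=True gives 0/1 indicators, one column per genre.
vectorizer = CountVectorizer(analyzer=identity, binary=True)
one_hot = vectorizer.fit_transform(genres).toarray()

# Column order follows the learned vocabulary indices.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(columns)
print(one_hot)
```

Passing a callable as `analyzer` makes CountVectorizer treat each list element as one token, so multi-word genres such as "Science Fiction" would stay intact.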
The release dates in the original dataset use a two-digit year format, which is ambiguous because the data contains years from both the 20th and 21st centuries. I converted the years to four-digit format and developed my model based on it. In this way, my model can be used for any year in the 21st century. Moreover, I extracted the year and month of the release dates to be used as features in my models.
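One possible way to resolve the two-digit years is a pivot rule, sketched below. The `m/d/yy` input format, the function name `parse_release_date`, and the pivot value of 19 are assumptions for illustration; the notebook may use a different rule.

```python
from datetime import date


def parse_release_date(text, pivot=19):
    """Parse an 'm/d/yy' string; two-digit years <= pivot map to 20xx, others to 19xx.

    The pivot is a hypothetical cutoff, not a value taken from the notebook.
    """
    month, day, year = (int(part) for part in text.split("/"))
    if year < 100:
        year += 2000 if year <= pivot else 1900
    return date(year, month, day)


released = parse_release_date("2/20/15")

# Year and month extracted as separate model features.
features = {"release_year": released.year, "release_month": released.month}
print(features)
```

An explicit pivot avoids relying on the default century rule of `strptime`'s `%y`, which maps 69 to 99 to the 1900s and 00 to 68 to the 2000s.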
Since this dataset needed a lot of data cleaning and feature engineering, please consult the Jupyter notebook for details of the work done for the other variables.
The distribution of movie revenues:
Revenue vs. original language:
Revenue vs. release month:
Features correlation with target:
The evaluation metric prescribed by the Kaggle competition is the root mean squared logarithmic error (RMSLE), i.e. the square root of the mean squared log error, which is already available in scikit-learn.
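A minimal sketch of computing this metric with scikit-learn is shown below. The helper name `rmsle` and the sample revenue values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error


def rmsle(y_true, y_pred):
    # Square root of scikit-learn's mean squared log error.
    return float(np.sqrt(mean_squared_log_error(y_true, y_pred)))


# Illustrative values only, not taken from the dataset.
y_true = np.array([1_000.0, 50_000.0, 2_500_000.0])
y_pred = np.array([1_200.0, 40_000.0, 3_000_000.0])

score = rmsle(y_true, y_pred)

# Same quantity written out by hand, using log(1 + x) as the metric does.
manual = float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))
print(score, manual)
```

Because the metric works on log(1 + x), it penalizes relative rather than absolute errors, which suits revenues that span several orders of magnitude. Both inputs must be non-negative.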
Various models have been tried, including Random Forest, Gradient Boosting and XGBoost. Random Forest gave the best result. The best metric value I could reach is 2.39 on the test set.