This project focuses on cleaning, transforming, and analyzing the IMDB 5000 Movie Dataset to extract meaningful insights about the movie industry, including relationships between budget, revenue, genres, and ratings.
This notebook demonstrates the complete data wrangling and EDA (Exploratory Data Analysis) process on a real-world dataset. It showcases skills in Pandas, Matplotlib, and Seaborn, covering data cleaning, feature understanding, and visualization.
• Data Cleaning
  - Handling missing values
  - Detecting and removing duplicates
  - Converting data types
  - Fixing inconsistent or invalid entries
• Exploratory Analysis
  - Descriptive statistics (mean, median, mode)
  - Genre-wise and director-wise performance
  - Correlation analysis between key numerical features
  - Identifying high-grossing vs. low-grossing films
• Visualization
  - Revenue vs. Budget scatterplots
  - IMDb rating distributions
  - Top directors and genres by average revenue
  - Correlation heatmaps for key metrics
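The cleaning and exploratory steps listed above can be sketched roughly as follows. This is a minimal illustration on a tiny synthetic table, not the notebook's actual code: the column names (`movie_title`, `director_name`, `genres`, `budget`, `gross`, `imdb_score`), the pipe-separated genre format, and the helper `clean_movies` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the real CSV; column names are assumed.
raw = pd.DataFrame({
    "movie_title": ["Alpha ", "Alpha ", "Beta", "Gamma", "Delta", "Epsilon"],
    "director_name": ["A. One", "A. One", "B. Two", None, "B. Two", None],
    "genres": ["Action|Drama", "Action|Drama", "Comedy", "Drama", "Action", "Drama"],
    "budget": ["100", "100", "50", "n/a", "20", "10"],
    "gross": [300.0, 300.0, 40.0, 10.0, np.nan, 30.0],
    "imdb_score": [7.5, 7.5, 6.0, 5.5, 8.0, 7.0],
})


def clean_movies(df):
    """Hypothetical helper: apply the four cleaning steps in order."""
    out = df.copy()
    # Fix inconsistent entries: strip stray whitespace from titles.
    out["movie_title"] = out["movie_title"].str.strip()
    # Convert data types: invalid budget strings such as "n/a" become NaN.
    out["budget"] = pd.to_numeric(out["budget"], errors="coerce")
    # Remove exact duplicate rows.
    out = out.drop_duplicates()
    # Handle missing values: label unknown directors, drop rows
    # that lack the numeric fields needed for the analysis.
    out["director_name"] = out["director_name"].fillna("Unknown")
    out = out.dropna(subset=["budget", "gross"])
    return out.reset_index(drop=True)


movies = clean_movies(raw)

# Descriptive statistics for the key numeric columns.
stats = movies[["budget", "gross", "imdb_score"]].describe()

# Director-wise performance: average gross, highest first.
by_director = (
    movies.groupby("director_name")["gross"].mean().sort_values(ascending=False)
)

# Genre-wise performance: one row per (movie, genre) pair, then average.
genre_rows = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")
by_genre = genre_rows.groupby("genre")["gross"].mean()

print(stats)
print(by_director)
print(by_genre)
```

Whether to drop or impute rows with missing `budget` or `gross` is a judgment call; dropping is used here only because it keeps the sketch short.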
• Python: Data structures, logic, functions, and lambda expressions
• Pandas: Cleaning, transforming, merging, and aggregating data
• Matplotlib & Seaborn: Plotting trends, distributions, and correlations
• Analytical Thinking: Asking data-driven questions and validating hypotheses
Install required libraries:
pandas, numpy, matplotlib, seaborn, jupyter
Name: IMDB 5000 Movie Dataset
Source: Kaggle (IMDB 5000 Movie Dataset)
Format: CSV
Contains information about 5,000 movies, including:
  - Director names
  - Actor details
  - Budget and gross revenue
  - IMDb score and genres
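As a rough sketch of a first look at the CSV, the snippet below loads the data and counts duplicates and missing values. The helper name `load_movies`, the column names, and the file name mentioned in the comment are assumptions for illustration; an in-memory sample stands in for the real download so the example runs on its own.

```python
import io

import pandas as pd


def load_movies(source):
    """Hypothetical helper: read the CSV and summarize basic data quality."""
    df = pd.read_csv(source)
    summary = {
        "rows": len(df),
        "columns": df.shape[1],
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_by_column": df.isna().sum().to_dict(),
    }
    return df, summary


# In-memory sample; with the real download you would pass a file path
# instead (for example "movie_metadata.csv", an assumed file name).
sample_csv = io.StringIO(
    "director_name,budget,gross,imdb_score,genres\n"
    "A. One,100,300,7.5,Action|Drama\n"
    "A. One,100,300,7.5,Action|Drama\n"
    ",50,,6.0,Comedy\n"
)

df, summary = load_movies(sample_csv)
print(summary)
```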
- Clone this repository
- Open the Jupyter Notebook file named:
  IMDB_5000_Movie_Dataset_Data_Cleaning_&_Exploratory_Analysis_Practice.ipynb
- Run all cells sequentially to reproduce the analysis
• Correlation heatmap showing relationships between budget, gross, and rating
• Bar charts of top directors by average revenue
• Genre-based performance visualizations
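A correlation heatmap like the one described above could be produced along these lines. The data is synthetic and the column names (`budget`, `gross`, `imdb_score`) are assumed, so the numbers do not reflect the real dataset; the `Agg` backend line is there only so the sketch runs without a display and can be dropped inside Jupyter.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Synthetic values for illustration only; column names are assumed.
movies = pd.DataFrame({
    "budget": [100, 50, 20, 10, 80],
    "gross": [300, 40, 60, 30, 200],
    "imdb_score": [7.5, 6.0, 8.0, 7.0, 6.5],
})

# Pairwise Pearson correlations between the numeric columns.
corr = movies.corr()

fig, ax = plt.subplots(figsize=(5, 4))
# Fix the color scale to [-1, 1] so colors are comparable across plots.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation between budget, gross, and IMDb score")
fig.tight_layout()
```

In a notebook the figure renders when the cell finishes; in a script, `fig.savefig(...)` or `plt.show()` would be needed.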
[Figure: Revenue vs. Budget plot]
[Figure: Top Directors chart]
Potential extensions for this project:
• Perform feature engineering (extract release year, duration bins, etc.)
• Apply machine learning to predict movie revenue or IMDb rating
• Use deep learning (DL) models for text-based features such as plot keywords
• Build an interactive dashboard using Streamlit or Plotly Dash
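As one possible starting point for the feature-engineering idea above, the sketch below derives a release year and duration bins. The column names (`title_year`, `duration`), the bin edges, and the bin labels are assumptions chosen for illustration and are not taken from the notebook.

```python
import pandas as pd

# Synthetic rows; `title_year` and `duration` (minutes) are assumed column names.
movies = pd.DataFrame({
    "title_year": [1999.0, 2010.0, None, 2015.0],
    "duration": [85, 110, 150, 200],
})

features = movies.copy()

# Release year as a nullable integer, so a missing year stays missing
# instead of forcing the whole column to float.
features["release_year"] = features["title_year"].astype("Int64")

# Duration bins with hypothetical edges; each interval is closed on the right.
features["duration_bin"] = pd.cut(
    features["duration"],
    bins=[0, 90, 120, 180, float("inf")],
    labels=["short", "standard", "long", "epic"],
)

print(features)
```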
This project is open-source and available under the MIT License.
⭐ If you found this project helpful, please consider giving it a star on GitHub!