🎬 IMDB 5000 Movie Dataset — Data Cleaning & Exploratory Analysis

This project focuses on cleaning, transforming, and analyzing the IMDB 5000 Movie Dataset to extract meaningful insights about the movie industry — including relationships between budget, revenue, genres, and ratings.

📊 Project Overview

This notebook demonstrates the complete data wrangling and EDA (Exploratory Data Analysis) process on a real-world dataset. It showcases skills in Pandas, Matplotlib, and Seaborn, covering data cleaning, feature understanding, and visualization.

Key Steps

• Data Cleaning
– Handling missing values
– Detecting and removing duplicates
– Converting data types
– Fixing inconsistent or invalid entries
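
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`movie_title`, `director_name`, `budget`, `gross`) are assumptions about the CSV, the rows are made-up toy data, and `clean_movies` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real CSV; names and values are illustrative only.
raw = pd.DataFrame({
    "movie_title": ["Film A ", "Film A ", "Film B", "Film C", "Film D", "Film E"],
    "director_name": ["Dir X", "Dir X", None, "Dir Y", "Dir Z", "Dir Z"],
    "budget": ["1000000", "1000000", "250000", "not available", "-5", "500000"],
    "gross": [3000000.0, 3000000.0, 400000.0, 900000.0, 100000.0, np.nan],
})


def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper covering the four cleaning steps listed above."""
    out = df.copy()
    # Fix inconsistent entries: strip stray whitespace from titles.
    out["movie_title"] = out["movie_title"].str.strip()
    # Convert data types: non-numeric budget strings become NaN.
    out["budget"] = pd.to_numeric(out["budget"], errors="coerce")
    # Fix invalid entries: treat non-positive budgets as missing.
    out.loc[out["budget"] <= 0, "budget"] = np.nan
    # Detect and remove exact duplicate rows.
    out = out.drop_duplicates()
    # Handle missing values: label unknown directors,
    # then drop rows that lack budget or gross.
    out["director_name"] = out["director_name"].fillna("Unknown")
    out = out.dropna(subset=["budget", "gross"])
    return out.reset_index(drop=True)


cleaned = clean_movies(raw)
print(cleaned)
```

Whether to drop or impute rows with missing budget or gross is a judgment call; dropping is used here only to keep the sketch short.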

• Exploratory Analysis
– Descriptive statistics (mean, median, mode)
– Genre-wise and director-wise performance
– Correlation analysis between key numerical features
– Identifying high-grossing vs. low-grossing films
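
A small sketch of how these analyses might look in Pandas. The column names and the pipe-separated `genres` format are assumptions about the dataset, the rows are toy values, and splitting high-grossing from low-grossing films at the median is one possible threshold, not necessarily the one used in the notebook.

```python
import pandas as pd

# Toy data; real column names and values may differ.
movies = pd.DataFrame({
    "director_name": ["Dir X", "Dir X", "Dir Y", "Dir Z"],
    "genres": ["Action|Drama", "Action", "Drama", "Comedy"],
    "budget": [100.0, 200.0, 50.0, 80.0],
    "gross": [300.0, 500.0, 40.0, 120.0],
    "imdb_score": [7.0, 6.5, 8.0, 5.5],
})

# Descriptive statistics for the rating column.
score_stats = movies["imdb_score"].agg(["mean", "median"])
score_mode = movies["imdb_score"].mode()  # can hold several values if tied

# Director-wise performance: average gross per director, highest first.
director_gross = (
    movies.groupby("director_name")["gross"].mean().sort_values(ascending=False)
)

# Genre-wise performance: one row per (movie, genre) pair, then average.
by_genre = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")
genre_gross = by_genre.groupby("genre")["gross"].mean()

# Correlation between key numerical features.
corr = movies[["budget", "gross", "imdb_score"]].corr()

# High-grossing vs. low-grossing, split at the median gross.
movies["high_grossing"] = movies["gross"] > movies["gross"].median()

print(score_stats, director_gross, genre_gross, corr, sep="\n\n")
```

Note that exploding genres counts a multi-genre film once per genre, so genre averages are not independent of each other.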

• Visualization
– Revenue vs. Budget scatterplots
– IMDb rating distributions
– Top directors and genres by average revenue
– Correlation heatmaps for key metrics
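
As a rough illustration of two of these plots, the sketch below draws a revenue vs. budget scatterplot and a correlation heatmap. It uses plain Matplotlib (`imshow`) for the heatmap to keep dependencies small; the notebook may well use Seaborn instead. Column names and data are assumed toy values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; real column names and values may differ.
movies = pd.DataFrame({
    "budget": [100.0, 200.0, 50.0, 80.0],
    "gross": [300.0, 500.0, 40.0, 120.0],
    "imdb_score": [7.0, 6.5, 8.0, 5.5],
})

fig, (ax_scatter, ax_heat) = plt.subplots(1, 2, figsize=(11, 4))

# Revenue vs. Budget scatterplot.
ax_scatter.scatter(movies["budget"], movies["gross"])
ax_scatter.set_xlabel("Budget")
ax_scatter.set_ylabel("Gross revenue")
ax_scatter.set_title("Revenue vs. Budget")

# Correlation heatmap for the numeric columns.
corr = movies[["budget", "gross", "imdb_score"]].corr()
labels = list(corr.columns)
image = ax_heat.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax_heat.set_xticks(range(len(labels)))
ax_heat.set_xticklabels(labels, rotation=45, ha="right")
ax_heat.set_yticks(range(len(labels)))
ax_heat.set_yticklabels(labels)
ax_heat.set_title("Correlation heatmap")
fig.colorbar(image, ax=ax_heat)

fig.tight_layout()
```

Fixing the color scale to the range -1 to 1 keeps heatmaps comparable across different column subsets.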

🧠 Skills Demonstrated

• Python: Data structures, logic, functions, and lambda expressions
• Pandas: Cleaning, transforming, merging, and aggregating data
• Matplotlib & Seaborn: Plotting trends, distributions, and correlations
• Analytical Thinking: Asking data-driven questions and validating hypotheses

📦 Requirements

Install required libraries:
pandas, numpy, matplotlib, seaborn, jupyter
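
The libraries can typically be installed with `pip install pandas numpy matplotlib seaborn jupyter`. As an optional sanity check, the sketch below reports which of them are not importable in the current environment; `missing_packages` is a hypothetical helper, and it assumes each package's import name matches its install name.

```python
import importlib.util

REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "jupyter"]


def missing_packages(names):
    """Return the names that cannot be located as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing: " + ", ".join(missing))
    print("Try: pip install " + " ".join(missing))
else:
    print("All required libraries are importable.")
```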

📂 Dataset

Name: IMDB 5000 Movie Dataset
Source: Kaggle — IMDB 5000 Movie Dataset
Format: CSV
Contains information on roughly 5,000 movies, including:
– Director names
– Actor details
– Budget and gross revenue
– IMDb score and genres
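
A first look at the CSV might go as sketched below. The in-memory sample stands in for the downloaded file so the snippet runs on its own; the column names are assumptions, and `inspect_movies` is a hypothetical helper. With the real file, pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV; columns and rows are illustrative only.
SAMPLE_CSV = io.StringIO(
    "director_name,budget,gross,imdb_score,genres\n"
    "Dir X,1000000,3000000,7.1,Action|Drama\n"
    "Dir Y,,400000,6.3,Comedy\n"
    "Dir Z,250000,,5.8,Drama\n"
)


def inspect_movies(source):
    """Load the CSV and count missing values per column."""
    df = pd.read_csv(source)
    return df, df.isna().sum()


movies, missing_counts = inspect_movies(SAMPLE_CSV)
print(movies.shape)
print(movies.dtypes)
print(missing_counts)
```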

🚀 How to Use

1. Clone this repository.
2. Open the Jupyter Notebook file named:
   IMDB_5000_Movie_Dataset_Data_Cleaning_&_Exploratory_Analysis_Practice.ipynb
3. Run all cells sequentially to reproduce the analysis.

📊 Example Outputs

• Correlation heatmap showing relationships between budget, gross, and rating
• Bar charts of top directors by average revenue
• Genre-based performance visualizations

[Figure: Revenue vs. Budget plot]
[Figure: Top Directors chart]

🔮 Future Work

Potential extensions for this project:
• Perform feature engineering (extract release year, duration bins, etc.)
• Apply machine learning to predict movie revenue or IMDb rating
• Use deep learning (DL) models for text-based features such as plot keywords
• Build an interactive dashboard using Streamlit or Plotly Dash
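
As one possible starting point for the feature-engineering item, the sketch below derives an integer release year and duration bins. The column names `title_year` and `duration`, the bin edges, and the labels are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy data; real column names and values may differ.
movies = pd.DataFrame({
    "title_year": [1999.0, 2010.0, None, 2015.0],
    "duration": [85.0, 120.0, 95.0, 170.0],
})

# Release year as a nullable integer, so missing years stay missing.
movies["release_year"] = movies["title_year"].astype("Int64")

# Duration bins in minutes: (0, 90], (90, 150], (150, inf).
movies["duration_bin"] = pd.cut(
    movies["duration"],
    bins=[0, 90, 150, float("inf")],
    labels=["short", "medium", "long"],
)

print(movies)
```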

🏷️ License

This project is open-source and available under the MIT License.

⭐ If you found this project helpful, please consider giving it a star on GitHub!
