This project predicts house prices using a machine learning model trained on real estate data. The dataset contains various features such as location, number of rooms, population, and median income to help predict the median house value.
If you haven't already installed Jupyter Lab, use:
pip install jupyterlab
git clone https://github.com/BarraHarrison/House-Price-Prediction.git
cd House-Price-Prediction
jupyter lab
Ensure you have the required libraries installed:
pip install pandas numpy matplotlib seaborn scikit-learn
The project starts by importing key libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
The dataset is loaded using:
data = pd.read_csv("housing.csv")
- Checking dataset structure:
data.info()
data.describe()
- Checking for missing values:
data.isnull().sum()
- Visualizing relationships:
- Histograms
- Correlation heatmaps
- Scatter plots
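A rough sketch of what these visualizations can look like (column names such as median_income follow the standard California housing dataset used here; numeric_only=True may be required on recent pandas versions):

data.hist(bins=50, figsize=(12, 8))   # histograms of every numeric feature
plt.show()

plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap="coolwarm")   # correlation heatmap
plt.show()

plt.figure()
plt.scatter(data["median_income"], data["median_house_value"], alpha=0.2)   # income vs. house value
plt.xlabel("median_income")
plt.ylabel("median_house_value")
plt.show()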
- Handling missing values by filling with median:
data.fillna(data.median(numeric_only=True), inplace=True)
- Encoding categorical variables:
data = pd.get_dummies(data, columns=['ocean_proximity'])
from sklearn.model_selection import train_test_split
X = data.drop("median_house_value", axis=1)
y = data["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor()
forest.fit(X_train, y_train)
- R² Score
forest.score(X_test, y_test)
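Beyond R², common regression error metrics can be computed on the test set; a minimal sketch (not part of the original notebook):

from sklearn.metrics import mean_squared_error, mean_absolute_error

predictions = forest.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, predictions))   # root mean squared error, in the target's units
mae = mean_absolute_error(y_test, predictions)            # mean absolute error
print(f"RMSE: {rmse:,.0f}  MAE: {mae:,.0f}")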
- Hyperparameter Tuning Using GridSearchCV
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [3, 10, 30],
    "max_features": [2, 4, 6, 8],
}

grid_search = GridSearchCV(forest, param_grid, cv=5, scoring="neg_mean_squared_error", return_train_score=True)
grid_search.fit(X_train, y_train)
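Once the search finishes, the best hyperparameters and the refit estimator can be inspected, for example:

print(grid_search.best_params_)             # best combination found by the search
best_forest = grid_search.best_estimator_   # refit on the full training set (refit=True by default)
best_forest.score(X_test, y_test)           # R² of the tuned model on the test set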
- Test additional regression models (XGBoost, Gradient Boosting, etc.)
- Feature scaling for better performance
- Deploy model using Flask or FastAPI
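On the feature scaling point above: scaling changes little for tree ensembles, but it matters for linear or distance-based models, so one possible approach (a sketch, not something this project implements; the choice of Ridge is purely illustrative) is to bundle a scaler and a regressor in a scikit-learn Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

scaled_model = Pipeline([
    ("scaler", StandardScaler()),   # standardize features to zero mean, unit variance
    ("regressor", Ridge()),         # linear model that benefits from scaling
])
scaled_model.fit(X_train, y_train)
scaled_model.score(X_test, y_test)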
This project demonstrates data exploration, feature engineering, and machine learning modeling using Jupyter Lab. The Random Forest model provides a strong baseline, and further optimizations can improve accuracy.
During this project, I used several commands for the first time and found similarities with SQL operations used in server-side programming:
- `data.dropna()` → This removes missing values from the dataset, similar to `WHERE column IS NOT NULL` in SQL.
- `train_test_split()` → This splits the dataset into training and testing subsets, similar to using `LIMIT` and `OFFSET` in SQL queries to segment data.
- `plt.figure()` → This sets up the figure for visualization, which is akin to structuring query results before displaying them in web applications.
- `sns.heatmap()` → This visualizes correlations between variables, much like using SQL aggregate functions and `GROUP BY` to analyze relationships between different fields.
- `pd.get_dummies()` → This encodes categorical variables into a numerical format, similar to using `CASE WHEN` or `JOIN` operations in SQL to transform categorical data into structured numeric values.
This experience deepened my understanding of how data transformations in Python mirror SQL operations used in backend web development.