Predict sales prices and practice feature engineering, RFs, and gradient boosting
House Prices - Advanced Regression Techniques
The Dataset has With 79 explanatory variables describing (almost) every aspect of residential homes in Ames and Iowa.
- SalePrice: the property's sale price in dollars. This is the target variable that you're trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- Bedroom: Number of bedrooms above basement level
- Kitchen: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: $Value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale
Libraries: sklearn
Matplotlib
pandas
seaborn
NumPy
Scipy
XGBoost
lightgbm
Here, we have analyzed each discrete variable individually and make decisions based on correlation with SalePrice.
I have included Data visualizatio for some these variables, for complete analysis hace alook at : Notebook
Here we are going to analyze correlation of each feature with SalePrice, see skewness for linear and boxcox transformations and apply the better one. If skewness continues high, we will bin the variable into categories or flag (0 and 1). And if there are missing values, we will drop the column.
Transforming this feature into categories
We have transformed several continuous features in the same way, for more details have a look at the Notebook
Here, we will first change the strings by integers, because the variables are ordinal, we can't get dummies unless the correlation with SalePrice is very low. Correlation, missing values are taken into consideration.
Here, we will analyze correlation with the boxplots and missing values. Clustering information when necessary from categories, decisions will be made to drop, flag or keep the column.
In this section, we are going to get dummies for categorical variables, split train and test sets, analyze skewness if yet present for both sets, scale the data (Robust is better for outliers) and, finally, train the model for
- Lasso
- ElasticNet
- Kernel Ridge
- Gradient Boosting Regressor
- XGBoost
- Light Gradient Boosting
We've obtained the tuning parameters from GridSearchCV before training the model we have done following steps
- Analyzing skewness and replacing column with the Boxcox transformation
- Scaling with RobustScaler
- We've used log transformation for Target variable
Lasso
lasso = Lasso(alpha= 0.0005)
Elastic Net
elastic = ElasticNet(alpha=0.0005, l1_ratio=.9)
Kernel Ridge
k_ridge = KernelRidge(alpha=0.1, coef0=2.5, degree=3, gamma=None, kernel='polynomial',kernel_params=None)
Gradient Boost
g_boost = GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse',
init=None, learning_rate=0.01, loss='huber',
max_depth=3, max_features=18, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=3, min_samples_split=5,
min_weight_fraction_leaf=0.0, n_estimators=3300,
n_iter_no_change=None,random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False)
XGBoost
xg_boost = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0,
importance_type='gain', learning_rate=0.01, max_delta_step=0,
max_depth=4, min_child_weight=1, missing=None, n_estimators=3300,
n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=None, subsample=0.3, verbosity=1)
Light GBM
lgbm = LGBMRegressor(bagging_fraction=0.6, bagging_freq=9, bagging_seed=14,
boosting_type='gbdt', class_weight=None, colsample_bytree=0,
feature_fraction=0.1, feature_fraction_seed=1,
importance_type='split', learning_rate=0.01, max_bin=65,
max_depth=4, min_child_samples=20, min_child_weight=0.001,
min_data_in_leaf=5, min_split_gain=0.0, min_sum_hessian_in_leaf=5,
n_estimators=3300, n_jobs=-1, num_leaves=7,
objective='regression', random_state=None, reg_alpha=0.2,
reg_lambda=0.1, silent=True, subsample=1.0,
subsample_for_bin=200000, subsample_freq=0)
stacking = StackingRegressor(regressors=(elastic, g_boost, k_ridge),
meta_regressor = lasso)
param_grid = {}
stack = GridSearchCV(stacking,
param_grid = param_grid,
cv = 10,
scoring = "neg_mean_squared_error",
n_jobs = 5,
verbose = 1)
stack.fit(X_train,Y_train)
best score: -0.01181972103207294
Voting Regressor A voting regressor is an ensemble meta-estimator that fits base regressors each on the whole dataset. It, then, averages the individual predictions to form a final prediction.
voting = VotingRegressor(estimators=[('xgboost', xg_boost),
('lgbm', lgbm),
('stacking', stacking)])
v_param_grid = {} # tuning voting parameter
gsV = GridSearchCV(voting,
param_grid = v_param_grid,
cv = 10,
scoring = "neg_mean_squared_error",
n_jobs = 5,
verbose = 1)
gsV.fit(X_train,Y_train)
best score:-0.011459485092086497
For Hyperparameter Tuning we've used GridSearchCv.
Data Imputation
Handling Outliers
Feature Engineering
Advanced regression techniques
Stacking
Voting
If you have any feedback, please reach out at [email protected]
I am an AI Enthusiast and Data science & ML practitioner