This is an exploratory study of various machine learning models. Data analysis and preprocessing are largely omitted, since the study focuses on using various sklearn models with their default configurations.
Explores the performance of various regression models. After importing the regression.py module, you can call the function estimator(X, Y) to get the best regression model for the dataset (X, Y). The parameter X is a pandas DataFrame of input features, and Y is a pandas Series of the target variable.
Running the regression.py module as-is gives the following sample output, which uses the classic California Housing Prices dataset:
<class 'sklearn.linear_model._base.LinearRegression'>:
Training MSE : 0.4415847034707834
Validation MSE : 0.46836770999839755
<class 'sklearn.ensemble._bagging.BaggingRegressor'>:
Training MSE : 0.0495235228848702
Validation MSE : 0.2809632860237471
<class 'sklearn.ensemble._forest.RandomForestRegressor'>:
Training MSE : 0.03442765202858801
Validation MSE : 0.25679743116303017
<class 'sklearn.svm._classes.LinearSVR'>:
Training MSE : 2.126105308335331
Validation MSE : 2.0886867115616874
<class 'sklearn.neighbors._regression.KNeighborsRegressor'>:
Training MSE : 0.2680161881869106
Validation MSE : 0.42012969031690883
Best model: <class 'sklearn.ensemble._forest.RandomForestRegressor'>
Test MSE: 0.25317961771776054
In the current setting, this output is reproducible when the California Housing Prices dataset and the chosen random seed (10) are used.
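The comparison behind this output can be approximated as follows. This is a minimal sketch of the pattern estimator(X, Y) describes (fit several default-configuration regressors, compare validation MSE, keep the best), not the actual regression.py code: it uses a synthetic dataset from make_regression instead of California Housing, and only two of the five models shown above.

```python
# Sketch of the model-comparison pattern behind estimator(X, Y).
# Assumptions: synthetic data instead of California Housing, a subset of
# the models, and seed 10 as noted above. Not the actual regression.py code.
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def estimator_sketch(X, Y, seed=10):
    # Hold out a validation set; the split fraction here is an assumption.
    X_train, X_val, y_train, y_val = train_test_split(
        X, Y, test_size=0.2, random_state=seed
    )
    best_model, best_mse = None, float("inf")
    for model in (LinearRegression(), RandomForestRegressor(random_state=seed)):
        model.fit(X_train, y_train)
        val_mse = mean_squared_error(y_val, model.predict(X_val))
        print(f"{type(model)}: Validation MSE : {val_mse}")
        # Keep whichever model has the lowest validation MSE.
        if val_mse < best_mse:
            best_model, best_mse = model, val_mse
    return best_model

# Synthetic stand-in for the (X, Y) DataFrame/Series inputs described above.
X_arr, y_arr = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=10)
best = estimator_sketch(pd.DataFrame(X_arr), pd.Series(y_arr))
print("Best model:", type(best))
```

Fixing random_state in both the split and the stochastic models is what makes the reported numbers reproducible for a given seed.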
Explores the performance of various classification models. After importing the classification.py module, you can call the function estimator(X, Y) to get the best classification model for the dataset (X, Y). The parameter X is a pandas DataFrame of input features, and Y is a pandas Series of the target label. (Multi-class classification is assumed.)
Running the classification.py module as-is gives the following sample output, which uses the classic 20 Newsgroups dataset:
<class 'sklearn.linear_model._logistic.LogisticRegression'>:
Training F1 score : 0.892072528976155
Validation F1 score : 0.8261910103702513
<class 'sklearn.ensemble._bagging.BaggingClassifier'>:
Training F1 score : 0.9883881204315708
Validation F1 score : 0.7191521215859012
<class 'sklearn.ensemble._forest.RandomForestClassifier'>:
Training F1 score : 0.999933668957722
Validation F1 score : 0.8263376379540432
<class 'sklearn.svm._classes.LinearSVC'>:
Training F1 score : 0.9911449973287118
Validation F1 score : 0.9085398890414774
<class 'sklearn.neighbors._classification.KNeighborsClassifier'>:
Training F1 score : 0.7839324365743485
Validation F1 score : 0.627218602829993
Best model: <class 'sklearn.svm._classes.LinearSVC'>
Test F1-score: 0.9111325934890819
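The classification comparison follows the same pattern, with validation F1 score as the selection metric. The sketch below is a hedged approximation, not the actual classification.py code: it uses a synthetic dataset from make_classification instead of 20 Newsgroups, only two of the five models, and weighted F1 averaging (the module's actual averaging choice for the multi-class setting is not specified above).

```python
# Sketch of the classification comparison in estimator(X, Y): fit default
# classifiers, compare validation F1, keep the best. Assumptions: synthetic
# data, a subset of the models, weighted F1 averaging, and seed 10.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

def estimator_sketch(X, Y, seed=10):
    # Stratify so all three classes appear in the validation split.
    X_train, X_val, y_train, y_val = train_test_split(
        X, Y, test_size=0.2, random_state=seed, stratify=Y
    )
    best_model, best_f1 = None, -1.0
    for model in (LinearSVC(), KNeighborsClassifier()):
        model.fit(X_train, y_train)
        # Weighted average handles the multi-class labels.
        val_f1 = f1_score(y_val, model.predict(X_val), average="weighted")
        print(f"{type(model)}: Validation F1 score : {val_f1}")
        # Keep whichever model has the highest validation F1.
        if val_f1 > best_f1:
            best_model, best_f1 = model, val_f1
    return best_model

# Synthetic multi-class stand-in for the (X, Y) inputs described above.
X_arr, y_arr = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=10
)
best = estimator_sketch(pd.DataFrame(X_arr), pd.Series(y_arr))
print("Best model:", type(best))
```

Note the asymmetry with the regression case: MSE is minimized, while F1 score is maximized, so the comparison direction flips.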