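Ensemble learning combines the predictions of several models into a single, usually more robust, prediction.
First, let's look at the hard voting classifier: each model predicts a class label, and the ensemble outputs the label that gets the most votes.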
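# Import the models and the accuracy metric
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split
# Set seed for reproducibility
SEED = 1
# Instantiate lr
lr = LogisticRegression(random_state=SEED)
# Instantiate knn
knn = KNN(n_neighbors=27)
# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)
# Define the list of (name, classifier) tuples
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:
    # Fit clf to the training set
    clf.fit(X_train, y_train)
    # Predict y_pred
    y_pred = clf.predict(X_test)
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))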
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier
# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)
# Fit vc to the training set
vc.fit(X_train, y_train)
# Evaluate the test set predictions
y_pred = vc.predict(X_test)
# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))
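To make the voting rule concrete, here is a minimal pure-Python sketch of hard voting done by hand. The helper `hard_vote` and the three prediction lists are made up for illustration; they are not part of scikit-learn, and this sketch does not try to reproduce how `VotingClassifier` breaks ties.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return the majority label for each sample.

    predictions_per_model: list of equal-length label lists, one per model.
    """
    voted = []
    # zip(*...) regroups the predictions sample by sample
    for sample_votes in zip(*predictions_per_model):
        # most_common(1) returns [(label, count)] for the most frequent label
        label, _ = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical predictions of three classifiers on four samples
lr_pred = [1, 0, 1, 1]
knn_pred = [1, 1, 0, 1]
dt_pred = [0, 0, 1, 1]

print(hard_vote([lr_pred, knn_pred, dt_pred]))  # [1, 0, 1, 1]
```

With three binary classifiers there is always a strict majority, so ties cannot occur in this toy case.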
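Then, let's see the Bagging method. Bagging is short for Bootstrap Aggregation. Unlike a voting classifier, Bagging does not combine different kinds of models. It trains the same algorithm many times, each time on a different bootstrap sample of the training set (drawn with replacement), and aggregates the predictions. The instances that a given estimator never sees are called out-of-bag (OOB) and can be used to estimate accuracy without a separate validation set.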
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier
# Instantiate dt
dt = DecisionTreeClassifier(max_depth=8,min_samples_leaf=8,random_state=1)
# Instantiate bc (scikit-learn >= 1.2 names this argument `estimator`; older versions use `base_estimator`)
bc = BaggingClassifier(estimator=dt, n_estimators=50, oob_score=True, random_state=1, n_jobs=-1)
# Fit bc to the training set
bc.fit(X_train, y_train)
# Predict test set labels
y_pred = bc.predict(X_test)
# Evaluate test set accuracy
acc_test = accuracy_score(y_test, y_pred)
# Evaluate OOB accuracy
acc_oob = bc.oob_score_
# Print acc_test and acc_oob
print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob))
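The OOB idea is easier to see with the indices themselves. The sketch below is standard-library only, with a made-up training-set size. It draws one bootstrap sample and counts how many instances end up in-bag versus out-of-bag. In expectation about 63% of the instances are drawn at least once, which leaves roughly 37% as OOB.

```python
import random

n_samples = 10_000          # hypothetical training-set size
rng = random.Random(1)      # seeded for reproducibility

# A bootstrap sample draws n_samples indices WITH replacement
bootstrap_idx = rng.choices(range(n_samples), k=n_samples)

# Indices that were never drawn are out-of-bag (OOB) for this estimator
in_bag = set(bootstrap_idx)
oob_idx = [i for i in range(n_samples) if i not in in_bag]

in_bag_fraction = len(in_bag) / n_samples
oob_fraction = len(oob_idx) / n_samples
print(f'unique in-bag: {in_bag_fraction:.3f}, out-of-bag: {oob_fraction:.3f}')
```

Each estimator in the ensemble gets its own bootstrap sample, so each one has a different OOB set to be evaluated on.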
Third, random forests. In Bagging, the base estimator can be any model, whereas a random forest uses only decision trees. As in Bagging, each tree is trained on a different bootstrap sample of the same size as the training set. Unlike Bagging, a random forest adds further randomization: at each node, only a random subset of the features (sampled without replacement) is considered for the split.
# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
# Instantiate rf
rf = RandomForestRegressor(n_estimators=25,
                           random_state=2)
# Fit rf to the training set
rf.fit(X_train, y_train)
# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE
# Predict the test set labels
y_pred = rf.predict(X_test)
# Evaluate the test set RMSE
rmse_test = MSE(y_test,y_pred) ** (1/2)
# Print rmse_test
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))
# Import pandas and matplotlib for the importance plot
import pandas as pd
import matplotlib.pyplot as plt
# Create a pd.Series of feature importances (X_train is assumed to be a DataFrame)
importances = pd.Series(data=rf.feature_importances_,
                        index=X_train.columns)
# Sort importances
importances_sorted = importances.sort_values()
# Draw a horizontal barplot of importances_sorted
importances_sorted.plot(kind='barh', color='lightgreen')
plt.title('Feature Importances')
plt.show()
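For a version that runs without the original dataset, here is a small sketch on synthetic data. The data from `make_regression` and the choice `max_features=3` are assumptions for illustration only. `max_features` controls how many features each split may consider, and the fitted forest exposes normalized importances through `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 8 features, only 3 of which drive the target
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=5.0, random_state=2)

# max_features limits how many features each split may consider
rf_demo = RandomForestRegressor(n_estimators=25, max_features=3,
                                random_state=2)
rf_demo.fit(X, y)

# One importance value per feature, normalized to sum to 1
importances_demo = rf_demo.feature_importances_
print(np.round(importances_demo, 3))
print('Sum of importances:', importances_demo.sum())
```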
Fourth, Boosting. Many weak predictors are trained in sequence, each one trying to correct the errors of its predecessor. Unlike the trees of a random forest, which are trained independently of each other, boosted trees depend on the ones trained before them.
AdaBoost.
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# Import AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier
# Instantiate dt
dt = DecisionTreeClassifier(max_depth=2, random_state=1)
# Instantiate ada (scikit-learn >= 1.2 names this argument `estimator`; older versions use `base_estimator`)
ada = AdaBoostClassifier(estimator=dt, n_estimators=180, random_state=1)
# Fit ada to the training set
ada.fit(X_train,y_train)
# Compute the probabilities of obtaining the positive class
y_pred_proba = ada.predict_proba(X_test)[:,1]
# Import roc_auc_score
from sklearn.metrics import roc_auc_score
# Evaluate test-set roc_auc_score
ada_roc_auc = roc_auc_score(y_test, y_pred_proba)
# Print roc_auc_score
print('ROC AUC score: {:.2f}'.format(ada_roc_auc))
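To see what "paying more attention to errors" means, here is one round of the classic discrete AdaBoost weight update done by hand with NumPy. The labels and predictions are made up, and scikit-learn's implementation may differ in its details, so treat this as a sketch of the idea rather than of the library's internals.

```python
import numpy as np

# Hypothetical true labels and predictions of one weak learner (2 of 8 wrong)
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_hat = np.array([1, 0, 0, 0, 1, 1, 1, 0])

# Start from uniform instance weights
w = np.full(len(y_true), 1 / len(y_true))

miss = y_hat != y_true
err = np.sum(w[miss]) / np.sum(w)        # weighted error rate of the learner
alpha = 0.5 * np.log((1 - err) / err)    # say of this learner in the final vote

# Increase the weights of misclassified instances, decrease the others
w_new = w * np.exp(np.where(miss, alpha, -alpha))
w_new /= w_new.sum()                     # renormalize so the weights sum to 1

print('weighted error:', err)
print('new weights:', np.round(w_new, 4))
```

After renormalizing, the misclassified instances carry half of the total weight, so the next learner is pushed to get them right.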
Gradient Boosting. As in AdaBoost, estimators are trained in sequence, each trying to correct the errors of its predecessor. While AdaBoost changes the weights of the training instances at each step, Gradient Boosting leaves the weights alone and instead trains each new estimator on the residual errors of the current ensemble.
# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor
# Instantiate gb
gb = GradientBoostingRegressor(max_depth=4,
                               n_estimators=200,
                               random_state=2)
# Fit gb to the training set
gb.fit(X_train,y_train)
# Predict test set labels
y_pred = gb.predict(X_test)
# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE
# Compute MSE
mse_test = MSE(y_test,y_pred)
# Compute RMSE
rmse_test = mse_test ** 0.5
# Print RMSE
print('Test set RMSE of gb: {:.3f}'.format(rmse_test))
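The residual-fitting loop can be sketched by hand for squared-error loss. Everything here is an assumption for illustration: the synthetic sine data, the learning rate of 0.1, and the 50 shallow trees. It is not how `GradientBoostingRegressor` is implemented internally, just the same idea in a few lines.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1                        # shrinkage applied to every tree
prediction = np.full_like(y, y.mean())     # initial model: predict the mean
mse_start = np.mean((y - prediction) ** 2)

trees = []
for _ in range(50):
    residuals = y - prediction             # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=2)
    tree.fit(X, residuals)                 # the residuals are the labels
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

mse_end = np.mean((y - prediction) ** 2)
print(f'Training MSE: {mse_start:.4f} -> {mse_end:.4f}')
```

Each tree only has to explain what the ensemble still gets wrong, which is why the training error keeps shrinking as trees are added.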
Stochastic Gradient Boosting. Each tree is trained on a random subset of the rows of the training data.
- The instances (typically 40%-80% of the training set) are sampled without replacement. Plain Gradient Boosting, by contrast, trains every tree on all rows.
- Features are sampled (without replacement) when choosing split points, similar to random forests. Plain Gradient Boosting considers every feature at each split.
- Result: further ensemble diversity.
- Effect: more variance among the individual trees of the ensemble.
# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor
# Instantiate sgbr
sgbr = GradientBoostingRegressor(max_depth=4,
                                 subsample=0.9,
                                 max_features=0.75,
                                 n_estimators=200,
                                 random_state=2)
# Fit sgbr to the training set
sgbr.fit(X_train,y_train)
# Predict test set labels
y_pred = sgbr.predict(X_test)
# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE
# Compute test set MSE
mse_test = MSE(y_test,y_pred)
# Compute test set RMSE
rmse_test = mse_test ** 0.5
# Print rmse_test
print('Test set RMSE of sgbr: {:.3f}'.format(rmse_test))
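The two kinds of sampling can be sketched with NumPy. The data shape and the fractions below are assumptions for illustration, and the variable names only mirror the `subsample` and `max_features` arguments. The key point is `replace=False`: no row or feature is picked twice.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_features = 1000, 8      # hypothetical data shape

subsample = 0.8        # fraction of rows used for one tree
max_features = 0.75    # fraction of features considered at one split

# Rows for one tree: drawn WITHOUT replacement, so there are no duplicates
row_idx = rng.choice(n_samples, size=int(subsample * n_samples), replace=False)

# Candidate features for one split: also drawn without replacement
feat_idx = rng.choice(n_features, size=int(max_features * n_features), replace=False)

print('rows used:', len(row_idx), 'unique rows:', len(np.unique(row_idx)))
print('features considered at this split:', np.sort(feat_idx))
```

Compare this with Bagging, where rows are drawn with replacement and duplicates are expected.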
General approaches to hyperparameter tuning:
- Grid Search
- Random Search
- Bayesian Optimization
- Genetic Algorithms
- ...
We first discuss grid search.
Grid Search:
- Manually set a grid of discrete hyperparameter values.
- Set a metric for scoring model performance.
- Search exhaustively through the grid.
- For each set of hyperparameters, evaluate each model's CV score.
- The optimal hyperparameters are those of the model achieving the best CV score.
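The steps above can be written out as a plain loop, which is roughly what an exhaustive grid search does. This is a sketch on synthetic data from `make_classification`; the grid values and the names `grid` and `results` are only illustrative.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# The grid: every combination of these values is tried
grid = {'max_depth': [2, 3, 4],
        'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]}

results = []
for max_depth, min_leaf in product(grid['max_depth'], grid['min_samples_leaf']):
    model = DecisionTreeClassifier(max_depth=max_depth,
                                   min_samples_leaf=min_leaf,
                                   random_state=1)
    # Mean 5-fold CV score for this hyperparameter combination
    cv_score = cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()
    results.append((cv_score, max_depth, min_leaf))

# The best combination is the one with the highest mean CV score
best_score, best_depth, best_leaf = max(results)
print(f'{len(results)} combinations tried')
print(f'best: max_depth={best_depth}, min_samples_leaf={best_leaf}, '
      f'CV ROC AUC={best_score:.3f}')
```

With 3 x 4 = 12 combinations and 5 folds, this fits 60 models, which shows how quickly the cost of grid search grows with the size of the grid.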
Let's tune a CART.
# dt is a DecisionTreeClassifier
# Check its hyperparameters (get_params is a method, so it needs parentheses)
dt.get_params()
# Define params_dt
params_dt = {'max_depth': [2, 3, 4],
             'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]
             }
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV
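# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='roc_auc',
                       cv=5,
                       n_jobs=-1)
# Fit grid_dt to the training set; best_estimator_ only exists after fitting
grid_dt.fit(X_train, y_train)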
# Import roc_auc_score from sklearn.metrics
from sklearn.metrics import roc_auc_score
# Extract the best estimator
best_model = grid_dt.best_estimator_
# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]
# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test,y_pred_proba)
# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))
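Let's tune an RF.
# Check params (get_params is a method, so it needs parentheses)
rf.get_params()
# Define the dictionary 'params_rf'
# ('auto' was removed from max_features in scikit-learn 1.3; for a regressor it meant all features, i.e. 1.0)
params_rf = {'n_estimators': [100, 350, 500],
             'max_features': ['log2', 1.0, 'sqrt'],
             'min_samples_leaf': [2, 10, 30]
             }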
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV
# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3,
                       verbose=1,
                       n_jobs=-1)
# Fit grid_rf to the training set
grid_rf.fit(X_train,y_train)
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE
# Extract the best estimator
best_model = grid_rf.best_estimator_
# Predict test set labels
y_pred = best_model.predict(X_test)
# Compute rmse_test
rmse_test = MSE(y_test,y_pred) ** (1/2)
# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test))
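One detail that is easy to miss: with `scoring='neg_mean_squared_error'`, the scores reported by the search are negated MSE values, because scikit-learn always maximizes the score. The sketch below uses synthetic data and a deliberately tiny grid (both are assumptions for illustration) to show how to turn `best_score_` back into a cross-validated RMSE.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=2)

search = GridSearchCV(estimator=RandomForestRegressor(n_estimators=20, random_state=2),
                      param_grid={'min_samples_leaf': [2, 10, 30]},
                      scoring='neg_mean_squared_error',
                      cv=3)
search.fit(X, y)

# best_score_ is the NEGATED mean squared error, so it is <= 0
cv_rmse = (-search.best_score_) ** 0.5
print('Best params:', search.best_params_)
print('CV RMSE of best model: {:.3f}'.format(cv_rmse))
```

This CV RMSE is computed on the training folds only, so it complements the test-set RMSE instead of replacing it.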
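Example from a supervised learning task that uses both text and numeric data:
# Imports for the pipeline (scikit-learn >= 0.22 assumed)
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import PolynomialFeatures, MaxAbsScaler
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
# get_numeric_data, get_text_data, TOKENS_ALPHANUMERIC and chi_k are defined elsewhere in the task
pl = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                # SimpleImputer replaces the Imputer class removed in scikit-learn 0.22
                ('imputer', SimpleImputer())
            ])),
            ('text_features', Pipeline([
                ('selector', get_text_data),
                # alternate_sign=False replaces the removed non_negative=True option
                ('vectorizer', HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC, alternate_sign=False, norm=None, binary=False,
                                                 ngram_range=(1, 2))),
                ('dim_red', SelectKBest(chi2, k=chi_k))
            ]))
        ]
    )),
    ('int', PolynomialFeatures(degree=2)),
    ('scale', MaxAbsScaler()),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])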
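Since the pipeline above depends on objects defined elsewhere, here is a stripped-down version that runs on its own. The toy DataFrame, its column names, the labels, and the two selector lambdas are all made up for illustration. The interaction and feature-selection steps are left out to keep it short.

```python
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler

# Toy data: one numeric column (with a missing value) and one text column
df = pd.DataFrame({
    'amount': [10.0, None, 250.0, 300.0, 12.0, 280.0],
    'text': ['pens and paper', 'paper clips', 'bus lease',
             'bus fuel', 'pens', 'bus repair'],
})
labels = [0, 0, 1, 1, 0, 1]

# Hypothetical selectors that pull out each column type
get_numeric_data = FunctionTransformer(lambda d: d[['amount']], validate=False)
get_text_data = FunctionTransformer(lambda d: d['text'], validate=False)

pl_demo = Pipeline([
    ('union', FeatureUnion(transformer_list=[
        ('numeric_features', Pipeline([
            ('selector', get_numeric_data),
            ('imputer', SimpleImputer()),      # fills the missing amount with the mean
        ])),
        ('text_features', Pipeline([
            ('selector', get_text_data),
            ('vectorizer', HashingVectorizer(n_features=32, alternate_sign=False,
                                             norm=None)),
        ])),
    ])),
    ('scale', MaxAbsScaler()),
    ('clf', LogisticRegression()),
])

pl_demo.fit(df, labels)
preds = pl_demo.predict(df)
print(preds)
```

The `FeatureUnion` stacks the imputed numeric column next to the 32 hashed text features, so everything after it sees a single feature matrix.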