Migrate and Improve Feature Engineering #667

Open
wants to merge 1 commit into
base: master
1 change: 1 addition & 0 deletions .gitignore
@@ -44,6 +44,7 @@ slides/supervised-regression/slides-*.pdf
slides/trees/slides-*.pdf
slides/tuning/slides-*.pdf
slides/boosting/slides-*.pdf
slides/feature-engineering/slides-*.pdf
# vim swap files
*.swp
# used for atom editor
1 change: 1 addition & 0 deletions slides/feature-engineering/Makefile
@@ -0,0 +1 @@
include ../tex.mk
25 changes: 25 additions & 0 deletions slides/feature-engineering/chapter-order.tex
@@ -0,0 +1,25 @@
%Suggested order of slides
% slides-feature-eng-intro.tex
% slides-feature-eng-trafo.tex
% slides-feature-eng-imputation.tex
% slides-feature-eng-functional-data.tex
% slides-feature-eng-outlier.tex
% slides-feature-eng-practical.tex

\subsection{Introduction}
\includepdf[pages=-]{../slides-pdf/slides-feature-eng-intro.pdf}

\subsection{Feature and Target Transformation}
\includepdf[pages=-]{../slides-pdf/slides-feature-eng-trafo.pdf}

\subsection{Imputation}
\includepdf[pages=-]{../slides-pdf/slides-feature-eng-imputation.pdf}

\subsection{Functional Features}
\includepdf[pages=-]{../slides-pdf/slides-feature-eng-functional-data.pdf}

\subsection{Outliers}
\includepdf[pages=-]{../slides-pdf/slides-feature-eng-outlier.pdf}

\subsection{Practical Feature Engineering}
\includepdf[pages=-]{../slides-pdf/slides-feature-eng-practical.pdf}
2,931 changes: 2,931 additions & 0 deletions slides/feature-engineering/data/ames_housing_extended.csv

Large diffs are not rendered by default.

Binary file added slides/feature-engineering/figure_man/custom.pdf
Binary file not shown.
Binary file added slides/feature-engineering/figure_man/custom.png
Binary file added slides/feature-engineering/figure_man/dag.png
Binary file added slides/feature-engineering/figure_man/dbscan.png
Binary file added slides/feature-engineering/figure_man/extract.png
Binary file added slides/feature-engineering/figure_man/n-to-1.pdf
Binary file not shown.
Binary file added slides/feature-engineering/figure_man/tree.pdf
Binary file not shown.
Binary file added slides/feature-engineering/figure_man/tree.png
Binary file added slides/feature-engineering/figure_man/z-score.png
39 changes: 39 additions & 0 deletions slides/feature-engineering/rsrc/ames-encoding.R
@@ -0,0 +1,39 @@
library(tidyverse)
library(mlr)
library(mlrCPO)
library(parallelMap)

parallelStartSocket(7)

data = read.csv("data/ames_housing_extended.csv")

task = data %>%
  select(SalePrice, MS.Zoning, Street, Lot.Shape, Land.Contour, Bldg.Type) %>%
  makeRegrTask(id = "None", target = "SalePrice") %>>% cpoFixFactors()

task1 = createDummyFeatures(task, method = "1-of-n")
task1$task.desc$id = "One-Hot"

task2 = createDummyFeatures(task, method = "reference")
task2$task.desc$id = "Dummy"

lrns = list(
  makeLearner(id = "Linear Regression", "regr.lm"),
  makeLearner(id = "Random Forest", "regr.ranger"))

set.seed(1)
rin = makeResampleInstance(cv10, task1)

res = benchmark(lrns, list(task1, task2, task), rin, mae)

pl = as.data.frame(res) %>%
  filter(task.id != "None" | learner.id != "Linear Regression") %>%
  ggplot(aes(y = mae, x = task.id)) +
  geom_boxplot() +
  facet_wrap(~learner.id) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1)) +
  ylab("Mean Absolute Error") +
  xlab("") + ggtitle("Ames House Price Prediction")

# Pass the plot explicitly rather than relying on last_plot().
ggsave("figure_man/ames-encoding.png", plot = pl, width = 6, height = 5)

# Shut down the socket workers started by parallelStartSocket().
parallelStop()
21 changes: 21 additions & 0 deletions slides/feature-engineering/rsrc/ggsave-1.R
@@ -0,0 +1,21 @@


library(mlr)
library(mlrMBO)
library(ggplot2)
library(gridExtra)
library(reshape2)
library(kernlab)
library(mvtnorm)
library(gptk)
library(smoof)


library(ggplot2)
d = data.frame(
  Method = factor(1:4, labels = c("Linear Regression", "Gradient Boosting",
    "Linear Regression w. Feat. Eng.", "Gradient Boosting w. Feat. Eng.")),
  Error = c(25.5, 10, 11, 9.8)
)
ggplot(data = d) +
  geom_bar(aes(x = Method, y = Error), stat = "identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

############################################################################
190 changes: 190 additions & 0 deletions slides/feature-engineering/rsrc/ggsave-2.R
@@ -0,0 +1,190 @@

library(knitr)
library(mlr)
library(mlrMBO)
library(ggplot2)
library(gridExtra)
library(reshape2)
library(kernlab)
library(mvtnorm)
library(gptk)
library(smoof)


library(mlr)
library(mlrCPO)
library(ggplot2)
library(dplyr)
theme_set(theme_minimal())
#ap = adjust_path(getwd())
data = readr::read_csv("data/ames_housing_extended.csv")
data = data[, ! grepl(pattern = "energy_t", x = names(data))]

data_new = data
names(data_new) = make.names(names(data_new))
data_new = data_new %>%
  dplyr::select(-X1, -Fence, -Pool.QC, -Misc.Feature, -Alley) %>%
  select_if(is.numeric) %>%
  na.omit()

df_plot = data.frame(x = data_new$SalePrice)

gg1_dist = ggplot(df_plot, aes(x)) +
  geom_histogram(aes(y = stat(density)), color = "white", bins = 40L) +
  stat_function(fun = dnorm, col = "red",
    args = list(mean = mean(df_plot$x), sd = sd(df_plot$x))) +
  xlab("Sale Price") +
  ylab("Density")

df_plot = data.frame(x = log(data_new$SalePrice))

gg2_dist = ggplot(df_plot, aes(x)) +
  geom_histogram(aes(y = stat(density)), color = "white", bins = 40L) +
  stat_function(fun = dnorm, col = "red",
    args = list(mean = mean(df_plot$x), sd = sd(df_plot$x))) +
  xlab("Log Sale Price") +
  ylab("Density")

task = removeConstantFeatures(makeRegrTask(id = "Ames Housing", data = data_new, target = "SalePrice"))
lrn = makeLearner("regr.lm")
lrn$id = "No Trafo"
lrn_loglm = cpoLogTrafoRegr() %>>% makeLearner("regr.lm")
lrn_loglm$id = "Log Trafo"
mod = train(learner = "regr.lm", task = task)
mod_log = train(learner = lrn_loglm, task = task)
target = data_new$SalePrice

pred_mod = predict(mod, task)$data$response
pred_mod_log = predict(mod_log, task)$data$response
df_plot = data.frame(target, pred_mod, pred_mod_log)

gg1_pred = ggplot(data = df_plot, aes(x = target, y = pred_mod)) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  xlab("Sale Price") +
  ylab("Predicted Sale Price")

gg2_pred = ggplot(data = df_plot, aes(x = target, y = pred_mod_log)) +
  geom_point(alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  xlab("Sale Price") +
  ylab("exp(Predicted Log Sale Price)")

gridExtra::grid.arrange(gg1_dist, gg1_pred, ncol = 2)

##########################################################

gridExtra::grid.arrange(gg2_dist, gg2_pred, ncol = 2)
#########################################################

set.seed(31415)
rdesc = makeResampleInstance(desc = cv10, task = task)
bmr = benchmark(learners = list(lrn, lrn_loglm), tasks = task, resamplings = rdesc, measures = mae)
plotBMRBoxplots(bmr, pretty.names = FALSE)

###################################################

lrn_ranger = makeLearner("regr.ranger", num.trees = 200L, mtry = 3L)
lrn_ranger$id = "RF No Trafo"
lrn_logranger = cpoLogTrafoRegr() %>>% lrn_ranger
lrn_logranger$id = "RF Log Trafo"
bmr = benchmark(learners = list(lrn, lrn_loglm, lrn_ranger, lrn_logranger), tasks = task, resamplings = rdesc, measures = mae)
plotBMRBoxplots(bmr, pretty.names = FALSE)

#######################################################

data = readr::read_csv("data/ames_housing_extended.csv")
data = data[, ! grepl(pattern = "energy_t", x = names(data))]

data_new = data
names(data_new) = make.names(names(data_new))
task = data_new %>%
  dplyr::select(-X1, -Fence, -Pool.QC, -Misc.Feature, -Alley) %>%
  select_if(is.numeric) %>%
  na.omit() %>%
  makeRegrTask(id = "Ames Housing", data = ., target = "SalePrice") %>%
  removeConstantFeatures()

lrn_kknn_no_scale = makeLearner("regr.kknn", scale = FALSE)
lrn_kknn_no_scale$id = "No Scaling"
lrn_kknn_scale = mlrCPO::cpoScale() %>>% makeLearner("regr.kknn", scale = FALSE)
lrn_kknn_scale$id = "Normalize Features"
lrn_kknn_boxcox = makePreprocWrapperCaret(lrn_kknn_no_scale, ppc.BoxCox = TRUE)
lrn_kknn_boxcox$id = "Box-Cox Trafo"

set.seed(31415)
bmr = benchmark(learners = list(lrn_kknn_no_scale, lrn_kknn_scale, lrn_kknn_boxcox),
  tasks = task, resamplings = cv10, measures = mae)

plotBMRBoxplots(bmr, pretty.names = FALSE)

###################################################

library(mlr)

data = read.csv("data/ames_housing_extended.csv")

data %>%
  dplyr::select(SalePrice, Central.Air, Bldg.Type) %>%
  slice(5:9) %>%
  knitr::kable(format = 'latex') %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 7)

##################################################

data = read.csv("data/ames_housing_extended.csv")
data %>%
  dplyr::select(SalePrice, Central.Air, Bldg.Type) %>%
  slice(5:9) %>%
  createDummyFeatures(target = "SalePrice") %>%
  knitr::kable(format = 'latex') %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 4)

###############################################

library(dplyr)
data = read.csv("data/ames_housing_extended.csv")

data %>%
  filter(!is.na(Foundation)) %>%
  dplyr::select(Foundation) %>%
  group_by(Foundation) %>%
  tally(name = "nk") %>%
  t() %>%
  knitr::kable(format = 'latex') %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 7)
####################################################

data %>%
  mutate(house.id = row_number()) %>%
  dplyr::select(house.id, SalePrice, Foundation) %>%
  filter(Foundation == "Wood") %>%
  t() %>%
  knitr::kable(format = 'latex') %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 7)
###################################################

data %>%
  dplyr::select(SalePrice, Foundation) %>%
  group_by(Foundation) %>%
  dplyr::summarize(`Foundation(enc)` = mean(SalePrice, na.rm = TRUE)) %>%
  t() %>%
  knitr::kable(format = 'latex') %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 7)

#################################################
















86 changes: 86 additions & 0 deletions slides/feature-engineering/rsrc/ggsave-3.R
@@ -0,0 +1,86 @@

library(knitr)
library(mlr)
library(mlrMBO)
library(ggplot2)
library(gridExtra)
library(reshape2)
library(kernlab)
library(mvtnorm)
library(gptk)
library(smoof)
library(tidyverse)
library(stringi)

data = read_csv("data/ames_housing_extended.csv")
set.seed(2)

data %>%
  dplyr::select(matches("energy"), X1) %>%
  sample_n(8) %>%
  gather("Minute", "Energy_Consumption", -X1) %>%
  mutate(Minute = as.numeric(stri_replace(Minute, "", fixed = "energy_t"))) %>%
  ggplot(aes(x = Minute, y = Energy_Consumption, group = X1)) +
  geom_line() +
  facet_grid(X1~.) +
  theme_minimal() +
  ylab("Energy Consumption") +
  xlab("Minute of Day")

########################################################

library(tidyverse)

data = read_csv("data/ames_housing_extended.csv")

data %>%
  mutate(house.id = X1) %>%
  dplyr::select(house.id, matches("energy")) %>%
  gather(name, value, -house.id) %>%
  group_by(house.id) %>%
  dplyr::summarize(mean.energy = mean(value, na.rm = TRUE),
    var.energy = var(value, na.rm = TRUE),
    max.energy = max(value, na.rm = TRUE)) %>%
  sample_n(5) %>%
  mutate("..." = rep("...", times = 5)) %>%
  knitr::kable(format = 'latex') %>%
  kableExtra::kable_styling(latex_options = 'HOLD_position', font_size = 7)

##############################################

library(tidyverse)
library(stringi)
library(mlr)

data = read_csv("data/ames_housing_extended.csv")

task = data %>%
  mutate(lot_area = `Lot Area`) %>%
  dplyr::select(SalePrice, lot_area, matches("energy")) %>%
  na.omit %>%
  makeFunctionalData(fd.features = list("Energy" = 3:ncol(.))) %>%
  makeRegrTask(" ", data = ., target = "SalePrice")
feat.methods = list("Energy" = extractFDAWavelets(filter = "haar"))
lrns = list(
  makeLearner(id = "Boosted Linear Model", "regr.glmboost"),
  makeExtractFDAFeatsWrapper(makeLearner("regr.glmboost"), feat.methods = feat.methods),
  makeLearner(id = "Boosted Functional Linear Model", "regr.FDboost")
)

lrns[[2]]$id = "Boosted Linear Model with Wavelets"
if (FALSE) {
  # Set to TRUE to recompute the benchmark; otherwise the cached result is loaded.
  set.seed(12)
  b1 = benchmark(lrns, task, cv10, measures = mae, keep.pred = FALSE, models = FALSE)
  saveRDS(b1, file = "benchmark_cache/ames.rds")
} else {
  b1 = readRDS("benchmark_cache/ames.rds")
}

plotBMRBoxplots(b1, pretty.names = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1)) +
  ylab("Mean Absolute Error") +
  xlab("") + ggtitle("House Price Prediction")

############################################
