-
Notifications
You must be signed in to change notification settings - Fork 36
/
bodyfat.Rmd
514 lines (450 loc) · 23 KB
/
bodyfat.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
---
title: "Projection predictive variable selection – A review and recommendations for the practicing statistician"
author: "[Aki Vehtari](https://users.aalto.fi/~ave/)"
date: "First version 2018-03-06. Last modified `r format(Sys.Date())`."
output:
html_document:
fig_caption: yes
toc: TRUE
toc_depth: 2
number_sections: TRUE
toc_float:
smooth_scroll: FALSE
bibliography: modelsel.bib
csl: harvard-cite-them-right.csl
link-citations: yes
---
# Setup {.unnumbered}
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=FALSE, message=FALSE, error=FALSE, warning=FALSE, comment=NA, out.width='95%')
```
**Load packages**
```{r}
library(here)
library(rstanarm)
options(mc.cores = parallel::detectCores())
library(loo)
library(projpred)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
library(corrplot)
library(knitr)
SEED=1513306866
```
# Introduction
This notebook was inspired by the article [Heinze, Wallisch, and
Dunkler (2018). Variable selection – A review and recommendations for
the practicing statistician](https://doi.org/10.1002/bimj.201700067).
They provide ``an overview of various available variable selection
methods that are based on significance or information criteria,
penalized likelihood, the change-in-estimate criterion, background
knowledge, or combinations thereof.'' I agree that they provide
sensible recommendations and warnings for those methods.
Similar recommendations and warnings hold for information criterion
and naive cross-validation based variable selection in Bayesian
framework as demonstrated by @Piironen+Vehtari:2017a.
@Piironen+Vehtari:2017a demonstrate also the
superior stability of projection predictive variable selection (see
specially figures 4 and 10). In this notebook I demonstrate the projection
predictive variable selection method as presented by @Piironen+etal:projpred:2020 and implemented in R package
[`projpred`](https://cran.r-project.org/package=projpred). I use the
same body fat data as used in Section 3.3 of the article by Heinze,
Wallisch, and Dunkler (2017). The dataset with the background
information is available [here](https://ww2.amstat.org/publications/jse/v4n1/datasets.johnson.html)
but Heinze, Wallisch, and Dunkler have made some data cleaning and I
have used the same data and some bits of the code they provide in the
supplementary material. There still are some strange values like the
one person with zero fat percentage, but I didn't do additional
cleaning.
This notebook was initially created in 2018, but has later been extended to an article "Using reference models in variable selection" by @Pavone+etal:2020.
The excellent performance of the projection predictive variable selection comes from following parts
1. Bayesian inference using priors and integration over all the uncertainties makes it easy to get good predictive performance with all variables included in the model [@Piironen+Vehtari:RHS:2017; @Piironen+Vehtari:ISPC:2018; @Piironen+etal:projpred:2020]
2. Projection of the information from the full model to a smaller model is able to include information and uncertainty from the left out variables (while conditioning of the smaller model to data would ignore left out variables) [@Piironen+etal:projpred:2020; @Pavone+etal:2020].
3. During the search through the model space comparing the predictive distributions of projected smaller models to the predictive distribution of the full model reduces greatly the variance in model comparisons [@Piironen+Vehtari:2017a].
4. Even with greatly reduced variance in model comparison, the selection process slightly overfits to the data, but we can cross-validate this effect using the fast Pareto smoothed importance sampling leave-one-out cross-validation [@Vehtari+etal:PSIS-LOO:2017; @Vehtari+etal:PSIS:2019]
Excellent performance of projection predictive variable selection
compared to other Bayesian variable selection methods was presented by
@Piironen+Vehtari:2017a. @Piironen+etal:projpred:2020 present further
improvements such as improved model size selection and several options
to make the approach faster for larger number of variables or bigger
data sets. @Vehtari+Ojanen:2012 present
theoretical justification for projection predictive model selection
and inference after selection.
Note that if the goal is only the prediction no variable selection is
needed. The projection predictive variable selection can be used to
learn which are the most useful variables for making predictions and
potentially reduce the future measurement costs. In the bodyfat
example, most of the measurements have time cost and there is a
benefit of finding the smallest set of variables to be used in the
future for the predictions.
This notebook presents a linear regression example, but the projection predictive approach can be used also, for example, with generalized linear models [@Piironen+Vehtari:2017a; @Piironen+etal:projpred:2020], Gaussian processes [@Piironen+Vehtari:GP-projection:2016], and generalized linear and additive multilevel models [@Catalina+etal:projpredgamms:2020].
# Bodyfat data
Load data and scale it. Heinze, Wallisch, and Dunkler (2018) used unscaled data, but we scale it for easier comparison of the effect sizes. In theory this scaling should not have detectable difference in the predictions and I did run the results also without scaling and there is no detectable difference in practice.
```{r}
df <- read.table(here("bodyfat.txt"), header = T, sep = ";")
df[,4:19] <- scale(df[,4:19])
# no-one can have 0% body fat
df <- df[df$siri>0,]
y <- df[,"siri"]
df <- as.data.frame(df)
n <- nrow(df)
```
Lists of predictive and target variables, and formula.
```{r}
pred <- c("age", "weight", "height", "neck", "chest", "abdomen", "hip",
"thigh", "knee", "ankle", "biceps", "forearm", "wrist")
target <- "siri"
formula <- formula(paste("siri ~", paste(pred, collapse = " + ")))
p <- length(pred)
```
Plot correlation structure
```{r}
corrplot(cor(df[, c(target,pred)]))
```
# Regression model with regularized horseshoe prior
Fit full Bayesian model. We use weakly informative regularized
horseshoe prior [Piironen+Vehtari:RHS:2017] to include prior
assumption that some of the variables might be irrelevant. The
selected $p_0$ is mean for the prior assumption. See @Piironen+Vehtari:RHS:2017 for the uncertainty around the
implied prior for effectively non-zero coefficients.
```{r}
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
rhs_prior <- hs(global_scale=tau0)
fitrhs <- stan_glm(formula, data = df, prior=rhs_prior, QR=TRUE,
seed=SEED, refresh=0)
summary(fitrhs)
```
Make graphical posterior predictive check to check that the model
predictions make sense.
```{r}
yrep <- posterior_predict(fitrhs, draws = 50)
ppc_dens_overlay(y, yrep)
```
Kernel density estimate for the data and posterior predictive replicates are similar. Although data do not have any negative $y$ values, the kernel density estimate is smoothing over to the negative side. The predictive distribution of the model is noy constrained to be positive, but the error is small and we continue with the variable selection.
Plot marginal posterior of the coefficients.
```{r}
mcmc_areas(as.matrix(fitrhs)[,2:14])
```
We can see that the posterior of abdomen coefficient is far away from
zero, but it's not as clear what other variables should be included. `weight` has wide marginal overlapping zero, which hints potentially relevant variable with correlation in joint posterior.
Looking at the marginals has the problem that correlating variables may
have marginal posteriors overlapping zero while joint posterior
typical set does not include zero. Compare, for example, marginals of `height`
and `height` above to their joint distribution below.
```{r}
mcmc_scatter(as.matrix(fitrhs), pars = c("height", "weight"))+geom_vline(xintercept=0)+geom_hline(yintercept=0)
```
Projection predictive variable selection is easily made with
`cv_varsel` function, which also computes an LOO-CV estimate of the
predictive performance for the best models with certain number of
variables. Heinze, Wallisch, and Dunkler (2018) ``consider abdomen and
height as two central IVs [independent variables] for estimating body
fat proportion, and will not subject these two to variable
selection.'' We subject all variables to selection.
```{r, results='hide'}
fitrhs_cvvs <- cv_varsel(fitrhs, method = 'forward', cv_method = 'loo',
nloo = n, verbose = FALSE)
```
And the estimated predictive performance of smaller models compared to the full model.
```{r}
plot(fitrhs_cvvs, stats = c('elpd', 'rmse'), deltas=FALSE)
```
Based on the plot 2 variables and projected posterior provide
practically the same predictive performance as the full model. We can
get a PSIS-LOO [@Vehtari+etal:PSIS-LOO:2017] based recommendation for
the model size to choose.
```{r}
(nsel <- suggest_size(fitrhs_cvvs, alpha=0.1))
(vsel <- solution_terms(fitrhs_cvvs)[1:nsel])
```
Based on this recommendation we continue with two variables `abdomen`
and `weight`. The model selected by Heinze, Wallisch, and
Dunkler (2018) had seven variables `height` (fixed), `abdomen` (fixed),
`wrist`, `age`, `neck`, `forearm`, and `chest`-
Form the projected posterior for the selected model.
```{r}
projrhs <- project(fitrhs_cvvs, nv = nsel, ns = 4000)
```
Plot the marginals of the projected posterior.
```{r}
mcmc_areas(as.matrix(projrhs), pars = vsel)
```
So far we have seen that `projpred` selected a smaller set of variables
which have very similar predictive performance as the full model. Let's
compare next the stability of the approaches. Heinze, Wallisch, and
Dunkler (2018) repeated the model selection using 1000 bootstrapped
datasets. Top 20 models selected have 5--9 variables, the highest
selection frequency is 3.2%, and cumulative selection frequency for
top 20 models is 29.5%. These results clearly illustrate instability
of the selection method they used.
Before looking at the corresponding bootstrap results we can look at the
stability of selection process based on the LOO-CV selection paths
computed by `cv_varsel` (the code to make the following plot will be
included in __projpred__ package).
```{r}
source("projpredpct.R")
rows <- nrow(fitrhs_cvvs[["pct_solution_terms_cv"]])
col <- nrow(fitrhs_cvvs[["pct_solution_terms_cv"]])
pctch <- round(fitrhs_cvvs[["pct_solution_terms_cv"]], 2)
colnames(pctch)[1] <- ".size"
pct <- get_pct_arr(pctch, 13)
col_brks <- get_col_brks()
pct$val_grp <- as.character(sapply(pct$val, function(x) sum(x >= col_brks$breaks)))
if (identical(rows, 0)) rows <- pct$var[1]
pct$sel <- (pct$.size == col) & (pct$var %in% rows)
brks <- sort(unique(as.numeric(pct$val_grp)) + 1)
ggplot(pct, aes_(x = ~.size, y = ~var)) +
geom_tile(aes_(fill = ~val_grp, color = ~sel),
width = 1, height = 0.9, size = 1) +
geom_text(aes_(label = ~val, fontface = ~sel+1)) +
coord_cartesian(expand = FALSE) +
scale_y_discrete(limits = rev(levels(pct$var))) +
scale_x_discrete(limits = seq(1,col)) +
scale_color_manual(values = c("white", "black")) +
labs(x = "Model size", y = "",
title = "Fraction of cv-folds that select the given variable") +
scale_fill_manual(breaks = brks, values = col_brks$pal[brks]) +
theme_proj() +
theme(legend.position = "none",
axis.text.y = element_text(angle = 45))
```
For model sizes 1-3 selection paths in different LOO-CV cases are always the same `abdomen?, `weight`, and `wrist`. For larger model sizes there are some small variation, but mostly the order is quite consistent.
Running `stan_glm` with `prior=hs()` and `cv_varsel` do not take much time when run only once, but for a notebook running them 1000 times would take hours. The code for running the above variable selection procedure for 100 different bootstrapped datasets is as follows.
```{r}
writeLines(readLines("bodyfat_bootstrap.R"))
```
In theory LOO-CV should have smaller variation than bootstrap, and
also in practice we see much more variation in the bootstrap
results. In the basic bootstrap on average only 63% of the original
data is included, that is, in this case on average the amount of
unique observations in bootstrapped data is 159 while full data has
n=251, which explains increased variability in variable
selection. From 100 bootstrap iterations model size 2 was selected 32
times and model size 3 28 times.
The bootstrap inclusion frequencies with `projpred` and with
`step(lm(...)))` (resuls from Heinze, Wallisch, and Dunkler (2018)) are
shown in the following table
```{r echo = FALSE, results = 'asis'}
load("bodyfat_bootstrap.RData")
kable(boot_inclusion, caption = "Bootstrap inclusion probabilities.")
```
Heinze, Wallisch, and Dunkler (2018) had fixed that abdomen and height are always included. `projpred` selects abdomen always, but height is included only in 35% iterations. Coefficients of weight and height are strongly correlated as shown above, and thus it is not that surprising that `projpred` selects weight instead ogf height. Five most often selected variables by `projpred` in bootstrap iterations are the same five and in the same order as by `cv_varsel` function with full data. Overall `projpred` selects smaller models and thus the bootstrap inclusion probabilities are smaller.
The following table shows top 10 `projpred` models and model selection
frequencies from bootstrap iterations.
|model | variables | frequency|
|:---------|:----------|--------:|
| 1 |abdomen, weight| 37|
| 2 |abdomen, wrist| 10|
| 3 |abdomen, height| 10|
| 4 |abdomen, height, wrist| 9|
| 5 |abdomen, weight, wrist| 8|
| 6 |abdomen, chest, height, wrist| 3|
| 7 |abdomen, height, neck, wrist| 2|
| 8 |abdomen, age, wrist| 2|
| 9 |abdomen, age, height, neck, thigh, wrist| 2|
| 10|abdomen, chest| 1|
Table: projpred model selection frequencies in bootstrap
Top 10 models selected have 2--5 variables, the highest selection
frequency is 38%, and the cumulative selection frequency for top 10
models is 84% and for top 20 models 94%. This demonstrates that
`projpred` is much more stable.
Heinze, Wallisch, and Dunkler (2017) focused on which variables were
selected and coefficient estimates, but they did not consider the
predictive performance of the models. We can estimate the predictive
performance of the selected model via cross-validation (taking into
account the effect of the selection process, too). As part of the
variable selection `cv_varsel` computes also PSIS-LOO estimates for the
full model and all submodels taking into account the selection
process. For Bayesian models we would usually report expected log
predictive density as it assesses the goodnes of the whole predictive
distribution. Since we now compare results to `lm` we estimate also root
mean square error (rmse) of mean predictions.
```{r}
loormse_full <- sqrt(mean((df$siri-fitrhs_cvvs$summaries$ref$mu)^2))
loormse_proj <- sqrt(mean((df$siri-fitrhs_cvvs$summaries$sub[[nsel]]$mu)^2))
print(paste('PSIS-LOO RMSE Bayesian full model: ', round(loormse_full,1)))
print(paste('PSIS-LOO RMSE selected projpred model: ', round(loormse_proj,1)))
```
Since we do get these cross-validation estimates using PSIS-LOO, we do
not need to run K-fold-CV, but since we need to run K-fold-CV for lm +
step, we did run also K-fold-CV for `projpred` selection process. As this
take some time, hre only the code is shown and we load seprately run
results. The following code computes 20-fold-CV for the Bayesian model
and `projpred` selected model.
```{r}
writeLines(readLines("bodyfat_kfoldcv.R"))
```
We load and print the results
```{r}
load(file="bodyfat_kfoldcv.RData")
print(paste('20-fold-CV RMSE Bayesian full model: ', round(rmse_full,1)))
print(paste('20-fold-CV RMSE Bayesian projpred model: ', round(rmse_proj,1)))
```
The results are very close to PSIS-LOO results. Going from the full
Bayesian model to a smaller projection predictive model, we get
practically the same 20-fold-CV RMSE, which very good performance
considering we dropped from 13 covariates to 2 covariates.
Then compute 20-fold-CV for the lm model and step selected model (this
is fast enough to include in a notebook).
```{r}
set.seed(SEED)
perm <- sample.int(n)
K <- 20
idx <- ceiling(seq(from = 1, to = n, length.out = K + 1))
bin <- .bincode(perm, breaks = idx, right = FALSE, include.lowest = TRUE)
lmmuss <- list()
lmvsmuss <- list()
lmvsnvss <- list()
lmvsnlmuss <- list()
lmvsnlnvss <- list()
for (k in 1:K) {
message("Fitting model ", k, " out of ", K)
omitted <- which(bin == k)
lmfit_k <- lm(formula, data = df[-omitted,, drop=FALSE], x=T,y=T)
lmmuss[[k]] <- predict(lmfit_k, newdata = df[omitted, , drop = FALSE])
sel_k <- step(lm(formula, data = df[-omitted,, drop=FALSE], x=T,y=T),
direction = "backward",
scope = list(upper = formula,
lower = formula(siri~abdomen+height)),
trace = 0)
lmvsmuss[[k]] <- predict(sel_k, newdata = df[omitted, , drop = FALSE])
lmvsnvss[[k]] <- length(coef(sel_k))-1
# compute also a version without fixing abdomen and height
selnl_k <- step(lm(formula, data = df[-omitted,, drop=FALSE], x=T,y=T),
direction = "backward",
trace = 0)
lmvsnlmuss[[k]] <- predict(selnl_k, newdata = df[omitted, , drop = FALSE])
lmvsnlnvss[[k]] <- length(coef(selnl_k))-1
}
lmmus<-unlist(lmmuss)[order(as.integer(names(unlist(lmmuss))))]
lmvsmus<-unlist(lmvsmuss)[order(as.integer(names(unlist(lmvsmuss))))]
lmvsnvs <- unlist(lmvsnvss)
lmvsnlmus<-unlist(lmvsnlmuss)[order(as.integer(names(unlist(lmvsnlmuss))))]
lmvsnlnvs <- unlist(lmvsnlnvss)
rmse_lmfull <- sqrt(mean((df$siri-lmmus)^2))
rmse_lmsel <- sqrt(mean((df$siri-lmvsmus)^2))
rmse_lmselnl <- sqrt(mean((df$siri-lmvsnlmus)^2))
```
```{r}
print(paste('20-fold-CV RMSE lm full model: ', round(rmse_lmfull,1)))
print(paste('20-fold-CV RMSE lm step selected model: ', round(rmse_lmsel,1)))
```
We see that simpler maximum likelihood could provide similar 20-fold-CV
RMSE as much slower Bayesian inference with a fancy regularized
horseshoe prior. Also the model selection process did not overfit and
the model selection with step has similar 20-fold-CV RMSE as `projpred`
result. Neither of these results are not surprising as $n=251 \gg
p=13$. However `projpred` has much more stable selection process and
produced much smaller models with the same accuracy.
Heinze, Wallisch, and Dunkler (2018) also write ``In routine work,
however, it is not known a priori which covariates should be included
in a model, and often we are confronted with the number of candidate
variables in the range 10-30. This number is often too large to be
considered in a statistical model.'' I strongly disagree with this as
there are many statistical models working with more than million
candidate variables (see, e.g., @Peltola+etal:finite:2012). As the bodyfat dataset proved to be
quite easy in that sense that maximum likelihood performed well
compared to Bayesian approach, let's make the dataset a bit more
challenging.
We add 87 variables which are random normal distributed noise and thus are
not related to bodyfat in any way. We have now total of 100 variables.
```{r}
set.seed(SEED)
noise <- array(rnorm(87*n), c(n,87))
dfr<-cbind(df,noise=noise)
formula2<-formula(paste(paste("siri~", paste(pred, collapse = "+")),
"+",paste(colnames(dfr[,20:106]), collapse = "+")))
```
Given this new dataset compute do variable selection with full data and
compute 20-fold-CV for the lm model and step selected model.
```{r}
sel2 <- step(lm(formula2, data = dfr, x=T,y=T),
direction = "backward",
scope = list(upper = formula2,
lower = formula(siri~abdomen+height)),
trace = 0)
lmmuss <- list()
lmvsmuss <- list()
lmvsnvss <- list()
lmvsnlmuss <- list()
lmvsnlnvss <- list()
for (k in 1:K) {
message("Fitting model ", k, " out of ", K)
omitted <- which(bin == k)
lmfit_k <- lm(formula2, data = dfr[-omitted,, drop=FALSE], x=T,y=T)
lmmuss[[k]] <- predict(lmfit_k, newdata = dfr[omitted, , drop = FALSE])
sel_k <- step(lm(formula2, data = dfr[-omitted,, drop=FALSE], x=T,y=T),
direction = "backward",
scope = list(upper = formula2,
lower = formula(siri~abdomen+height)),
trace = 0)
lmvsmuss[[k]] <- predict(sel_k, newdata = dfr[omitted, , drop = FALSE])
lmvsnvss[[k]] <- length(coef(sel_k))-1
selnl_k <- step(lm(formula2, data = dfr[-omitted,, drop=FALSE], x=T,y=T),
direction = "backward",
trace = 0)
lmvsnlmuss[[k]] <- predict(selnl_k, newdata = dfr[omitted, , drop = FALSE])
lmvsnlnvss[[k]] <- length(coef(selnl_k))-1
}
lmmus<-unlist(lmmuss)[order(as.integer(names(unlist(lmmuss))))]
lmvsmus<-unlist(lmvsmuss)[order(as.integer(names(unlist(lmvsmuss))))]
lmvsnvs <- unlist(lmvsnvss)
lmvsnlmus<-unlist(lmvsnlmuss)[order(as.integer(names(unlist(lmvsnlmuss))))]
lmvsnlnvs <- unlist(lmvsnlnvss)
rmse_lmfull <- sqrt(mean((df$siri-lmmus)^2))
rmse_lmsel <- sqrt(mean((df$siri-lmvsmus)^2))
rmse_lmselnl <- sqrt(mean((df$siri-lmvsnlmus)^2))
```
```{r}
print(names(coef(sel2)))
print(paste('20-fold-CV RMSE lm full model: ', round(rmse_lmfull,1)))
print(paste('20-fold-CV RMSE lm step selected model: ', round(rmse_lmsel,1)))
```
Variable selection with `step` has selected 38 variables from which 32
are random noise. 20-fold-CV RMSE for full lm model and selected model
are higher than with the original data.
How about then Bayesian model with regularized horseshoe and projection
predictive variable selection?
```{r}
p <- 100
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
hs_prior <- hs(global_scale=tau0)
fitrhs2 <- stan_glm(formula2, data = dfr, prior = hs_prior, QR = TRUE,
seed=SEED, refresh=0)
fitrhs2_cvvs <- cv_varsel(fitrhs2, method = 'forward', cv_method = 'loo',
nloo=n, verbose = FALSE)
nsel2 <- suggest_size(fitrhs2_cvvs, alpha=0.1)
vsel2 <- solution_terms(fitrhs2_cvvs)[1:nsel2]
loormse_full2 <- sqrt(mean((df$siri-fitrhs2_cvvs$summaries$ref$mu)^2))
loormse_proj2 <- sqrt(mean((df$siri-fitrhs2_cvvs$summaries$sub[[nsel2]]$mu)^2))
print(paste('PSIS-LOO RMSE Bayesian full model: ', round(loormse_full2,1)))
print(paste('PSIS-LOO RMSE selected projpred model: ', round(loormse_proj2,1)))
```
Variable selection with `projpred` has selected 2 variables from which 0
are random noise. PSIS-LOO RMSE for the full Bayesian model and the
selected `projpred` model are similar as with the original data.
If you don't trust PSIS-LOO you can run the following K-fold-CV to
get almost the same result.
```{r}
writeLines(readLines("bodyfat_kfoldcv2.R"))
```
For this notebook this was run separately and the saved results are shown below
```{r}
load(file="bodyfat_kfoldcv2.RData")
print(paste('10-fold-CV RMSE Bayesian full model: ', round(rmse_full,1)))
print(paste('10-fold-CV RMSE Bayesian projpred model: ', round(rmse_proj,1)))
```
<br />
# References {.unnumbered}
<div id="refs"></div>
# Licenses {.unnumbered}
* Code © 2018, Aki Vehtari, licensed under BSD-3.
* Text © 2018-2021, Aki Vehtari, licensed under CC-BY-NC 4.0.
# Original Computing Environment {.unnumbered}
```{r}
sessionInfo()
```
<br />