---
title: "Bayesian variable selection for candy ranking data"
author: "[Aki Vehtari](https://users.aalto.fi/~ave/)"
date: "First version 2018-02-27. Last modified `r format(Sys.Date())`."
output:
  html_document:
    self_contained: true
    fig_caption: yes
    toc: TRUE
    toc_depth: 2
    number_sections: TRUE
    toc_float:
      smooth_scroll: FALSE
    code_download: true
bibliography: modelsel.bib
csl: harvard-cite-them-right.csl
link-citations: yes
---
# Setup {.unnumbered}
```{r setup, include=FALSE}
knitr::opts_chunk$set(cache=FALSE, message=FALSE, error=FALSE, warning=FALSE, comment=NA, out.width='95%')
```
**Load packages**
```{r}
library(rstanarm)
options(mc.cores = parallel::detectCores())
library(loo)
library(tidyverse)
library(bayesplot)
library(projpred)
library(fivethirtyeight)
SEED=150702646
```
# Introduction
This notebook was inspired by Joshua Loftus' two blog posts [Model selection bias invalidates significance tests](http://joshualoftus.com/post/model-selection-bias-invalidates-significance-tests/) and [A conditional approach to inference after model selection](http://joshualoftus.com/post/conditional-approach-to-inference-after-model-selection/).
In this notebook we illustrate Bayesian inference for model selection, including PSIS-LOO [@Vehtari+etal:PSIS-LOO:2017] and the projection predictive approach [@Piironen+etal:projpred:2020; @Piironen+Vehtari:2017a], which provides decision-theoretically justified inference after model selection.
# Data
We use the candy rankings data from the fivethirtyeight package. The dataset was originally used in [a fivethirtyeight story](http://fivethirtyeight.com/features/the-ultimate-halloween-candy-power-ranking/).
```{r}
df <- candy_rankings %>%
select(-competitorname) %>%
mutate_if(is.logical, as.numeric)
head(df)
```
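For orientation, a quick optional look at the size of the data and the distribution of the target variable (not part of the original analysis):
```{r}
dim(df)
summary(df$winpercent)
```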
# Null data
We start by analysing a "null" data set, in which winpercent has been replaced with random draws from a normal distribution, so that the covariates do not contain any predictive information.
```{r}
dfr <- df %>% select(-winpercent)
n <- nrow(dfr)
p <- ncol(dfr)
prednames <- colnames(dfr)
set.seed(SEED)
ry = rnorm(n)
dfr$ry <- ry
(reg_formula <- formula(paste("ry ~", paste(prednames, collapse = " + "))))
```
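As a quick sanity check (not part of the original analysis), the sample correlations between the random target and the covariates should all be fairly small:
```{r}
# correlations between the random target and each covariate; any nonzero
# values here are due to sampling variation only
round(cor(dfr[, prednames], dfr$ry), 2)
```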
The `rstanarm` package provides `stan_glm`, which accepts the same arguments as `glm` but performs full Bayesian inference using Stan ([mc-stan.org](https://mc-stan.org)). In doing variable selection we are in any case assuming that some of the variables are not relevant, so it is sensible to use a prior which assumes some of the covariate effects are close to zero. We use the regularized horseshoe prior [@Piironen+Vehtari:RHS:2017], which has a lot of prior mass near 0 but also thick tails that allow relevant effects to escape the shrinkage.
```{r}
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
hs_prior <- hs(df=1, global_df=1, global_scale=tau0)
t_prior <- student_t(df = 7, location = 0, scale = 2.5)
fitrhs <- stan_glm(reg_formula, data = dfr,
prior = hs_prior, prior_intercept = t_prior,
seed=SEED, refresh=0)
```
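For reference, the global scale used above follows the recommendation of @Piironen+Vehtari:RHS:2017,
$$
\tau_0 = \frac{p_0}{p - p_0}\,\frac{\sigma}{\sqrt{n}},
$$
where $p_0$ is the prior guess for the number of relevant variables; the $\sigma$ factor is left out of the code, relying on rstanarm's internal scaling of the prior by the residual standard deviation.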
Let's look at the summary:
```{r}
summary(fitrhs)
```
We didn't get any divergences, the Rhat values are less than 1.1, and the effective sample sizes (n_eff) are large enough to be useful (see, e.g., [RStan workflow](http://mc-stan.org/users/documentation/case-studies/rstan_workflow.html)).
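The same checks can be done programmatically; a minimal sketch, assuming the summary matrix exposes `Rhat` and `n_eff` columns as in current rstanarm:
```{r}
fit_summary <- summary(fitrhs)
max(fit_summary[, "Rhat"], na.rm = TRUE)   # should be well below 1.1
min(fit_summary[, "n_eff"], na.rm = TRUE)  # should not be tiny
```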
```{r}
mcmc_areas(as.matrix(fitrhs), prob_outer = .95)
```
All 95% posterior intervals overlap 0: the regularized horseshoe prior makes the posteriors concentrate near 0, but there is still some uncertainty.
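The visual impression can be confirmed numerically; a small sketch counting how many marginal 95% intervals contain zero (all of them should, here):
```{r}
ci95 <- posterior_interval(fitrhs, prob = 0.95)
betas <- setdiff(rownames(ci95), c("(Intercept)", "sigma"))
sum(ci95[betas, 1] < 0 & ci95[betas, 2] > 0)  # number of coefficient intervals containing 0
```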
We can easily test whether any of the covariates are useful by using cross-validation to compare against a null model:
```{r}
fit0 <- stan_glm(ry ~ 1, data = dfr, seed=SEED, refresh=0)
```
```{r}
(loorhs <- loo(fitrhs))
(loo0 <- loo(fit0))
loo_compare(loo0, loorhs)
```
Based on cross-validation, the covariates together do not contain any useful information, and there is no need to continue with variable selection. This step of checking whether the full model has any predictive power is often skipped, especially when non-Bayesian methods are used. If loo (or AIC, as Joshua Loftus demonstrated) were used for stepwise variable selection, the selection process over a large number of models could overfit to the data.
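A rough numerical rule of thumb for reading the comparison (a sketch, not part of the original analysis): if `elpd_diff` is within about two `se_diff` of zero, the models cannot be clearly distinguished.
```{r}
cmp <- loo_compare(loo0, loorhs)
cmp[2, "elpd_diff"] / cmp[2, "se_diff"]  # ratio of the difference to its standard error
```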
To illustrate the robustness of projpred, we run projection predictive variable selection using the previous model for the "null" data. A fast leave-one-out cross-validation approach [@Vehtari+etal:PSIS-LOO:2017] is used to choose the model size.
```{r, results='hide'}
fitrhs_cv <- cv_varsel(fitrhs, method='forward', cv_method='loo', n_loo=n)
```
```{r}
solution_terms(fitrhs_cv)
```
We can now look at the estimated predictive performance of smaller models compared to the full model.
```{r}
plot(fitrhs_cv, stats = c('elpd', 'rmse'))
```
We can see that the differences relative to the full model are very small.
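The same comparison can also be inspected in tabular form; a sketch, assuming the `summary()` method for `vsel` objects and its `stats` argument in the installed projpred version:
```{r}
summary(fitrhs_cv, stats = c('elpd', 'rmse'))
```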
We also get a LOO-based recommendation for the model size to choose:
```{r}
(nv <- suggest_size(fitrhs_cv, alpha=0.1))
```
We see that projpred agrees that no variables have useful information.
Next we form the projected posterior for the chosen model.
```{r}
projrhs <- project(fitrhs_cv, nterms = nv, ndraws = 4000)
round(colMeans(as.matrix(projrhs)),1)
round(posterior_interval(as.matrix(projrhs)),1)
```
This looks good, as the true values for the "null" data are intercept = 0 and sigma = 1.
# Original data
Next we repeat the above analysis with the original target variable winpercent.
```{r}
reg_formula <- formula(paste("winpercent ~", paste(prednames, collapse = " + ")))
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(p-p0) * 1/sqrt(n)
hs_prior <- hs(df=1, global_df=1, global_scale=tau0)
t_prior <- student_t(df = 7, location = 0, scale = 2.5)
fitrhs <- stan_glm(reg_formula, data = df,
prior = hs_prior, prior_intercept = t_prior,
seed=SEED, refresh=0)
```
Let's look at the summary:
```{r}
summary(fitrhs)
```
We didn't get any divergences, the Rhat values are less than 1.1, and the effective sample sizes (n_eff) are large enough to be useful.
```{r}
mcmc_areas(as.matrix(fitrhs), prob_outer = .95)
```
The 95% posterior interval for `chocolateTRUE` does not overlap 0, so there may be something useful here.
In the case of collinear variables it is possible that the marginal posteriors overlap 0 even though the covariates are useful for prediction. With many variables it is difficult to analyse the joint posterior to see which variables are jointly relevant. As before, we can test whether any of the covariates are useful by using cross-validation to compare against a null model; an optional collinearity check is sketched first.
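A quick look at collinearity among the covariates (a sketch, not part of the original analysis):
```{r}
# largest absolute pairwise correlation among the covariates
cmat <- cor(select(df, -winpercent))
max(abs(cmat[upper.tri(cmat)]))
```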
```{r}
fit0 <- stan_glm(winpercent ~ 1, data = df, seed=SEED, refresh=0)
```
```{r}
(loorhs <- loo(fitrhs))
(loo0 <- loo(fit0))
loo_compare(loo0, loorhs)
```
Based on cross-validation, the covariates together do contain useful information. If we need just the predictions we can stop here, but if we want to learn more about the relevance of the covariates, we can continue with variable selection.
We run projection predictive variable selection using the previous model, now fitted to the original data. A fast leave-one-out cross-validation approach is used to choose the model size.
```{r, results='hide'}
fitrhs_cv <- cv_varsel(fitrhs, method='forward', cv_method='loo', n_loo=n)
```
```{r}
solution_terms(fitrhs_cv)
```
We can now look at the estimated predictive performance of smaller models compared to the full model.
```{r}
plot(fitrhs_cv, stats = c('elpd', 'rmse'))
```
Only one variable seems to be needed to get the same performance as the full model.
We also get a LOO-based recommendation for the model size to choose:
```{r}
(nsel <- suggest_size(fitrhs_cv, alpha=0.1))
(vsel <- solution_terms(fitrhs_cv)[1:nsel])
```
projpred recommends using just one variable.
Next we form the projected posterior for the chosen model.
```{r}
projrhs <- project(fitrhs_cv, nterms = nsel, ndraws = 4000)
projdraws <- as.matrix(projrhs)
colnames(projdraws) <- c("Intercept",vsel,"sigma")
round(colMeans(projdraws),1)
round(posterior_interval(projdraws),1)
```
```{r}
mcmc_areas(projdraws)
```
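As a rough additional check (a sketch, not part of the original analysis, assuming the single selected term is the chocolate indicator), we can compute the in-sample RMSE of the projected one-variable model from the projected draws:
```{r}
# posterior-mean linear predictor of the projected model: intercept + chocolate effect
X <- cbind(1, df$chocolate)
mu_hat <- rowMeans(X %*% t(projdraws[, 1:2]))
sqrt(mean((df$winpercent - mu_hat)^2))  # in-sample RMSE of the one-variable model
```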
In our loo and projpred analysis, we find `chocolateTRUE` to have predictive information. Other variables may have predictive power too, but conditional on `chocolateTRUE` they do not provide additional information.
<br />
# References {.unnumbered}
<div id="refs"></div>
# Licenses {.unnumbered}
* Code © 2017-2018, Aki Vehtari, licensed under BSD-3.
* Text © 2017-2018, Aki Vehtari, licensed under CC-BY-NC 4.0.
# Original Computing Environment {.unnumbered}
```{r}
sessionInfo()
```
<br />