# Section 6. HMC, NUTS, and Stan
2021-10-04
```{r setup, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
```
## Resources
- BDA3 chapter 12 and [reading instructions](`r paste0(CM_URL, "BDA3_ch12_reading-instructions.pdf")`)
- lectures:
- ['6.1 HMC, NUTS, dynamic HMC and HMC specific convergence diagnostics'](https://aalto.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=1744f6a0-84d3-4218-8a86-aae600ba7e84)
- ['6.2 probabilistic programming and Stan'](https://aalto.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=e60ba1a9-f752-4b0a-88c6-aae600caa61a)
- [slides](`r paste0(CM_URL, "slides_ch12.pdf")`)
- [Assignment 6](`r paste0(CM_URL, "assignment-06.pdf")`)
## Notes
### Reading instructions
#### Outline of chapter 12 {-}
- 12.1 Efficient Gibbs samplers (not part of the course)
- 12.2 Efficient Metropolis jump rules (not part of the course)
- 12.3 Further extensions to Gibbs and Metropolis (not part of the course)
- 12.4 Hamiltonian Monte Carlo (used in Stan)
- 12.5 Hamiltonian dynamics for a simple hierarchical model (read through)
- 12.6 Stan: developing a computing environment (read through)
#### Hamiltonian Monte Carlo {-}
- a review of static HMC (where the number of steps in the dynamic simulation is not adaptively selected) is given by @Neal2012-mu
- Stan uses a variant of dynamic Hamiltonian Monte Carlo (with an adaptive number of steps in the dynamic simulation), which has been further developed since BDA3 was published
- The first dynamic HMC variant was by @Hoffman2011-nx
- The No-U-Turn Sampler (NUTS) is often associated with Stan, but the current dynamic HMC variant implemented in Stan has some further developments described (mostly) by @Betancourt2017-bp
- Instead of reading all of the above, you can also watch a video: [Scalable Bayesian Inference with Hamiltonian Monte Carlo](https://www.youtube.com/watch?v=jUSZboSq1zg) by Betancourt
#### Divergences and BFMI {-}
- divergences and Bayesian Fraction of Missing Information (BFMI) are HMC specific convergence diagnostics developed by Betancourt after BDA3 was published
- The divergence diagnostic checks whether the discretized dynamic simulation has problems due to fast-varying density.
- See more in a [case study](http://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html)
- BFMI checks whether momentum resampling in HMC is sufficiently efficient [@Betancourt2016-je]
- [Brief Guide to Stan’s Warnings](https://mc-stan.org/misc/warnings.html) provides a summary of the convergence diagnostics available in Stan and how to interpret them.
### Chapter 12. Computationally efficient Markov chain simulation
(skipping sections 12.1-12.3)
#### 12.4 Hamiltonian Monte Carlo {-}
- the random walk behavior of the Gibbs sampler and the Metropolis algorithm is inherently inefficient
- Hamiltonian Monte Carlo (HMC) uses "momentum" to suppress the local random walk behavior of the Metropolis algorithm
- such that it moves more rapidly through the target distribution
- for each component $\theta_j$ in the target space, there is a corresponding *momentum* variable $\phi_j$
- the posterior density $p(\theta|y)$ is augmented by an *independent* distribution $p(\phi)$ on the momenta: $p(\theta, \phi | y) = p(\phi) p(\theta | y)$
- to update the momentum, HMC requires the gradient of the log-posterior density
- in practice, this gradient is computed analytically (Stan computes it via automatic differentiation)
##### The momentum distribution $p(\phi)$ {-}
- common to use a multivariate normal distribution with mean 0 and a diagonal *mass matrix* $M$
- $\phi_j \sim \text{N}(0, M_{jj})$ for each $j = 1, \dots, d$
- ideally, the mass matrix should scale with the inverse covariance matrix of the posterior distribution $(\text{var}(\theta|y))^{-1}$
##### The three steps of an HMC iteration {-}
1. update $\phi$ by sampling from its posterior (same as its prior): $\phi \sim \text{N}(0, M)$
2. simultaneously update $\theta$ and $\phi$ using a leapfrog algorithm to simulate physical Hamiltonian dynamics where the position and momentum evolve continuously; for $L$ leapfrog steps with scaling factor $\epsilon$:
a. use the gradient of the log-posterior density of $\theta$ to make a half-step of $\phi$: $\phi \leftarrow \phi + \frac{1}{2} \epsilon \frac{d \log p(\theta|y)}{d \theta}$
b. use the momentum $\phi$ to update position $\theta$: $\theta \leftarrow \theta + \epsilon M^{-1} \phi$
c. use the gradient of $\theta$ for the second half-step of $\phi$: $\phi \leftarrow \phi + \frac{1}{2} \epsilon \frac{d \log p(\theta|y)}{d \theta}$
3. calculate the accept/reject ratio $r$ (Equation \@ref(eq:accept-reject-r)) and set $\theta^t = \theta^*$ with probability $\min(r, 1)$; otherwise reject the proposal and set $\theta^t = \theta^{t-1}$ (a minimal R sketch of one iteration follows the equation below)
\begin{equation}
r
= \frac{p(\theta^*, \phi^* | y)}{p(\theta^{t-1}, \phi^{t-1} | y)}
= \frac{p(\theta^*|y) p(\phi^*)}{p(\theta^{t-1}|y) p(\phi^{t-1})}
(\#eq:accept-reject-r)
\end{equation}
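To make these steps concrete, here is a minimal R sketch of a single HMC iteration. It is illustrative only (not Stan's implementation) and assumes user-supplied functions `log_p()` and `grad_log_p()` for the log-posterior density and its gradient, plus a vector `M` holding the diagonal of the mass matrix.

```r
hmc_iteration <- function(theta, log_p, grad_log_p, epsilon, L, M) {
  # step 1: sample the momentum phi ~ N(0, M) with diagonal mass matrix M
  phi <- rnorm(length(theta), mean = 0, sd = sqrt(M))

  # step 2: L leapfrog steps of the discretized Hamiltonian dynamics
  theta_star <- theta
  phi_star <- phi
  for (l in seq_len(L)) {
    phi_star <- phi_star + 0.5 * epsilon * grad_log_p(theta_star)  # half-step for momentum
    theta_star <- theta_star + epsilon * phi_star / M              # full step for position
    phi_star <- phi_star + 0.5 * epsilon * grad_log_p(theta_star)  # half-step for momentum
  }

  # step 3: accept/reject ratio r = p(theta*, phi* | y) / p(theta^(t-1), phi^(t-1) | y)
  log_r <- (log_p(theta_star) + sum(dnorm(phi_star, 0, sqrt(M), log = TRUE))) -
    (log_p(theta) + sum(dnorm(phi, 0, sqrt(M), log = TRUE)))
  if (log(runif(1)) < log_r) theta_star else theta
}
```

For example, with a standard normal target (`log_p <- function(t) -0.5 * sum(t^2)`, `grad_log_p <- function(t) -t`), feeding each returned value back in as the next `theta` produces draws from the target.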
> Pause here and watch the video on HMC by Betancourt: [Scalable Bayesian Inference with Hamiltonian Monte Carlo](https://youtu.be/jUSZboSq1zg).
#### 12.5 Hamiltonian dynamics for a simple hierarchical model {-}
(Walks through the process of deciding on model and HMC parameters and tuning $\epsilon$ and $L$ for HMC.)
#### 12.6 Stan: developing a computing environment {-}
(*Very* briefly describes Stan.)
### Lecture notes
#### 6.1 HMC, NUTS, dynamic HMC and HMC specific convergence diagnostics {-}
*Definitely worth looking at the visualizations in this blog post: [Markov Chains: Why Walk When You Can Flow?](http://elevanth.org/blog/2017/11/28/build-a-better-markov-chain/)*
*Interactively play with HMC and NUTS: [MCMC demo](https://chi-feng.github.io/mcmc-demo/)*
- Hamiltonian Monte Carlo
- uses gradient of log density for more efficient sampling of the posterior
- parameters: the step size $\epsilon$ and the number of leapfrog steps per iteration $L$
- if the step size is too large, the simulated trajectory can get further and further away from the high-density regions, leading to "exploding error"
- can experiment with this in the simulation linked above
- No U-Turn Sampling (NUTS) and dynamic HMC
- adaptively selects the number of steps to improve robustness and sampling efficiency
- "dynamic" HMC refers to dynamic trajectory length
- to maintain the "reversibility" condition of the Markov chain, the trajectory must be simulated in both directions
- dynamic HMC in Stan
- uses a growing tree to extend the simulated trajectory until the no-U-turn stopping criterion is met
- there is a *max tree depth* parameter to control the size of this
- a draw is picked along the trajectory with probability adjusted to account for the error in the discretized dynamic simulation, with higher probability given to parts further from the start
- therefore, the sampler doesn't always end up at the end of the trajectory, but usually somewhere near the end
- Stan can also adapt the mass matrix to "reshape" the posterior, making it more circular and reducing correlations between parameters so that sampling is faster and more efficient
- occurs during the initial adaptation in the warm-up
- max tree depth diagnostic
- hitting the maximum tree depth indicates inefficiency in sampling, leading to higher autocorrelation and lower ESS
- possibly the step size is too small
- different parameterizations matter
- divergences
- HMC specific
- indicates that there are unexpectedly fast changes in log-density
- comes from exploding error, where the trajectory leaves the high-density region of the posterior and gets lost in other directions
- occurs in the funnel geometries common in hierarchical models: in the neck of the funnel the high-density region is so thin that the trajectory can easily leave it and enter a very low-density region
- problematic distributions
- *Nonlinear dependencies*
- simple mass matrix scaling doesn’t help
- *Funnels*
- optimal step size depends on location
- can get stuck in the neck of the funnel, causing divergences (see the reparameterization sketch after this list)
- *Multimodal*
- difficult to move from one mode to another
- can be seen if multiple chains end up in different locations
- *Long-tailed with non-finite variance and mean*
- efficiency of exploration is reduced
- central limit theorem doesn’t hold for mean and variance
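To illustrate why funnels are hard and why "different parameterizations matter", here is a small R sketch (not from the lecture) of a Neal's-funnel-style hierarchical prior in centered and non-centered form; in the non-centered form the sampler works with a nearly isotropic Gaussian instead of the funnel.

```r
set.seed(1)
n <- 1000

# centered form: the scale of theta depends on v, so the joint density is a funnel;
# the neck needs a much smaller step size than the mouth, which tends to cause divergences
v <- rnorm(n, mean = 0, sd = 3)
theta_centered <- rnorm(n, mean = 0, sd = exp(v / 2))

# non-centered form: sample a standard normal and scale it deterministically;
# the sampler then explores (v, theta_raw), which is close to independent Gaussians
theta_raw <- rnorm(n, mean = 0, sd = 1)
theta_noncentered <- theta_raw * exp(v / 2)
```

In practice, divergences are often addressed either by such a reparameterization or by increasing `adapt_delta` (and `max_treedepth` for tree-depth warnings) via the `control` argument of `rstan::stan()`.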
#### 6.2 probabilistic programming and Stan {-}
example: Binomial model
```stan
data {
  int<lower=0> N;               // number of experiments
  int<lower=0, upper=N> y;      // number of successes
}
parameters {
  real<lower=0, upper=1> theta; // parameter of the binomial
}
model {
  theta ~ beta(1, 1);           // prior
  y ~ binomial(N, theta);       // observation model
}
```
example: running Stan from R
```r
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
d_bin <- list(N = 10, y = 7)
fit_bin <- stan(file = "binom.stan", data = d_bin)
```
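Once sampling has finished, the fit can be inspected directly; a brief sketch (exact output depends on the rstan version):
```r
print(fit_bin, probs = c(0.05, 0.5, 0.95))  # posterior summaries, Rhat, and n_eff
theta_draws <- extract(fit_bin)$theta       # posterior draws of theta
mean(theta_draws > 0.5)                     # e.g. posterior probability that theta > 0.5
```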
example: running Stan from Python
```python
import pystan
import stan_utility
data = {"N": 10, "y": 8}
model = stan_utility.compile_model('binom.stan')
fit = model.sampling(data=data)
```
example: Difference between proportions
```stan
data {
  int<lower=0> N1; // number of trials in group 1
  int<lower=0> y1; // number of successes in group 1
  int<lower=0> N2; // number of trials in group 2
  int<lower=0> y2; // number of successes in group 2
}
parameters {
  real<lower=0, upper=1> theta1;
  real<lower=0, upper=1> theta2;
}
model {
  theta1 ~ beta(1, 1);
  theta2 ~ beta(1, 1);
  y1 ~ binomial(N1, theta1);
  y2 ~ binomial(N2, theta2);
}
generated quantities {
  real oddsratio;
  oddsratio = (theta2 / (1 - theta2)) / (theta1 / (1 - theta1));
}
```
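A sketch of fitting this model from R; the file name `binom2.stan` and the counts are illustrative assumptions, and the resulting `fit_bin2` object is what the diagnostic checks below are called on.
```r
d_bin2 <- list(N1 = 674, y1 = 39, N2 = 680, y2 = 22)  # illustrative counts
fit_bin2 <- stan(file = "binom2.stan", data = d_bin2)
draws <- extract(fit_bin2)
mean(draws$oddsratio)                             # posterior mean of the odds ratio
quantile(draws$oddsratio, probs = c(0.05, 0.95))  # 90% posterior interval
```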
some HMC-specific diagnostics with 'rstan'
```r
check_treedepth(fit_bin2)    # iterations that saturated the maximum tree depth
check_energy(fit_bin2)       # low E-BFMI (energy) warnings
check_divergences(fit_bin2)  # divergent transitions after warmup
```
example of scaling data in the Stan language
```stan
data {
  int<lower=0> N; // number of data points
  vector[N] x;
  vector[N] y;
  real xpred;     // input location for prediction
}
transformed data {
  vector[N] x_std;
  vector[N] y_std;
  real xpred_std;
  x_std = (x - mean(x)) / sd(x);
  y_std = (y - mean(y)) / sd(y);
  xpred_std = (xpred - mean(x)) / sd(x);
}
```
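If predictions are then made on the standardized scale, they can be mapped back to the original scale in R; a minimal sketch (the names `fit_lin_std`, `ypred`, and `d_lin` are hypothetical):
```r
ypred_std <- extract(fit_lin_std)$ypred           # draws of the standardized prediction
ypred <- ypred_std * sd(d_lin$y) + mean(d_lin$y)  # undo the standardization of y
```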
- other useful R packages worth looking into:
- 'rstanarm' and 'brms': for quickly building models with the R formula language (a minimal 'brms' sketch follows this list)
- 'shinystan': interactive diagnostics
- 'bayesplot': visualization and model checking (see model checking in Ch 6)
- 'loo': cross-validation model assessment, comparison and averaging (see Ch 7)
- 'projpred': projection predictive variable selection
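As a taste of the formula interface, the binomial model above could be written with 'brms' roughly as follows (a hedged sketch, not from the lecture; note that the intercept is on the logit scale, so its default prior and interpretation differ from `theta` in the Stan model above).
```r
library(brms)

d_bin_df <- data.frame(N = 10, y = 7)
# intercept-only binomial model: y successes out of N trials
fit_bin_brms <- brm(y | trials(N) ~ 1, family = binomial(), data = d_bin_df)
```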
---
```{r}
sessionInfo()
```