# Advanced regression and multilevel models
**Learning objective:**
- Introduction to other methods not covered in detail in this book
## Expressing the models so far in a common framework
Much can be done with the basic model (a brief sketch follows the list below):
$$
y = X\beta + \epsilon
$$
* Maximum likelihood for point estimation
* Including priors (regularization)
* Sampling the posterior to include uncertainty
* Forecasting including predictive uncertainty
* Additive nonlinear models (e.g. polynomials)
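As a concrete illustration, here is a minimal sketch of how `stan_glm` from **rstanarm** covers these pieces in one framework. The data are simulated purely for illustration:

```r
library(rstanarm)

# Simulated data (hypothetical, for illustration only)
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 1 + 2 * d$x + rnorm(100)

# Fit: the default weakly informative priors act as regularization
fit <- stan_glm(y ~ x, data = d, refresh = 0)

coef(fit)                     # point estimates (posterior medians)
sims <- as.matrix(fit)        # posterior draws: uncertainty in the parameters
preds <- posterior_predict(fit, newdata = data.frame(x = 1))  # predictive uncertainty
```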
## Incomplete data
Imputing missing data was covered in Chapter 17, but there are other types of incomplete data, for example:
* Survival analysis / censoring - see, for example, [Introduction to Statistical Learning](https://www.statlearning.com/) for more on this.
* Measurement error:
    * Errors in $y$ can be folded into the error term.
    * Errors in $x$ are more challenging; the mathematics is similar to the *instrumental variables* of Chapter 21 (a small simulation sketch follows this list). The book has references for more.
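The difficulty with errors in $x$ can be seen in a quick simulation. This is a minimal sketch with made-up data, not a method from the book:

```r
# Measurement error in x biases the least-squares slope toward zero ("attenuation"),
# while comparable noise in y only widens the residual standard deviation.
set.seed(1)
n <- 1000
x_true <- rnorm(n)
y <- 1 + 2 * x_true + rnorm(n)       # true slope is 2
x_obs <- x_true + rnorm(n)           # predictor observed with error

coef(lm(y ~ x_true))  # slope estimated near 2
coef(lm(y ~ x_obs))   # slope attenuated toward roughly 1
```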
## Correlated errors
This book assumes uncorrelated errors, but the models can be generalized to handle correlated/structured errors. Examples include:
* Time series
* Spatial correlation
* Networks / Graphs
* Factor analysis
Each of these examples has inspired textbook-length treatments!
## Many Predictors
* Often we want to include a large number of predictors, e.g. to make ignorability more plausible.
* The challenge is to avoid overfitting, which leads to increased variance and uncertainty in the estimates. If you cannot reduce the number of parameters, some options include:
    * Combine (related) predictors in a structured way.
    * Regularization, e.g. informative priors, the horseshoe prior (briefly mentioned in Chapter 11), and the lasso (see the sketch after this list).
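As one example of regularization, here is a minimal sketch using the regularized horseshoe prior available in **rstanarm**. It assumes a hypothetical data frame `d` whose columns are an outcome `y` and many predictors:

```r
library(rstanarm)

# The horseshoe prior shrinks most coefficients strongly toward zero while
# leaving a few large ones relatively unshrunk; `d` is a hypothetical data frame.
fit_hs <- stan_glm(y ~ ., data = d,
                   prior = hs(),    # regularized horseshoe prior on the coefficients
                   refresh = 0)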
## Multilevel or hierarchical models
* It can often make sense to allow regression coefficients to vary by group, which can be done by simply including the group as a factor in the model:
```r
stan_glm(y ~ x + factor(state) + x:factor(state), ...)
```
* When the number of observations per group is small, consider instead multilevel regression, a method of partially pooling the varying coefficients (a brief sketch follows this list). See, for example, [Bayes Rules](https://www.bayesrulesbook.com/) or [Bayesian Data Analysis](http://www.stat.columbia.edu/~gelman/book/). There is also supposed to be a second volume to follow this one, [Applied Regression and Multilevel Models](http://www.stat.columbia.edu/~gelman/armm/), but it seems to be stalled.
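A minimal sketch of the partial-pooling alternative, using `stan_glmer` from **rstanarm** and a hypothetical data frame `d` with columns `y`, `x`, and `state`:

```r
library(rstanarm)

# Varying intercepts and slopes by state, partially pooled toward
# the population-level intercept and slope.
fit_mlm <- stan_glmer(y ~ x + (1 + x | state), data = d, refresh = 0)
```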
## Nonlinear models
* Often nonlinear models cannot be expressed in terms of linear predictors.
* Stan can fit these models!
* See the 'Golf' demo in [tidy-ros](https://github.com/behrman/ros); a small sketch of a nonlinear fit written directly in Stan follows this list.
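For instance, a model like $y = a e^{-bx} + \epsilon$ has no linear-predictor form, but it can be written directly in Stan. This is a minimal sketch with simulated data (not the golf model from the demo):

```r
library(rstan)

# An exponential-decay model that cannot be expressed as a linear predictor.
nonlinear_code <- "
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real<lower=0> a;
  real<lower=0> b;
  real<lower=0> sigma;
}
model {
  a ~ normal(1, 1);
  b ~ normal(1, 1);
  y ~ normal(a * exp(-b * x), sigma);
}
"

set.seed(1)
x <- runif(100, 0, 5)
y <- 2 * exp(-0.8 * x) + rnorm(100, 0, 0.1)

fit_nl <- stan(model_code = nonlinear_code,
               data = list(N = length(x), x = x, y = y))
print(fit_nl)
```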
## Nonparametric regression and machine learning
![](images/machine_learn.jpg)
* *Nonparametric regression* - the regression curve is not constrained to follow any particular parametric form.
* *Machine learning* describes nonparametric regression where the focus is more on prediction than on parameter estimation. Performance is often assessed on held-out 'test' data.
* To avoid overfitting, nonparametric models use a variety of techniques to constrain the fit. Tuning parameters (hyperparameters) govern the amount of constraint and are typically optimized using cross-validation.
* Some examples of nonparametric models:
    * *Loess* - locally weighted regression, tuned by the strength of the weight function (see the sketch after this list).
    * *Splines* - nonlinear basis functions; tuning controls the 'local smoothness'.
    * *Gaussian processes* - multivariate Gaussian models; tuning controls the correlation distance.
    * *Tree models* - decision trees are very powerful nonparametric models, especially gradient-boosted trees (e.g. [XGBoost](https://xgboost.readthedocs.io/en/stable/)).
    * *BART* - Bayesian additive regression trees, which place priors over the tree structures and leaf values. For more, ROS recommends [Bayesian Additive Regression Trees: A Review and Look Forward](https://www.annualreviews.org/doi/abs/10.1146/annurev-statistics-031219-041110).
* Many of these methods are covered, or at least introduced, in [Introduction to Statistical Learning](https://www.statlearning.com/).
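To make the tuning-parameter idea concrete, here is a minimal sketch with simulated data showing how the `span` argument controls the strength of the local weighting in `loess`:

```r
# Small span = very local fits (flexible, risks overfitting);
# large span = heavy smoothing (risks underfitting).
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, 0, 0.3)

fit_wiggly <- loess(y ~ x, span = 0.1)
fit_smooth <- loess(y ~ x, span = 0.9)

plot(x, y, col = "grey")
lines(x, predict(fit_wiggly), col = "red")   # flexible fit
lines(x, predict(fit_smooth), col = "blue")  # smooth fit
```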
## Machine learning meta-algorithms {-}
* *Meta-algorithms* - build flexible models from small, interchangeable components:
    * Ensemble learning - average over multiple models.
    * Deep learning - combine simple (differentiable) models into larger, more flexible models.
    * Genetic algorithms - "evolve" models.
* Linear and logistic regression make *strong* assumptions that allow us to *summarize* small data sets. Bayesian inference can allow even stronger statements about parameter uncertainty.
* Machine learning meta-algorithms supply very little structure ('inductive bias') and provide maximum flexibility for problems where *enough data* is available to support it.
## Computational efficiency in Stan
* Stan uses "Hamiltonian Monte Carlo" by default, which simulates a Markov chain that moves through parameter space.
* This is an *iterative* and *stochastic* process.
* By default it produces 4 parallel chains of 1,000 draws each, for 4,000 total draws.
* Diagnostics can help evaluate the simulation:
    * R-hat - compares the different chains; if it is not near 1, the chains have not fully mixed.
    * n_eff - the effective number of samples (draws are correlated due to the iterative nature of the simulation). Usually n_eff > 400 is sufficient.
    * mcse - "Monte Carlo standard error" - the additional uncertainty due to the stochastic algorithm; negligible in all examples in this book.
* With larger and more complex data sets and/or more predictors, computation speed can be a limiting factor. Some options (a sketch follows this list):
    * Parallel processing - rstan can take advantage of multiple processors if they are available: `options(mc.cores = parallel::detectCores())`.
    * Mode-based approximations - `stan_glm` can be made as fast as `glm`, while retaining the advantages of Bayesian inference, by approximating the full Bayesian calculation. One method ("optimizing") uses a normal approximation centered at the posterior mode.
    * See the 'Scalability' demo in [tidy-ros](https://github.com/behrman/ros).
    * Other algorithms ('variational inference') are available but are beyond the scope of this book.
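A minimal sketch of these speed options and diagnostics, assuming a hypothetical data frame `d` with columns `y` and `x`:

```r
library(rstanarm)

# Run the chains on multiple cores if available
options(mc.cores = parallel::detectCores())

# Full Bayesian inference via MCMC
fit_full <- stan_glm(y ~ x, data = d, refresh = 0)
summary(fit_full)   # reports Rhat, n_eff, and Monte Carlo standard errors

# Fast mode-based approximation: normal approximation at the posterior mode
fit_fast <- stan_glm(y ~ x, data = d, algorithm = "optimizing")
```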
## The End
![The End](images/you-did-it.jpg)
## Meeting Videos
### Cohort 2
`r knitr::include_url("https://www.youtube.com/embed/xQeCaHIdzPo")`
<details>
<summary> Meeting chat log </summary>
```
00:06:26 Ron Legere: start
00:18:40 Ron Legere: https://www.amazon.com/Doing-Bayesian-Data-Analysis-Tutorial/dp/0124058884
00:37:09 Ron Legere: https://www.nature.com/articles/nmeth.4642
00:58:58 Ron Legere: end
00:59:38 Ron Legere: https://billpetti.github.io/baseballr/
01:00:12 Ron Legere: https://www.routledge.com/Analyzing-Baseball-Data-with-R-Second-Edition/Marchi-Albert-Marchi-Albert-Baumer/p/book/9780815353515?utm_source=cjaffiliates&utm_medium=affiliates&cjevent=9ae3373e1c3711ee812100090a1eba23
```
</details>