Overfitting, Regularization and Information Criteria
========================================================
author: Raphael Peer
date: 21-3-2018
autosize: true
Schedule for today's talk
========================================================
- The problem of overfitting and model selection
- Information theory and information criteria
- Regularization
- Exercises
Part I: Overfitting and model selection
========================================================
Example: body mass vs. brain volume in hominin species
========================================================
![alt text](figures/overfitting_images/overfitting.png)
In-sample vs. out-of-sample error
========================================================
![alt text](figures/overfitting_images/in_out_sample.png)
(image from https://thebayesianobserver.wordpress.com/tag/sample-complexity/)
Effect of dropping one data point
========================================================
![alt text](figures/overfitting_images/dropping_one_datapoint.png)
What is a good model?
========================================================
- Every model contains uncertainty
- A good model has less uncertainty than a bad model
- But how can we measure uncertainty...?
Part II: Information theory and information criteria
========================================================
Information Entropy: A measure for uncertainty
========================================================
![alt text](figures/overfitting_images/H.png)
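In symbols (as shown in the figure above), the Shannon entropy of a probability distribution p is $H(p) = -\sum_{i} p_i \log(p_i)$.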
Entropy H increases with the number of possible outcomes
========================================================
```{r}
p1 <- c( 0.3 , 0.7 )       # two possible outcomes
cat('H1 = ', -sum( p1*log(p1) ))
p2 <- c( 0.2 , 0.2, 0.6 )  # three possible outcomes
cat('H2 = ', -sum( p2*log(p2) ))
```
Entropy H is larger the more uniform the probabilities are
========================================================
```{r}
p1 <- c( 0.1 , 0.9 )  # very uneven probabilities
cat('H1 = ', -sum( p1*log(p1) ))
p2 <- c( 0.5 , 0.5 )  # uniform probabilities
cat('H2 = ', -sum( p2*log(p2) ))
```
How can Entropy help us with model selection?
========================================================
- Assume we knew the real probabilities of events (which, of course, we don't).
- How much does the uncertainty increase if we use estimated probabilities (a model) instead?
- For a good model, this divergence should be as small as possible.
Kullback-Leibler Divergence
========================================================
![alt text](figures/overfitting_images/DKL.png)
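In symbols (as in the figure above): $D_{KL}(p, q) = \sum_{i} p_i \left( \log(p_i) - \log(q_i) \right)$, i.e. the additional uncertainty induced by using q to describe events that actually follow p.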
Kullback-Leibler Divergence: Example
========================================================
Real probabilities of rain / no rain in Salzburg: p = (0.5, 0.5)
Estimated probabilities of rain / no rain in Salzburg: q = (0.3, 0.7)
```{r}
p <- c(0.5,0.5) # real probabilities of rain / no rain in Salzburg
q <- c(0.3,0.7) # estimated probabilities
cat('DKL = ', sum( p*log(p/q) )) # D_KL(p, q) = sum_i p_i * (log(p_i) - log(q_i))
```
Kullback-Leibler Divergence
========================================================
Because the real probabilities p are unknown, the KL divergence from them cannot be computed directly.
However, to select the best model we only need to know which candidate is closest to p, not how close: the term that depends only on p is the same for every model and cancels when comparing.
Deviance
========================================================
![alt text](figures/overfitting_images/deviance.png)
Deviance is still just a measure of goodness of fit: it will always decrease as parameters are added.
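As a minimal sketch of the idea (simulated data; none of these variable names appear elsewhere in the talk), deviance is minus twice the summed log-likelihood of the observed data under the fitted model:
```{r}
# Illustrative only: deviance of a simple Gaussian linear model
set.seed(42)
x <- rnorm(20)
y <- 2 + 3*x + rnorm(20)
fit <- lm(y ~ x)
sigma_hat <- sqrt(mean(residuals(fit)^2))  # ML estimate of the residual sd
loglik <- sum(dnorm(y, mean=fitted(fit), sd=sigma_hat, log=TRUE))
cat('Deviance = ', -2*loglik)
```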
Akaike Information Criterion
========================================================
![alt text](figures/overfitting_images/AIK.png)
Uses deviance (goodness of fit) but introduces a penalty for the number of parameters.
Note: no priors are used for the AIC.
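Continuing the illustrative sketch from the deviance slide, the AIC simply adds twice the number of free parameters to the training deviance:
```{r}
# Illustrative only: AIC = deviance + 2 * number of free parameters
k_params <- 3  # intercept, slope and sigma in the sketch model above
cat('AIC = ', -2*loglik + 2*k_params)
```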
Deviance Information Criterion (DIC):
========================================================
Like the AIC, but based on the posterior distribution of the deviance, so it takes priors into account.
Widely Applicable Information Criterion (WAIC):
========================================================
For comparison: AIC
![alt text](figures/overfitting_images/aik2.png)
In short, start from the AIC and:
- replace the log-likelihood with the average log-likelihood of each training data point
- replace the number of parameters k with a measure of the "effective number of parameters"
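A usage sketch (not evaluated here): the rethinking package provides a WAIC() function that takes a fitted model object, such as the models fit in Part IV below.
```{r, eval=FALSE}
# Not run here: WAIC of a fitted rethinking model object
WAIC(m1)
```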
Part III: Regularization
========================================================
Concept: certain parameter values are deemed more likely than others a priori.
Example: Gaussian Priors
========================================================
![alt text](figures/overfitting_images/priors.png)
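A quick sketch of the idea (the standard deviations below are arbitrary choices for illustration, not the ones in the figure): narrower Gaussian priors concentrate more probability mass near zero and therefore regularize more strongly.
```{r}
# Narrower priors are more sceptical of large parameter values
curve(dnorm(x, 0, 0.2), from=-3, to=3, xlab='parameter value', ylab='prior density')
curve(dnorm(x, 0, 0.5), add=TRUE, lty=2)
curve(dnorm(x, 0, 1.0), add=TRUE, lty=3)
```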
Experiment with simulated data
========================================================
Fit models with different numbers of parameters
![alt text](figures/overfitting_images/prior_line.png)
Observations
========================================================
- Priors have a stronger effect for small sample sizes
- Priors reduce the goodness of fit on the training data
- Priors reduce overfitting but may lead to underfitting
- Weak priors: not effective in preventing overfitting
- Strong priors: may lead to underfitting
Part IV: Exercises
========================================================
6E1
========================================================
State the three motivating criteria that define information entropy. Try to express each in your
own words.
- Should be continuous: no jumps in entropy due to slight changes in the probabilities
- Should increase with the number of possible outcomes
- Should be larger for more uniform probabilities and smaller if the probabilities are wildly different
6E2
========================================================
Suppose a coin is weighted such that, when it is tossed and lands on a table, it comes up heads
70% of the time. What is the entropy of this coin?
```{r}
p <- c( 0.3 , 0.7 )
cat('H = ', -sum( p*log(p )))
```
6E3
========================================================
Suppose a four-sided die is loaded such that, when tossed onto a table, it shows “1” 20% of the
time, “2” 25%, “3” 25%, and “4” 30%. What is the entropy of this die?
```{r}
p <- c( 0.2, 0.25, 0.25, 0.3 )
cat('H = ', -sum( p*log(p )))
```
6E4
========================================================
Suppose another four-sided die is loaded such that it never shows “4”. The other three sides
show equally often. What is the entropy of this die?
```{r}
p <- c( 1/3, 1/3, 1/3 )
cat('H = ', -sum( p*log(p )))
```
6M2
========================================================
Explain the difference between model selection and model averaging. What information is lost
under model selection? What information is lost under model averaging?
- Model selection:
The best model is chosen (according to some criterion). We lose the information about how much better it was than the other models.
- Model averaging:
Predictions are averaged over multiple, presumably good, models. If the errors of these models are not completely correlated, prediction accuracy increases.
6M3
========================================================
When comparing models with an information criterion, why must all models be fit to exactly
the same observations? What would happen to the information criterion values, if the models were
fit to different numbers of observations?
Information criteria use the deviance D to compare models.
D is a sum over all data points.
Hence, for a fair comparison, the number of data points has to be equal.
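A tiny numeric sketch of this point (values are illustrative only): with the same per-point fit, doubling the number of observations roughly doubles the deviance, so raw criterion values are not comparable across different sample sizes.
```{r}
# Deviance is a sum over data points, so it grows with the sample size
set.seed(1)
ll <- dnorm(rnorm(100), 0, 1, log=TRUE)                  # log-likelihood of 100 points
cat('Deviance on 100 points:', -2*sum(ll), '\n')
cat('Deviance on 200 points:', -2*sum(c(ll, ll)), '\n')  # same points duplicated
```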
Exercises on the Howell !Kung demography data
========================================================
```{r}
library(rethinking)
data(Howell1)
k <- Howell1
k$age <- (k$age - mean(k$age))/sd(k$age)  # standardize age
set.seed( 1000 )
i <- sample(1:nrow(k),size=nrow(k)/2)     # random half of the row indices
k1 <- k[ i , ]                            # training half
k2 <- k[ -i , ]                           # held-out half
```
Fit models
========================================================
Use the same prior for all parameters
```{r}
prior_b <- 0   # prior mean for the slope coefficients; with a reasonably large
               # standard deviation it hardly matters whether this is 0, 1 or
               # any other small number
prior_sd <- 10 # prior standard deviation: 10 is fairly strong, 1000 would be very weak
```
First order
========================================================
```{r}
m1 <- map(
alist(
height ~ dnorm( mu , sigma ) ,
mu <- a + b1*weight ,
a ~ dnorm( 156 , prior_sd ) ,
b1 ~ dnorm( prior_b , prior_sd ) ,
sigma ~ dunif( 0 , prior_sd )
),
data=k1
)
m1
```
Second order
========================================================
```{r}
m2 <- map(
alist(
height ~ dnorm( mu , sigma ) ,
mu <- a + b1*weight + b2*I(weight^2),
a ~ dnorm( 156 , prior_sd ) ,
b1 ~ dnorm( prior_b , prior_sd ) ,
b2 ~ dnorm( prior_b, prior_sd ) ,
sigma ~ dunif( 0 , prior_sd )
),
data=k1
)
m2
```
Third order
========================================================
```{r, echo=F}
m3 <- map(
alist(
height ~ dnorm( mu , sigma ) ,
mu <- a + b1*weight + b2*I(weight^2) + b3*I(weight^3),
a ~ dnorm( 156 , prior_sd ) ,
b1 ~ dnorm( prior_b , prior_sd ) ,
b2 ~ dnorm( prior_b , prior_sd ) ,
b3 ~ dnorm( prior_b , prior_sd ) ,
sigma ~ dunif( 0 , prior_sd )
),
data=k1
)
m3
```
Fourth order
========================================================
```{r, echo=F}
m4 <- map(
alist(
height ~ dnorm( mu , sigma ) ,
mu <- a + b1*weight + b2*I(weight^2) + b3*I(weight^3) + b4*I(weight^4),
a ~ dnorm( 156 , prior_sd ) ,
b1 ~ dnorm( prior_b , prior_sd ) ,
b2 ~ dnorm( prior_b , prior_sd ) ,
b3 ~ dnorm( prior_b , prior_sd ) ,
b4 ~ dnorm( prior_b , prior_sd ) ,
sigma ~ dunif( 0 , prior_sd )
),
data=k1
)
m4
```
Fitted models
========================================================
```{r, echo=F}
require(polynom)  # for building and plotting polynomial curves
plot(k$weight, k$height, xlab='weight', ylab='height')
# first-order model: straight line
abline( a=coef(m1)["a"] , b=coef(m1)["b1"], lwd=2)
# higher-order models: polynomials built from the fitted coefficients
p2 <- polynomial(c(coef(m2)["a"], coef(m2)["b1"], coef(m2)["b2"] ))
lines(p2, col='blue', lwd=2)
p3 <- polynomial(c(coef(m3)["a"], coef(m3)["b1"], coef(m3)["b2"], coef(m3)["b3"] ))
lines(p3, col='green', lwd=2)
p4 <- polynomial(c(coef(m4)["a"], coef(m4)["b1"], coef(m4)["b2"], coef(m4)["b3"], coef(m4)["b4"] ))
lines(p4, col='red', lwd=2)
```
Compare models with WAIC
========================================================
```{r}
compare(m1, m2, m3, m4, WAIC=TRUE)
```
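As a follow-up sketch (assuming the chunk above runs), the comparison can also be visualized; compare() objects in the rethinking package have a plot method.
```{r}
# Plot the WAIC comparison of the fitted models
plot(compare(m1, m2, m3, m4))
```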