-
Notifications
You must be signed in to change notification settings - Fork 23
/
4_1_ST-eval.tex
1814 lines (1087 loc) · 46.1 KB
/
4_1_ST-eval.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
%\documentclass[mathserif]{beamer}
\documentclass[handout]{beamer}
%\usetheme{Goettingen}
\usetheme{Warsaw}
%\usetheme{Singapore}
%\usetheme{Frankfurt}
%\usetheme{Copenhagen}
%\usetheme{Szeged}
%\usetheme{Montpellier}
%\usetheme{CambridgeUS}
%\usecolortheme{}
%\setbeamercovered{transparent}
\usepackage[english, activeacute]{babel}
\usepackage[utf8]{inputenc}
\usepackage{amsmath, amssymb}
\usepackage{dsfont}
\usepackage{graphics}
\usepackage{cases}
\usepackage{graphicx}
\usepackage{pgf}
\usepackage{epsfig}
\usepackage{amssymb}
\usepackage{multirow}
\usepackage{amstext}
\usepackage[ruled,vlined,lined]{algorithm2e}
\usepackage{amsmath}
\usepackage{epic}
\usepackage{epsfig}
\usepackage{fontenc}
\usepackage{framed,color}
\usepackage{palatino, url, multicol}
\usepackage{listings}
%\algsetup{indent=2em}
\vspace{-0.5cm}
\title{Model Evaluation and Information Criteria}
\vspace{-0.5cm}
\author[Felipe Bravo Márquez]{\footnotesize
%\author{\footnotesize
\textcolor[rgb]{0.00,0.00,1.00}{Felipe José Bravo Márquez}}
\date{ \today }
\begin{document}
\begin{frame}
\titlepage
\end{frame}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{frame}{Model Evaluation and Information Criteria}
\scriptsize{
\begin{itemize}
\item In this class we will introduce various concepts for evaluating statistical models.
\item According to \cite{mcelreath2020statistical}, there are two fundamental kinds of statistical error:
\begin{itemize}\scriptsize{
\item \textbf{Overfitting}: models that learn too much from the data leading to poor prediction.
\item \textbf{Underfitting}: models that learn too little from the data, which also leads to poor prediction.}
\end{itemize}
\item We will study the following approaches to tackle these problems.
\begin{itemize}\scriptsize{
\item \textbf{Regularization}: a mechanism to tell our models not to get too excited by the data.
\item \textbf{Information criteria} and \textbf{Cross-validation}: scoring devices to estimate predictive accuracy of our models.}
\end{itemize}
\item In order to introduce information criteria, this class must also introduce some concepts of \textbf{information theory}.
\end{itemize}
}
\end{frame}
\begin{frame}{The problem with parameters}
\scriptsize{
\begin{itemize}
\item In the class of linear regression we learned that including more attributes can lead to a more accurate model.
\item However, we also learned that adding more variables almost always improves the fit of the model to the data, as measured by the coefficient of determination $R^2$.
\item This is true even when the variables we add to a model have no relation to the outcome.
\item So it's no good to choose among models using only fit to the data.
\end{itemize}
}
\end{frame}
\begin{frame}{The problem with parameters}
\scriptsize{
\begin{itemize}
\item While more complex models fit the data better, they often predict new data worse.
\item This means that a complex model will be very sensitive to the exact sample used to fit it.
\item This will lead to potentially large mistakes when future data is not exactly like the past data.
\item But simple models, with too few parameters, tend instead to underfit, systematically over-predicting or under-predicting the data.
\item Regardless of how well future data resemble past data.
\item So we can't always favor either simple models or complex models.
\item Let’s examine both of these issues in the context of a simple data example.
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{The problem with parameters}
\scriptsize{
\begin{itemize}
\item We are going to create a data.frame containing average brain volumes and body masses for seven hominin species.
\begin{verbatim}
sppnames <- c( "afarensis","africanus","habilis",
"boisei", "rudolfensis","ergaster",
"sapiens")
brainvolcc <- c( 438 , 452 , 612, 521, 752, 871,
1350 )
masskg <- c( 37.0 , 35.5 , 34.5 , 41.5 , 55.5 ,
61.0 , 53.5 )
d <- data.frame( species=sppnames , brain=brainvolcc,
mass=masskg )
\end{verbatim}
\item It's not unusual for data like this to be highly correlated.
\item Brain size is correlated with body size, across
species.
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{The problem with parameters}
\scriptsize{
\begin{figure}[h!]
\centering
\includegraphics[scale=0.41]{pics/hominin.png}
\end{figure}
}
\end{frame}
\begin{frame}[fragile]{The problem with parameters}
\scriptsize{
\begin{itemize}
\item We will model brain size as a function of body size.
\item We will fit a series of increasingly complex model families and see which function fits the data best.
\item Each of these models will just be a polynomial of higher degree.
\begin{verbatim}
reg.ev.1 <- lm( brain ~ mass , data=d )
reg.ev.2 <- lm( brain ~ mass + I(mass^2)
, data=d )
reg.ev.3 <- lm( brain ~ mass + I(mass^2)
+ I(mass^3),data=d )
reg.ev.4 <- lm( brain ~ mass + I(mass^2)
+ I(mass^3) + I(mass^4),data=d )
reg.ev.5 <- lm( brain ~ mass + I(mass^2)
+ I(mass^3) + I(mass^4)
+ I(mass^5),data=d )
reg.ev.6 <- lm( brain ~ mass + I(mass^2)
+ I(mass^3) + I(mass^4)+
I(mass^5)+ I(mass^6),data=d )
\end{verbatim}
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{The problem with parameters}
\scriptsize{
\begin{itemize}
\item Let's calculate $R^2$ for each of these models:
\begin{verbatim}
> summary(reg.ev.1)$r.squared
[1] 0.490158
> summary(reg.ev.2)$r.squared
[1] 0.5359967
> summary(reg.ev.3)$r.squared
[1] 0.6797736
> summary(reg.ev.4)$r.squared
[1] 0.8144339
> summary(reg.ev.5)$r.squared
[1] 0.988854
> summary(reg.ev.6)$r.squared
[1] 1
\end{verbatim}
\item As the degree of the polynomial defining the mean increases, the fit always improves.
\item The sixth-degree polynomial actually has a perfect fit, $R ^2 = 1$.
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{The problem with parameters}
\scriptsize{
\begin{figure}[h!]
\centering
\includegraphics[scale=0.41]{pics/polyover.png}
\end{figure}
Polynomial linear models of increasing degree, fit to the hominin data. Each plot shows the predicted mean in black, with 89\% interval of the mean shaded. $R^2$, is displayed above each plot. (a) First-degree polynomial. (b) Second-degree. (c) Third-degree. (d) Fourth-degree. (e) Fifth-degree. (f) Sixth-degree. Source: \cite{mcelreath2020statistical}.
}
\end{frame}
\begin{frame}{The problem with parameters}
\scriptsize{
\begin{itemize}
\item We can see from looking at the paths of the predicted means that the higher-degree polynomials are increasingly absurd.
\item For example, \textbf{reg.ev.6} the most complex model makes a perfect fit, but the model is ridiculous.
\item Notice that there is a gap in the body mass data, because there are no fossil
hominins with body mass between 55 kg and about 60 kg.
\item In this region, the models has nothing to predict, so it pays no price for swinging around wildly in this interval.
\item The swing is so extreme that at around 58 kg, the model predicts a negative brain size!
\item The model pays no price (yet) for this absurdity, because there are no cases in the data with body mass near 58 kg.
\end{itemize}
}
\end{frame}
\begin{frame}{The problem with parameters}
\scriptsize{
\begin{itemize}
\item Why does the sixth-degree polynomial fit perfectly?
\item Because it has enough parameters to assign one to each point of data.
\item The model's equation for the mean has 7 parameters:
\begin{displaymath}
y_i=\beta_{0}+\beta_{1}x_i +\beta_{2}x_i^2 +\beta_{3}x_i^3 +\beta_{4}x_i^4 +\beta_{5}x_i^5+\beta_{6}x_i^6+\epsilon_i \quad \forall i
\end{displaymath}
and there are 7 species to predict brain sizes for.
\item So effectively, this model assigns a unique parameter to reiterate each observed brain size.
\item This is a general phenomenon: If you adopt a model family with enough parameters, you can fit the data exactly.
\item But such a model will make rather absurd predictions for yet-to-be-observed cases.
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{Too few parameters hurts, too}
\scriptsize{
\begin{itemize}
\item The overfit polynomial models manage to fit the data extremely well.
\item But they suffer for this within-sample accuracy by making nonsensical out-of-sample predictions.
\item In contrast, underfitting produces models that are inaccurate both
within and out of sample.
\item For example, consider this model of brain volume:
\begin{displaymath}
y_i=\beta_{0}+\epsilon_i \quad \forall i
\end{displaymath}
\item There are no predictor variables here, just the intercept $\beta_{0}$.
\item We can fit this model as follows:
\begin{verbatim}
> reg.ev.0 <- lm( brain ~ 1 , data=d )
> summary(reg.ev.0)$r.squared
[1] 0
\end{verbatim}
\item The value of $R^2$ is 0.
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{Too few parameters hurts, too}
\scriptsize{
\begin{itemize}
\item This model estimates the mean brain volume, ignoring body mass.
\begin{figure}[h!]
\centering
\includegraphics[scale=0.41]{pics/regconstant.png}
\end{figure}
\item As a result, the regression line is perfectly horizontal and poorly fits both smaller and larger brain volumes.
\item Such a model not only fails to describe the sample.
\item It would also do a poor job for new data.
\end{itemize}
}
\end{frame}
\begin{frame}{Roadmap}
\scriptsize{
\begin{itemize}
\item The first thing we need to navigate between overfitting and underfitting problems is a criterion of model performance.
\item We will see how \textbf{information theory} provides a useful criterion for model evaluation: the \textbf{out-of-sample deviance}.
\item Once we learn about this criterion we will see how \textbf{cross-validation} and \textbf{information criteria} can help to estimate the out-of-sample deviance of a model.
\item We will also learn how to use \textbf{regularization} to improve the out-of-sample deviance of a model.
\item As usual, we will start with an example.
\end{itemize}
}
\end{frame}
\begin{frame}{The Weatherperson}
\scriptsize{
\begin{itemize}
\item Suppose in a certain city, a certain weatherperson issues uncertain predictions for rain or shine on each day of the year.
\item The predictions are in the form of probabilities of rain.
\item The currently employed weatherperson predicted these chances of rain over a 10-day sequence:
\begin{figure}[h!]
\centering
\includegraphics[scale=0.41]{pics/weather1.png}
\end{figure}
\item A newcomer rolls into town, and this newcomer boasts that he can best the current weatherperson, by always predicting sunshine.
\item Over the same 10 day period, the newcomer's record would be:
\begin{figure}[h!]
\centering
\includegraphics[scale=0.41]{pics/weather2.png}
\end{figure}
\end{itemize}
}
\end{frame}
\begin{frame}{The Weatherperson}
\scriptsize{
\begin{itemize}
\item Define \textbf{hit rate} as the average chance of a correct prediction.
\item So for the current weatherperson, she gets $3 \times 1 + 7 \times 0.4 = 5.8$ hits in 10 days, for a rate of $5.8/10 = 0.58$ correct predictions per day.
\item In contrast, the newcomer gets $3 \times 0 + 7 \times 1 = 7$, for $7/10 = 0.7$ hits per day.
\item The newcomer wins.
\item Let's compare now the two predictions using another metric, the \textbf{joint likelihood}: $\prod f(y_i;\theta)$ for a frequentist model or $\prod f(y_i|\theta)$ for a Bayesian one.
\item The joint likelihood corresponds to the joint probability of correctly predicting the observed sequence.
\item To calculate it, we must first compute the probability of a correct prediction for each day.
\item Then multiply all of these probabilities together to get the joint probability of correctly predicting the observed sequence.
\end{itemize}
}
\end{frame}
\begin{frame}{The Weatherperson}
\scriptsize{
\begin{itemize}
\item The probability for the current weather person is $1^3 \times 0.4^7 \approx 0.005$.
\item For the newcomer, it's $0^3 \times 1^7 = 0$.
\item So the newcomer has zero
probability of getting the sequence correct.
\item This is because the newcomer's predictions never expect rain.
\item So even though the newcomer has a high average probability of being correct
(hit rate), he has a terrible joint probability (likelihood) of being correct.
\item And the joint likelihood is the measure we want.
\item Because it is the unique measure that correctly counts up the relative number of ways each event (sequence of rain and shine) could happen.
\item In the statistics literature, this measure is sometimes called the
\textbf{log scoring rule}, because typically we compute the logarithm of the joint probability and report that.
\end{itemize}
}
\end{frame}
\begin{frame}{Information and uncertainty}
\scriptsize{
\begin{itemize}
\item So we want to use the log probability of the data to
score the accuracy of competing models.
\item The next problem is how to measure distance from
perfect prediction.
\item A perfect prediction would just report the true probabilities of rain on each day.
\item So when either weatherperson provides a prediction that differs from the target, we can measure the distance of the prediction from the target.
\end{itemize}
}
\end{frame}
\begin{frame}{Information and uncertainty}
\scriptsize{
\begin{itemize}
\item What kind of distance should we adopt?
\item Getting to the answer depends upon appreciating what an accuracy metric needs to do.
\item It should appreciate that some targets are just easier to hit than other targets.
\item For example, suppose we extend the weather forecast into the winter. Now there are three types of days:
rain, sun, and snow.
\item Now there are three ways to be wrong, instead of just two.
\item This has to be reflected in any reasonable measure of distance from the target, because by adding another type of event, the target has gotten harder to hit.
\item Before presenting a distance metric that satisfies the properties described above, we must introduce some concepts from information theory.
\end{itemize}
}
\end{frame}
\begin{frame}{Information Theory}
\scriptsize{
\begin{itemize}
\item The field of information theory, with Claude Shannon as one of its pioneering figures, was originally applied to problems of message communication, such as the telegraph.
\begin{figure}[h!]
\centering
\includegraphics[scale=0.2]{pics/shannon.png}
\end{figure}
\vspace{3mm}
\item The basic insight is to ask: How can we quantify the uncertainty inherent in a probability distribution?
\end{itemize}
}
\end{frame}
\begin{frame}{Entropy}
\scriptsize{
\begin{itemize}
\item Information entropy is a function that measures the uncertainty on probability functions.
\item Let $X$ be a discrete random variable of $m$ different possible events, with a probaility mass function $f$, the entropy of $X$ is defnined as follows:
\begin{equation}
H(X) = - \mathbb{E}(\log f(x)) = -\sum_{i=1}^{m}f(x_i)\log f(x_i)
\end{equation}
\item Usually we use log base 2, in which case the entropy units are called \textbf{bits} \cite{pml1Book}.
\item If $X$ is continuous with density function $f$, the entropy $H(X)$ takes the following form:
\begin{equation}
H(X) = - \mathbb{E}(\log f(x)) = -\int f(x)\log f(x)dx
\end{equation}
\item In plainer words: ``The uncertainty contained in a probability distribution is the negative average log-probability of an event''.
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{Entropy}
\scriptsize{
\begin{itemize}
\item An example will help to demystify the function $H(X)$.
\item To compute the information entropy for the weather, suppose the true probabilities of rain ($X=1$) and shine ($X=2$) are $f(1) = \mathbb{P}(X=1)= 0.3$ and $f(2) = \mathbb{P}(X=2) = 0.7$, respectively.
\item Then:
\begin{displaymath}
H(X) = - (f(1)\log f(1)+\left(f(2)\log f(2)\right) \approx 0.88
\end{displaymath}
\item As an R calculation:
\begin{verbatim}
> f <- c( 0.3 , 0.7 )
> -sum( f*log2(f) )
[1] 0.8812909
\end{verbatim}
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{Entropy}
\scriptsize{
\begin{itemize}
\item Suppose instead we live in Abu Dhabi.
\item Then the probabilities of rain and shine might be more like $f(1) = 0.01$ and $f(2) = 0.99$.
\begin{verbatim}
> f <- c( 0.01 , 0.99 )
> -sum( f*log2(f) )
[1] 0.08079314
\end{verbatim}
\item Now the entropy is about 0.08.
\item Why has the uncertainty decreased?
\item Because in Abu Dhabi it hardly ever rains.
\item Therefore there's much less uncertainty about any given day, compared to a place in which it rains 30\% of the time.
\item These entropy values by themselves don't mean much to us, though.
\item Instead we can use them to build a measure of accuracy. That comes next.
\end{itemize}
}
\end{frame}
\begin{frame}{Divergence}
\scriptsize{
\begin{itemize}
\item How can we use information entropy to say how far a model is from the target?
\item The key lies in divergence.
\item Divergence: The additional uncertainty induced by using probabilities from
one distribution to describe another distribution.
\item This is often known as \textbf{Kullback-Leibler divergence} or simply K-L divergence, named after the people who introduced it for this purpose.
\item Suppose for example that the true distribution of events is encoded by function $f$: $f(1) = 0.3$, $f(2) = 0.7$.
\item Now, suppose we believe that these events happen according to another function $q$: $q(1) = 0.25$, $q(2) = 0.75$.
\item How much additional uncertainty have we introduced, as a consequence of using $q$ to approximate $f$?
\item The answer is the the K-L divergence $D_{KL}(f,q)$.
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{Divergence}
\scriptsize{
\begin{itemize}
\item If $f$ and $q$ are probability mass functions, the K-L divergence is defined as follows:
\begin{equation}
D_{KL}(f,q) = \sum_{i=1}^m f(x_i)(\log f(x_i)-\log q(x_i)) = \sum_{i=1}^m f(x_i)\log \left(\frac{f(x_i)}{q(x_i)}\right)
\end{equation}
\begin{verbatim}
> f<-c(0.3,0.7)
> q<-c(0.25,0.75)
> sum(f* log2(f/ q))
[1] 0.00923535
\end{verbatim}
\item This naturally extends to continuous density functions as well \cite{pml1Book}:
\begin{equation}
D_{KL}(f,q) = \int f(x)\log \left(\frac{f(x)}{q(x)}\right)dx
\end{equation}
\item In plainer language, the divergence is the average difference in log probability between the target (f) and model (q).
\item This divergence is just the difference between two entropies.
\item The entropy of the target distribution $f$ and the cross entropy arising from using $q$ to predict $f$.
\end{itemize}
}
\end{frame}
\begin{frame}{Cross entropy and divergence}
\scriptsize{
\begin{itemize}
\item When we use a probability distribution $q$ to predict events from another
distribution $f$, this defines something known as cross entropy $H(f,q)$:
\begin{equation}
H(f,q) = - \sum_{i=1}^{m}f(x_i)\log q(x_i)
\end{equation}
\item The notion is that events arise according to $f$, but they are expected according to the $q$, so the entropy is inflated, depending upon how different $f$ and $q$ are.
\item Divergence is defined as the additional entropy induced by using $q$.
\item So it is the difference between $H(f)$, the actual entropy of events, and $H(f, q)$:
\begin{equation}
D_{KL}(f,q) = H(f,q)-H(f)
\end{equation}
\item So divergence really is measuring how far q is from the target $f$, in units of entropy.
\item Notice that which is the target matters: $H(f, q)$ does not in general equal $H(q, f)$.
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{Divergence}
\scriptsize{
\begin{itemize}
\item When $f = q$, we know the actual probabilities of the events and the divergence becomes zero:
\begin{displaymath}
D_{KL}(f,q) = D_{KL}(f,f) = 0
\end{displaymath}
\begin{verbatim}
> q <- f
> sum(f*log2(f/ q))
[1] 0
\end{verbatim}
\item As q grows more different from $f$, the divergence $D_{KL}$ also grows.
\item Suppose the true target distribution is $f = \{0.3, 0.7\}$.
\item Suppose the approximating distribution q can be anything from $q = \{0.01, 0.99\}$ to $q = \{0.99, 0.01\}$.
\item Let's build a plot with $q(1)$ on the horizontal axis and the divergence $D_{KL}(f, q)$ on vertical one.
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{Divergence}
\scriptsize{
\begin{verbatim}
t <-
tibble(f_1 = .3,
f_2 = .7,
q_1 = seq(from = .01, to = .99, by = .01)) %>%
mutate(q_2 = 1 - q_1) %>%
mutate(d_kl = (f_1 * log2(f_1 / q_1)) + (f_2 * log2(f_2 / q_2)))
t %>%
ggplot(aes(x = q_1, y = d_kl)) +
geom_vline(xintercept = .3, linetype = 2) +
geom_line(size = 1.5) +
annotate(geom = "text", x = .4, y = 1.5, label = "q = f",
size = 3.5) +
labs(x = "q(1)",
y = "Divergence of q from f")
\end{verbatim}
}
\end{frame}
\begin{frame}{Divergence}
\scriptsize{
\begin{figure}[h!]
\centering
\includegraphics[scale=0.45]{pics/divergence.pdf}
\end{figure}
\begin{itemize}
\item Only exactly where $q = f$, at $q(1) = 0.3$, does the divergence achieve a value of zero. Everyplace else, it grows.
\item Since predictive models specify probabilities of events (observations), we can use divergence to compare the accuracy of models.
\end{itemize}
}
\end{frame}
\begin{frame}{Estimating divergence}
\scriptsize{
\begin{itemize}
\item To use $D_{KL}$ to compare models, it seems like we would have to know $f$, the target probability distribution.
\item In all of the examples so far, I've just assumed that $f$ is known.
\item But when we want to find a model $q$ that is the best approximation to $f$, the ``truth,'' there is usually no way to access $f$ directly.
\item We wouldn't be doing statistical inference, if we already knew $f$.
\item But there's an amazing way out of this predicament.
\item It helps that we are only interested in comparing the divergences of different candidates, say $q$ and $r$.
\item In that case, most of $f$ just subtracts out, because there is a $\mathbb{E}(\log f(x))$ term in the divergence of both q and r.
\item This term has no effect on the distance of q and r from one another.
\end{itemize}
}
\end{frame}
\begin{frame}{Estimating divergence}
\scriptsize{
\begin{itemize}
\item So while we don't know where $f$ is, we can estimate how far apart $q$ and $r$ are, and which is closer to the target.
\item All we need to know is a model's average log-probability: $\mathbb{E}(\log q(x))$ for q and $\mathbb{E}(\log r(x))$ for r.
\item To approximate the relative value of $\mathbb{E}(\log q(x))$, we can use the log-probability score of the model:
\begin{equation}
S(q) = \sum_{i=1}^m \log q(x_i)
\end{equation}
\item This sore is an estimate of $\mathbb{E}(\log q(x))$, just without the final step of dividing by the number of observations.
\item It is also an approximation of the K-L divergence of the model relative to the unknown target model.
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{Calculating Log-Likelihood}
\scriptsize{
\begin{itemize}
\item Most of the standard model fitting functions in R support ``logLik'' , which computes the sum of log-probabilities, usually known as the log-likelihood of the data.
\begin{verbatim}
> logLik(reg.ev.1)
'log Lik.' -47.46249 (df=3)
> logLik(reg.ev.2)
'log Lik.' -47.13276 (df=4)
\end{verbatim}
\item Recall that the absolute magnitude of these values are not interpretable.
\item Only the difference $S(q) - S(r)$ informs us about the divergence of each model from the unknown target $f$.
\item We can calculate the log-likelihood of a model by hand using the following code:
\begin{verbatim}
b0<-reg.ev.1$coefficients[1]
b1<-reg.ev.1$coefficients[2]
logLik1 <- sum(log(dnorm(
d$brain ,
mean=b0+b1*d$mass ,
sd=sigma(reg.ev.1))
))
> logLik1
[1] -47.64015
\end{verbatim}
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{Calculating Log-Likelihood}
\scriptsize{
\begin{itemize}
\item Note that this log-likelihood differs from the one obtained with \verb+logLik+.
\item This is because \verb+sigma(reg.ev.1)+ returns an unbiased estimator of $\sigma$: \begin{displaymath}
\hat{\sigma} = \sqrt{\frac{\sum \epsilon_i^2}{df}}
\end{displaymath}
where $df=n-2=7-2=5$.
\begin{verbatim}
> sigma(reg.ev.1)
[1] 252.0567
> reg.ev.1$df.residual
[1] 5
> sigma.LM <-sqrt(sum(reg.ev.1$residuals^2)/
+ reg.ev.1$df.residual)
> sigma.LM
[1] 252.0567
\end{verbatim}
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{Calculating Log-Likelihood}
\scriptsize{
\begin{itemize}
\item On the other hand, the function \verb+logLik+ uses the maximum likelihood estimator of $\sigma$:
\begin{displaymath}
\hat{\sigma}_{ML} = \sqrt{\frac{\sum \epsilon_i^2}{n}}
\end{displaymath}
\begin{verbatim}
sigma.ML<-sqrt(sum(reg.ev.1$residuals^2)
/dim(d)[1])
> sigma.ML
[1] 213.0268
logLik1.1 <- sum(log(dnorm(
d$brain ,
mean=b0+b1*d$mass ,
sd=sigma.ML)
))
> logLik1.1
[1] -47.46249
\end{verbatim}
\end{itemize}
}
\end{frame}
\begin{frame}[fragile]{Calculating Log-Likelihood}
\scriptsize{
\begin{itemize}
\item We can obtain the same results using MAP estimates from a Bayesian model with flat priors:
\begin{verbatim}
library(rethinking)
b.reg.ev.1<- quap(
alist(
brain ~ dnorm( mu , sigma ) ,
mu <- b0 + b1*mass
) ,
data=d ,