```{r include=FALSE}
source(here::here("R", "common.R"))
knitr::opts_chunk$set(fig.path = "figs/ch04/")
```
{{< include latex/latex-commands.qmd >}}
# Dimension Reduction {#sec-pca-biplot}
## _Flatland_ and _Spaceland_ {#sec-spaceland}
> It is high time that I should pass from these brief and discursive notes about Flatland to the central event of this book, my initiation into the mysteries of Space. THAT is my subject; all that has gone before is merely preface --- Edwin Abbott, _Flatland_, p. 57.
There was a cloud in the sky above _Flatland_ one day. But it was a huge, multidimensional cloud of sparkly points that might contain some important message, perhaps like the hidden EUREKA (@fig-pollen-eureka), or perhaps forecasting the upcoming harvest, if only Flatlanders could appreciate it.
A leading citizen, A SQUARE, who had traveled once to
Spaceland and therefore had an inkling of its majesty beyond the simple world of his life in the plane, looked at that cloud and had a brilliant thought, an OMG moment:
> "Oh, can I, in my imagination, rotate that cloud and squeeze its juice so that it rains down on Flatland with greatest joy?"
As it happened, although our Square friend could never really _see_ in three dimensions, he could now
at least _think_ of a world described by **height** as well as breadth and width, and think of the
**shadow** cast by a cloud as something mutable, changing size and shape depending on its orientation over Flatland.
And what a world it was, inhabited by
Pyramids, Cubes and wondrous creatures called Polyhedrons with many $C$orners, $F$aces and $E$dges.
Not only that, but all those Polyhedra were forced in Spaceland to obey a magic formula:
$C + F - E = 2$.[^1-euler] How cool was that!
[^1-euler]: This is Euler's [-@Euler:1758] formula, which states that any convex polyhedron must obey the formula $V + F - E = 2$ where $V$ is the number of vertexes (corners), $F$ is the number of faces and $E$ is the number of edges. For example, a tetrahedron
or pyramid has $(V, F, E) = (4, 4, 6)$ and a cube has $(V, F, E) = (8, 6, 12)$. Stated in words,
for all solid bodies confined by planes, the sum of the number of vertexes and the number of faces exceeds the number of edges by two.
Indeed, there were even exalted Spheres,
having so many faces that their surfaces became as smooth as a baby's bottom, with no need for pointed corners or edges, just as Circles were the smoothest occupants of his world, with far too many sides to count.
It was his dream of a Sphere passing through Flatland (@fig-flatland-spheres) that first awakened him to
a third dimension.
He also marveled at Ellipsoids, as smooth as Spheres, but in Spaceland having three natural axes of different extent
and capable of appearing fatter or slimmer when rotated to different views. An Ellipsoid
had magical properties: it could appear as so thin in one or more dimensions that it became a simple
2D ellipse, or a 1D line, or even a 0D point [@Friendly-etal:ellipses:2013].
<!-- **TODO**: somehow mention the `gellipsoid` package here. -->
All of these now arose in Square's richer 3D imagination.
And, all of this came from just one more dimension than his life in Flatland.
### Multivariate juicers
Up to now, we have also been living in Flatland. We have been trying to understand data in
**data space** of possibly many dimensions, but confined to the 2D plane of a graph window.
Scatterplot matrices and parallel coordinate plots provided some relief.
The former did so by **projecting** the data into sets of 2D views in the coordinates of data
space; the latter did so by providing multiple axes in a 2D space along which we could trace
the paths of individual observations.
This chapter is about seeing data in a different projection, a low-dimensional (usually 2D) space
that squeezes out the most juice
from multidimensional data for a particular purpose (@fig-MV-juicer), where what we want to
understand can be more easily seen.
```{r}
#| label: fig-MV-juicer
#| echo: false
#| fig-align: center
#| out-width: "90%"
#| fig-cap: "A multivariate juicer takes data from possibly high-dimensional data space and transforms it to a lower-dimenional space in which important effects can be more easily seen."
knitr::include_graphics("images/MV-juicer.png")
```
Here, I concentrate on **principal components analysis** (PCA), whose goal reflects A Square's desire to see that sparkly cloud of points
in $nD$ space in the plane showing the greatest variation (squeezing the most juice)
among all other possible views.
This appealed to his sense of geometry, but left him wondering how the variables in that
high-D cloud were related to the dimensions he could see in a best-fitting plane.
The idea of a **biplot**, showing the data points in the plane, together with thick pointed
arrows---variable vectors---in one view, is the other topic explained in this chapter (@sec-biplot). The biplot is the simplest example of a multivariate juicer. The essential idea is to project the cloud of data points in $n$ dimensions into the
2D space of principal components and simultaneously show how the original variables
relate to this space. For exploratory analysis to get an initial, incisive view of
a multivariate dataset, a biplot is often my first choice.
::: {.callout-note title="Looking ahead"}
I'm using the term _multivariate juicer_ here to refer to the wider class of **dimension reduction** techniques, used for various purposes in data analysis and visualization. PCA is the simplest example and illustrates the general ideas.
The key point is that these methods are designed to transform the data into a low-dimensional space for a particular goal or purpose. In PCA, the goal is to extract the greatest amount of total variability in the data. In the context of univariate multiple regression, the goal is often to reduce the number of predictors necessary to account for an outcome variable, called _feature extraction_ in the machine learning literature.
When the goal is to best distinguish among groups, **discriminant analysis** finds uncorrelated weighted sums of
predictors on which the means of groups are most widely separated in a reduced space of hopefully fewer dimensions.
The methods I cover in this book are all linear methods, but there is also a wide variety of non-linear dimension reduction techniques.
:::
**Packages**
In this chapter I use the following packages. Load them now:
```{r load-pkgs}
library(ggplot2)
library(dplyr)
library(tidyr)
library(patchwork)
library(ggbiplot)
library(FactoMineR)
library(factoextra)
library(car)
library(ggpubr)
```
## Principal components analysis (PCA) {#sec-pca}
When Francis Galton [-@Galton:1886] first discovered the idea of regression toward the mean
and presented his famous diagram (@fig-galton-corr), he had little thought that he had provided a
window to a higher-dimensional world, beyond what even A Square could imagine.
His friend, Karl Pearson [-@Pearson:1896] took that idea and developed it into a theory of
regression and a measure of correlation that would bear his name, Pearson's $r$.
But then Pearson [-@Pearson:1901] had a further inspiration, akin to that of A Square.
If he also had a cloud of sparkly points in $2, 3, 4, ..., p$ dimensions, could he find a
point ($0D$), or line ($1D$), or plane ($2D$), or even a hyperplane ($nD$) that best summarized ---
squeezed out the most juice---from multivariate data? This was the first truly multivariate
problem in the history of statistics [@FriendlyWainer:2021:TOGS, p. 186].
The best $0D$ point was easy--- it was simply the centroid, the means of each of the
variables in the data, $(\bar{x}_1, \bar{x}_2, ..., \bar{x}_p)$, because that was "closest"
to the data in the sense of minimizing the sum of squared differences, $\Sigma_i\Sigma_j (x_{ij} - \bar{x}_j)^2$.
In higher dimensions, his solution was also an application of the method of least squares, but he argued
it geometrically and visually as shown in @fig-Pearson1901.
```{r}
#| label: fig-Pearson1901
#| echo: false
#| fig-align: center
#| out-width: "70%"
#| fig-cap: "Karl Pearson's (1901) geometric, visual argument for finding the line or plane of closest fit to a collection of points, P1, P2, P3, ... "
knitr::include_graphics("images/Pearson1901.png")
```
For a $1D$ summary, the line of best fit to the points $P_1, P_2, \dots P_n$ is the line that goes through the
centroid and makes the average squared length of the _perpendicular_ segments from those points to the line as small
as possible. This differs from linear regression, for fitting $y$ from $x$,
where least squares minimizes the average squared length of the _vertical_ segments, $\Sigma_i (y_i - \hat{y}_i)^2$.
He went on to prove the visual insights from simple smoothing of @Galton:1886 (shown in @fig-galton-corr)
regarding the regression lines of
`y ~ x` and `x ~ y`. More importantly, he proved that the cloud of points is captured,
for the purpose of finding a best line, plane or hyperplane, by
the ellipsoid that encloses it, as seen in his diagram, @fig-Pearson1901-2. The major axis of the
2D ellipse is the line of best fit, along which the data points have the smallest average squared
distance from the line. The axis at right angles to that---the minor axis--- is labeled "line of worst fit"
with the largest average squared distance.
```{r}
#| label: fig-Pearson1901-2
#| echo: false
#| fig-align: center
#| out-width: "80%"
#| fig-cap: "Karl Pearson's diagram showing the elliptical geometry of regression and principal components analysis ... _Source_: Pearson (1901), p. 566."
knitr::include_graphics("images/Pearson1901_2.png")
```
Even more importantly--- and this is the basis for PCA --- he recognized that the two orthogonal axes of the ellipse gave new coordinates for the data which were uncorrelated, whatever the correlation of $x$ and $y$.
> Physically, the axes of the correlation type-ellipse are the directions of independent and uncorrelated variation. --- @Pearson:1901, p. 566.
It was but a small step to recognize that for two variables, $x$ and $y$:
* The line of best fit, the major axis (PC1) had the greatest variance of points projected onto it;
* the line of worst fit, the minor axis (PC2), had the least variance;
* these could be seen as a rotation of the data space of $(x, y)$ to a new space (PC1, PC2) with uncorrelated variables;
* the total variation of the points in data space, $\text{Var}(x) + \text{Var}(y)$, being unchanged by rotation, was equally well expressed as the total variation $\text{Var}(PC1) + \text{Var}(PC2)$ of the scores on what are
now called the principal component axes.
It would have appealed to Pearson (and also to A Square) to see these observations demonstrated in a 3D video. @fig-pca-animation
shows a 3D plot of the variables `Sepal.Length`, `Sepal.Width` and `Petal.Length` in Edgar Anderson's `iris`
data, with points colored by species and the 95% data ellipsoid. This is rotated smoothly by interpolation until the first two principal axes, PC1 and PC2, are aligned with the horizontal and vertical dimensions.
Because this is a rigid rotation of the cloud of points, the total variability is obviously unchanged.
<!-- ::: {#fig-pca-animation} -->
<!-- <div align="center"> -->
<!-- <iframe width="946" height="594" src="images/pca-animation1.gif"></iframe> -->
<!-- </div> -->
<!-- Animation of PCA as a rotation in 3D space. The plot shows three variables for the `iris` data, initially -->
<!-- in data space and its' data ellipsoid, with points colored according to species of the iris flowers. This is rotated smoothly until the first two principal axes are aligned with the horizontal and vertical dimensions. -->
<!-- ::: -->
::: {.content-visible unless-format="pdf"}
```{r}
#| label: fig-pca-animation
#| out-width: "100%"
#| echo: false
#| fig-cap: "Animation of PCA as a rotation in 3D space. The plot shows three variables for the `iris` data, initially in data space and its' data ellipsoid, with points colored according to species of the iris flowers. This is rotated smoothly until the first two principal axes are aligned with the horizontal and vertical dimensions."
knitr::include_graphics("images/pca-animation1.gif")
```
:::
### PCA by springs
Before delving into the mathematics of PCA, it is useful to see how Pearson's problem, and fitting by least squares generally, could be solved in a physical realization.
From elementary statistics, you may be familiar with a
physical demonstration that the mean, $\bar{x}$, of a sample is the value for which the sum of deviations,
$\Sigma_i (x_i - \bar{x})$ is zero, so the mean can be visualized as the point of balance on a line where
those differences $(x_i - \bar{x})$ are placed. Equally well, there is a physical realization of the mean
as the point along an axis where weights connected by springs will minimize the sum of squared differences,
because springs with a constant stiffness, $k$, store potential energy proportional to $k (x_i - \bar{x})^2$. That's
the reason it is useful as a measure of central tendency: it minimizes the average squared error.
In two dimensions, imagine that we have points, $(x_i, y_i)$ and these are attached by springs of equal stiffness $k$, to a line anchored at the centroid, $(\bar{x}, \bar{y})$ as shown in @fig-pca-springs.
If we rotate the line to some initial position and release it, the springs will pull the line clockwise or counterclockwise and the line will bounce around until the forces eventually balance out at the position
(shown by the `r colorize("red")` fixed line segments at the ends). This is the position that minimizes the
sum of squared lengths of the connecting springs, and also minimizes the potential energy stored in the springs.
If you look closely at @fig-pca-springs you will see something else: When the line is at its final position of minimum
squared length and energy, the positions of the `r colorize("red")` points on this line are spread out furthest, i.e., have **maximum** variance. Conversely, when the line is at right angles to its final position (shown by the black line at 90$^o$) the projected points have the smallest possible variance.
::: {#fig-pca-springs}
<div align="center">
<iframe width="412" height="364" src="images/pca-springs-cropped.gif"></iframe>
</div>
Animation of PCA fitted by springs. The `r colorize("blue")` data points are connected to their projections on the `r colorize("red")` line by springs perpendicular to that line. From an initial position, the springs pull that line in proportion to their squared distances, until the line finally settles down to the position where the forces are balanced and the minimum is achieved. _Source_: Amoeba, https://bit.ly/46tAicu.
:::
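To see this numerically, here is a minimal sketch (using simulated data, purely for illustration): among all directions through the centroid, the one that maximizes the variance of the projected points is the first eigenvector of the covariance matrix, the same direction the springs settle on.

```{r springs-demo}
# A small numerical sketch of the springs idea (simulated data, for illustration only)
set.seed(42)
x <- rnorm(100)
y <- 0.6 * x + rnorm(100, sd = 0.5)
XY <- scale(cbind(x, y), scale = FALSE)      # center the points at the centroid

# variance of the projections onto a direction at angle theta
proj_var <- function(theta) var(XY %*% c(cos(theta), sin(theta)))
angles <- seq(0, pi, length.out = 180)
best <- angles[which.max(sapply(angles, proj_var))]

# the direction of maximum projected variance agrees with the first eigenvector
# of the covariance matrix (possibly with opposite sign, which is arbitrary)
c(cos(best), sin(best))
eigen(cov(XY))$vectors[, 1]
```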
<!-- See: https://joshualoftus.com/posts/2020-11-23-least-squares-as-springs/least-squares-as-springs.html -->
<!-- **TODO**: Simple PCA example using workers data -->
### Mathematics and geometry of PCA
As the ideas of principal components developed, there was a happy marriage of Galton's geometrical intuition and Pearson's mathematical analysis. The best men at the wedding were ellipses and
higher-dimensional ellipsoids. The bridesmaids were eigenvectors, pointing in as many
different directions as space would allow, each sized according to their associated eigenvalues.
Attending the wedding were the ghosts of uncles, Leonhard Euler, Jean-Louis Lagrange,
Augustin-Louis Cauchy and others who had earlier discovered the mathematical properties
of ellipses and quadratic forms in relation to problems in physics.
The key idea in the statistical application was that, for a set of variables $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p$,
the $p \times p$ covariance matrix $\mathbf{S}$ could be expressed **exactly** as a matrix
product involving a matrix $\mathbf{V}$, whose columns are _eigenvectors_ ($\mathbf{v}_i$) and a
diagonal matrix $\mathbf{\Lambda}$, whose diagonal elements ($\lambda_i$) are the corresponding _eigenvalues_.
To explain this, it is helpful to use a bit of matrix math:
\begin{align*}
\mathbf{S}_{p \times p} & = \mathbf{V}_{p \times p} \quad\quad \mathbf{\Lambda}_{p \times p} \quad\quad \mathbf{V}_{p \times p}^\textsf{T} \\
& = \left( \mathbf{v}_1, \, \mathbf{v}_2, \,\dots, \, \mathbf{v}_p \right)
\begin{pmatrix}
\lambda_1 & & & \\
& \lambda_2 & & \\
& & \ddots & \\
& & & \lambda_p
\end{pmatrix}
\;
\begin{pmatrix}
\mathbf{v}_1^\textsf{T}\\
\mathbf{v}_2^\textsf{T}\\
\vdots\\
\mathbf{v}_p^\textsf{T}\\
\end{pmatrix}
\\
& = \lambda_1 \mathbf{v}_1 \mathbf{v}_1^\textsf{T} + \lambda_2 \mathbf{v}_2 \mathbf{v}_2^\textsf{T} + \cdots + \lambda_p \mathbf{v}_p \mathbf{v}_p^\textsf{T}
\end{align*}
In this equation,
* The last line follows because $\mathbf{\Lambda}$ is a diagonal matrix, so $\mathbf{S}$ is expressed as a sum of outer products of each $\mathbf{v}_i$ with itself.
* The columns of $\mathbf{V}$ are the eigenvectors of $\mathbf{S}$. They are orthogonal
and of unit length, so $\mathbf{V}^\textsf{T} \mathbf{V} = \mathbf{I}$ and thus
they represent orthogonal (uncorrelated) directions in data space.
* The columns $\mathbf{v}_i$ are the weights applied to the variables to produce the scores on
the principal components. For example, the first principal component is the weighted sum:
$$
PC1 = v_{11} \mathbf{x}_1 + v_{12} \mathbf{x}_2 + \cdots + v_{1p} \mathbf{x}_p
$$
* The eigenvalues, $\lambda_1, \lambda_2, \dots, \lambda_p$, are the variances of the components, because
$\mathbf{v}_i^\textsf{T} \;\mathbf{S} \; \mathbf{v}_i = \lambda_i$.
* It is usually the case that the variables $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p$ are linearly
independent, which means that none of these is an exact linear combination of the others. In this case,
all eigenvalues $\lambda_i$ are positive and the covariance matrix $\mathbf{S}$ is said to have **rank** $p$.
* Here is the key fact: If the eigenvalues are arranged in order, so that $\lambda_1 > \lambda_2 > \dots > \lambda_p$, then the first $d$ components give a $d$-dimensional approximation to $\mathbf{S}$, which accounts for $\Sigma_i^d \lambda_i / \Sigma_i^p \lambda_i$ of the total variance.
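To make these relations concrete, here is a small sketch using base R's `eigen()`; the built-in `iris` data serve only as a convenient numeric example.

```{r eigen-demo}
# Sketch: eigen-decomposition of a covariance matrix, S = V %*% Lambda %*% t(V)
S <- cov(iris[, 1:4])
e <- eigen(S)
V <- e$vectors          # columns are eigenvectors (orthonormal: t(V) %*% V = I)
lambda <- e$values      # eigenvalues = variances of the principal components

# exact reconstruction of S from the eigenvectors and eigenvalues
max(abs(V %*% diag(lambda) %*% t(V) - S))

# total variance is preserved: sum of eigenvalues = sum of the variable variances
c(sum(lambda), sum(diag(S)))

# proportion of total variance captured by the first d = 2 components
sum(lambda[1:2]) / sum(lambda)
```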
For the case of two variables, $\mathbf{x}_1$ and $\mathbf{x}_2$ @fig-pca-rotation shows the
transformation from data space to component space. The eigenvectors, $\mathbf{v}_1, \mathbf{v}_2$
are the major and minor
axes of the data ellipse, whose lengths are the square roots $\sqrt{\lambda_1}, \sqrt{\lambda_2}$
of the eigenvalues.
```{r}
#| label: fig-pca-rotation
#| out-width: "100%"
#| echo: false
#| fig-cap: "Geometry of PCA as a rotation from data space to principal component space, defined by the eigenvectors v1 and v2 of a covariance matrix"
knitr::include_graphics("images/pca-rotation.png")
```
<!-- UDI: I like this figure, but I would change the blue curved arrow. What do you think? I can put together a few drafts using Inkscape. Let me know if you agree and if you have any suggestions. -->
### Finding principal components
In R, PCA is most easily carried out using `stats::prcomp()` or `stats::princomp()`
or similar functions in other packages such as `FactoMineR::PCA()`.
The **FactoMineR** package [@R-FactoMineR; @Husson-etal-2017]
has extensive capabilities for exploratory analysis of multivariate data (PCA, correspondence analysis, cluster analysis).
A particular strength of FactoMineR
for PCA is that it allows the inclusion of _supplementary variables_ (which can be categorical
or quantitative) and _supplementary points_ for individuals. These are not used in the analysis,
but are projected into the plots to facilitate interpretation. For example, in the analysis of
the `crime` data described below, it would be useful to have measures of other characteristics
of the U.S. states, such as poverty and average level of education (@sec-supp-vars).
Unfortunately, although all of these functions perform similar calculations, the options for
analysis and the details of the result they return differ.
The important options for analysis include:
* whether or not the data variables are **centered**, to a mean of $\bar{x}_j =0$
* whether or not the data variables are **scaled**, to a variance of $\text{Var}(x_j) =1$.
It nearly always makes sense to center the variables. The choice of
scaling determines whether the correlation matrix is analyzed, so that
each variable contributes equally to the total variance that is to be accounted for,
versus analysis of the covariance matrix, where each variable contributes its
own variance to the total. Analysis of the covariance matrix makes little sense
when the variables are measured on different scales.[^pca-scales]
[^pca-scales]: For example, if two variables in the analysis are height and weight, changing the unit of height from inches to centimeters would multiply its variance by $2.54^2$; changing weight from pounds to kilograms would divide its variance by $2.2^2$.
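As a quick check of this distinction (a sketch using the built-in `USArrests` data, chosen only because it needs no setup): with `scale. = TRUE`, the component variances from `prcomp()` are the eigenvalues of the correlation matrix; without scaling, they are the eigenvalues of the covariance matrix.

```{r scale-demo}
# with scaling: component variances are the eigenvalues of the correlation matrix
pca_cor <- prcomp(USArrests, scale. = TRUE)
all.equal(pca_cor$sdev^2, eigen(cor(USArrests))$values)

# without scaling: component variances are the eigenvalues of the covariance matrix
pca_cov <- prcomp(USArrests, scale. = FALSE)
all.equal(pca_cov$sdev^2, eigen(cov(USArrests))$values)
```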
#### Example: Crime data {.unnumbered}
The dataset `crime`, analysed in @sec-corrgram, showed all positive correlations among the rates of various
crimes in the corrgram, @fig-crime-corrplot. What can we see from a PCA?
Is it possible that a few dimensions can account for most of the juice in this data?
In this example, you can easily find the PCA solution using `prcomp()` in a single line in base-R.
You need to specify the numeric variables to analyze by their columns in the data frame.
The most important option here is `scale. = TRUE`.
```{r crime-pca0}
#| eval: false
data(crime, package = "ggbiplot")
crime.pca <- prcomp(crime[, 2:8], scale. = TRUE)
```
The tidy equivalent is more verbose, but also more expressive about what is being done.
It selects the variables to analyze by a function, `is.numeric()` applied to each of the columns and feeds
the result to `prcomp()`.
```{r crime-pca}
crime.pca <-
crime |>
dplyr::select(where(is.numeric)) |>
prcomp(scale. = TRUE)
```
As is typical with models in R, the result of `prcomp()`, `crime.pca`, is an object of class `"prcomp"`,
a list of components, and there are a variety of methods for `"prcomp"` objects. Among the simplest
is `summary()`, which gives the contributions of each component to the total variance in the dataset.
```{r crime-pca-summary}
summary(crime.pca) |> print(digits=2)
```
The object `crime.pca` returned by `prcomp()` is a list with the following elements:
```{r crime-pca-components}
names(crime.pca)
```
Of these, for $n$ observations and $p$ variables,
* `sdev` is the length $p$ vector of the standard deviations of the principal components (i.e., the square roots $\sqrt{\lambda_i}$ of the eigenvalues of the covariance/correlation matrix). When the variables are standardized,
the eigenvalues (the squares of these standard deviations) sum to $p$.
* `rotation` is the $p \times p$ matrix of weights or **loadings** of the variables on the components; the columns are the eigenvectors of the covariance or correlation matrix of the data;
* `x` is the $n \times p$ matrix of **scores** for the observations on the components, the result of multiplying (rotating) the data matrix by the loadings. These are uncorrelated, so `cov(x)` is a $p \times p$ diagonal matrix whose diagonal elements are the eigenvalues $\lambda_i$ = `sdev^2`.
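These relations are easy to verify for the crime analysis. The sketch below assumes the numeric crime variables are in columns 2 to 8, as in the base-R call above.

```{r crime-pca-check}
# the component variances (eigenvalues) sum to p = 7 for standardized variables
sum(crime.pca$sdev^2)

# the scores are the standardized data post-multiplied by the loadings (rotation)
X <- scale(crime[, 2:8])
max(abs(X %*% crime.pca$rotation - crime.pca$x))
```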
### Visualizing variance proportions: screeplots
For a high-D dataset, such as the crime data in seven dimensions, a natural question is how much of the
variation in the data can be captured in 1D, 2D, 3D, ... summaries and views. This is answered by
considering the proportions of variance accounted by each of the dimensions, or their cumulative values.
The components returned by various PCA methods have (confusingly) different names, so
`broom::tidy()` provides methods to unify extraction of these values.
```{r crime-pca-tidy}
(crime.eig <- crime.pca |>
broom::tidy(matrix = "eigenvalues"))
```
Then, a simple visualization is a plot of the proportion of variance for each component (or cumulative proportion)
against the component number, usually called a **screeplot**. The idea, introduced by @Cattell1966, is that
after the largest, dominant components, the remainder should resemble the rubble, or scree formed by rocks falling
from a cliff.
From this plot, imagine
drawing a straight line through the plotted eigenvalues, starting with the largest one. The typical rough guidance is that the last point to fall on this line represents the last component to extract, the idea being that beyond this, the additional variance explained is negligible. Another rule of thumb is to choose the number
of components to extract a desired proportion of total variance, usually in the range of 80 - 90%.
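To apply the second rule of thumb numerically, here is a small sketch using the tidied eigenvalues in `crime.eig` from above:

```{r ncomp-rule}
# number of components needed to reach 80% and 90% of the total variance
crime.eig |>
  summarize(ncomp_80 = sum(cumulative < 0.80) + 1,
            ncomp_90 = sum(cumulative < 0.90) + 1)
```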
`stats::plot(crime.pca)` would give a bar plot of the variances of the components; however, `ggbiplot::ggscreeplot()` gives nicer and more flexible displays, as shown in @fig-crime-ggscreeplot.
```{r}
#| label: fig-crime-ggscreeplot
#| fig-height: 4
#| out-width: "100%"
#| fig-cap: "Screeplots for the PCA of the crime data. The left panel shows the traditional version, plotting variance proportions against component number, with linear guideline for the scree rule of thumb. The right panel plots cumulative proportions, showing cutoffs of 80%, 90%."
p1 <- ggscreeplot(crime.pca) +
stat_smooth(data = crime.eig |> filter(PC>=4),
aes(x=PC, y=percent), method = "lm",
se = FALSE,
fullrange = TRUE) +
theme_bw(base_size = 14)
p2 <- ggscreeplot(crime.pca, type = "cev") +
geom_hline(yintercept = c(0.8, 0.9), color = "blue") +
theme_bw(base_size = 14)
p1 + p2
```
From this we might conclude that four components are necessary to satisfy the scree criterion or to
account for 90% of the total variation in these crime statistics. However two components, giving
76.5%, might be enough juice to tell a reasonable story.
### Visualizing PCA scores and variable vectors
To see and attempt to understand PCA results, it is useful to plot both the scores for the observations
on a few of the largest components and also the loadings or variable vectors that give the weights
for the variables in determining the principal components.
In @sec-biplot I discuss the biplot technique that plots both in a single display. However,
I do this directly here, using tidy processing to explain what is going on in PCA
and in these graphical displays.
#### Scores {.unnumbered}
The (uncorrelated) principal component scores can be extracted as `crime.pca$x` or using `purrr::pluck("x")`. As noted above, these are uncorrelated and have variances
equal to the eigenvalues of the correlation matrix.
```{r scores}
scores <- crime.pca |> purrr::pluck("x")
cov(scores) |> zapsmall()
```
For plotting, it is more convenient to use `broom::augment()` which
extracts the scores (named `.fittedPC*`)
and appends these to the variables in the dataset.
```{r}
crime.pca |>
broom::augment(crime) |> head()
```
Then, we can use `ggplot()` to plot any pair of components.
To aid interpretation, I label the points by their state abbreviation and color them
by `region` of the U.S. A geometric interpretation of the plot requires
an aspect ratio of 1.0 (via `coord_fixed()`)
so that a unit distance on the horizontal axis is the
same length as a unit distance on the vertical.
To demonstrate that the components are uncorrelated, I also added their data
ellipse.
```{r}
#| label: fig-crime-scores-plot12
#| out-width: "100%"
#| fig-cap: "Plot of component scores on the first two principal components for the `crime` data. States are colored by `region`."
crime.pca |>
broom::augment(crime) |> # add original dataset back in
ggplot(aes(.fittedPC1, .fittedPC2, color = region)) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
geom_point(size = 1.5) +
geom_text(aes(label = st), nudge_x = 0.2) +
stat_ellipse(color = "grey") +
coord_fixed() +
labs(x = "PC Dimension 1", y = "PC Dimnension 2") +
theme_minimal(base_size = 14) +
theme(legend.position = "top")
```
To interpret such plots, it is useful to consider the observations that are high
and low on each of the axes, as well as other information, such as region here,
and ask how these differ on the crime statistics.
The first component, PC1, contrasts Nevada and California with North Dakota, South Dakota
and West Virginia. The second component has most of the southern states on the low end
and Massachusetts, Rhode Island and Hawaii on the high end. However, interpretation is
easier when we also consider how the various crimes contribute to these dimensions.
When, as here, there
are more than two components that seem important in the scree plot,
we could obviously go further and plot other pairs.
#### Variable vectors {.unnumbered}
You can extract the variable loadings using either `crime.pca$rotation` or
`purrr::pluck("rotation")`, similar to what I did with the scores.
```{r rotation}
crime.pca |> purrr::pluck("rotation")
```
But note something important in this output: All of the weights for the first component are negative. In PCA, the directions of the eigenvectors are completely arbitrary, in the sense that the vector $-\mathbf{v}_i$ gives the same linear combination as $\mathbf{v}_i$,
but with its sign reversed. For interpretation, it is useful (and usually recommended) to reflect the loadings to a positive orientation by multiplying them by -1.
To reflect the PCA loadings and get them into a convenient format for plotting with `ggplot()`, it is necessary to do a bit of processing, including making the `row.names()` into an explicit variable for the purpose of labeling.
**TODO**: Should this be a callout-warning or a footnote??
::: {.callout-warning title="`rownames` in R"}
R software evolved over many years, particularly in conventions for labeling cases in printed output and graphics. In base-R, the convention was that the `row.names()` of a matrix or data.frame served as observation labels in all printed output and plots, with a default to use numbers `1:n` if there were no rownames. In `ggplot2` and the `tidyverse` framework, the decision was made that observation labels had to be an **explicit** variable in a "tidy" dataset, so it could be used as a variable in constructs like
`geom_text(aes(label = label))` as in this example. This change often requires extra steps in software that uses the rownames convention.
:::
```{r vectors}
vectors <- crime.pca |>
purrr::pluck("rotation") |>
as.data.frame() |>
mutate(PC1 = -1 * PC1, PC2 = -1 * PC2) |> # reflect axes
tibble::rownames_to_column(var = "label")
vectors[, 1:3]
```
Then, I plot these using `geom_segment()`, taking some care to use arrows
from the origin with a nice shape and add `geom_text()` labels for the variables positioned slightly to the right. Again, `coord_fixed()` ensures equal scales for the axes, which is important because we want to interpret the angles between the variable vectors and the
PCA coordinate axes.
```{r}
#| label: fig-crime-vectors
#| out-width: "80%"
#| fig-cap: "Plot of component loadings the first two principal components for the `crime` data. These are interpreted as the contributions of the variables to the components."
arrow_style <- arrow(
angle = 20, ends = "first", type = "closed",
length = grid::unit(8, "pt")
)
vectors |>
ggplot(aes(PC1, PC2)) +
geom_hline(yintercept = 0) +
geom_vline(xintercept = 0) +
geom_segment(xend = 0, yend = 0,
linewidth = 1,
arrow = arrow_style,
color = "brown") +
geom_text(aes(label = label),
size = 5,
hjust = "outward",
nudge_x = 0.05,
color = "brown") +
ggforce::geom_circle(aes(x0 = 0, y0 = 0, r = 0.5), color = gray(.50)) +
xlim(-0.5, 0.9) +
ylim(-0.8, 0.8) +
coord_fixed() + # fix aspect ratio to 1:1
theme_minimal(base_size = 14)
```
The variable vectors (arrows) shown in @fig-crime-vectors have the following interpretations:
(1) The lengths of the variable vectors, $\lVert\mathbf{v}_i\rVert = \sqrt{\Sigma_{j} \; v_{ij}^2}$ give the relative proportion of variance of each variable accounted for in a two-dimensional display.
(2) Each vector points in the direction in component space with which that variable is most highly correlated:
the value, $v_{ij}$, of the vector for variable $\mathbf{x}_i$ on component $j$ reflects
the correlation of that variable with the $j$th principal component. Thus,
  * A variable that is perfectly correlated with a component is parallel to it.
  * A variable that is uncorrelated with a component is perpendicular to it.
(3) The angle between vectors shows the strength and direction of the correlation between those variables:
the cosine of the angle between two variable vectors, $\mathbf{v}_i$ and $\mathbf{v}_j$
gives the approximation of the correlation between $\mathbf{x}_i$ and $\mathbf{x}_j$
that is shown in this space. This means that
  * Two variable vectors that point in the same direction are highly correlated; $r = 1$ if they are completely aligned.
* Variable vectors at right angles are approximately uncorrelated, while those
pointing in opposite directions are negatively correlated; $r = -1$ if they are at 180$^o$.
To illustrate point (1), the following indicates that almost 70% of the variance of
`murder` is represented in the 2D plot shown in @fig-crime-scores-plot12, but only
40% of the variance of `robbery` is captured. For point (2), the correlation of `murder`
with the dimensions is 0.3 for PC1 and 0.63 for PC2. For point (3), the angle between
`murder` and `burglary` looks to be about 90$^o$, but the actual correlation is 0.39.
<!-- • Each vector corresponds to a variable, and points in the direction in which that variable is most strongly linearly correlated. -->
<!-- • Variables that are perfectly correlated with an axis are parallel to it. -->
<!-- • Variables that are uncorrelated with an axis are perpendicular to it. -->
<!-- • Each vector can be thought of as the hypotenuse of a triangle with sides along the two axes of the graph. The length of these sides is proportional to the loading for that variable with those PCs. -->
<!-- • The angle between vectors shows the strength and direction of the correlation between those variables. You should be able to find examples of each of these cases in the above biplot: -->
<!-- o Vectors that are very close to one another represent variables that are strongly positively correlated -->
<!-- o Vectors that are perpendicular represent variables that are uncorrelated -->
<!-- o Vectors that point in opposite directions represent variables that are strongly negatively correlated -->
```{r}
vectors |> select(label, PC1, PC2) |>
mutate(length = sqrt(PC1^2 + PC2^2))
```
## Biplots {#sec-biplot}
The biplot is a simple and powerful idea that came from the recognition that
you can overlay a plot of observation scores in a principal components analysis
with the information of the variable loadings (weights) to give a simultaneous display
that is easy to interpret. In this sense, a biplot is a generalization of a scatterplot,
projecting from data space to PCA space, where the observations are shown by points, as in the plots of
component scores in @fig-crime-scores-plot12, but with the
variables also shown by vectors (or scaled linear axes aligned with those vectors).
The idea of the biplot was introduced by Ruben Gabriel [-@Gabriel:71;-@Gabriel:81] and later expanded in scope by @GowerHand:96. The book by @Greenacre:2010:biplots gives a practical overview
of the many varieties of biplots, and @Gower-etal:2011 provide a full treatment ...
Biplot methodology is far more general than I cover here. Categorical variables can be incorporated in PCA
using category level points.
Two-way frequency tables of categorical variables
can be analysed using _correspondence analysis_, which is similar to
PCA, but designed to account for the maximum amount of the $\chi^2$ statistic
for association; _multiple correspondence analysis_ extends this
method to multi-way tables [@FriendlyMeyer:2016:DDAR; @Greenacre:84].
### Constructing a biplot
The biplot is constructed by using the singular value decomposition (SVD) to obtain a low-rank approximation to the data matrix $\mathbf{X}_{n \times p}$ (centered, and optionally scaled to unit variances)
whose $n$ rows are the observations and whose $p$ columns are the variables.
```{r}
#| label: fig-svd-diagram
#| echo: false
#| out-width: "80%"
#| fig-cap: "The singular value decomposition expresses a data matrix **X** as the product of a matrix **U** of observation scores, a diagonal matrix $\\Lambda$ of singular values and a matrix **V** of variable weights."
knitr::include_graphics("images/SVD.png")
```
Using the SVD, the matrix $\mathbf{X}$, of rank $r \le p$
can be expressed _exactly_ as:
$$
\mathbf{X} = \mathbf{U} \mathbf{\Lambda} \mathbf{V}'
= \sum_i^r \lambda_i \mathbf{u}_i \mathbf{v}_i' \; ,
$$ {#eq-svd1}
where
* $\mathbf{U}$ is an $n \times r$ orthonormal matrix of uncorrelated observation scores; these are also the
eigenvectors of $\mathbf{X} \mathbf{X}'$,
* $\mathbf{\Lambda}$ is an $r \times r$ diagonal matrix of singular values,
$\lambda_1 \ge \lambda_2 \ge \cdots \lambda_r$, which are also the square roots
of the eigenvalues of $\mathbf{X} \mathbf{X}'$.
* $\mathbf{V}$ is a $p \times r$ orthonormal matrix of variable weights, whose columns are also the
eigenvectors of $\mathbf{X}' \mathbf{X}$.
Then, a rank 2 (or 3) PCA approximation $\widehat{\mathbf{X}}$ to the data matrix used in the biplot can be obtained from the first 2 (or 3)
singular values $\lambda_i$ and the corresponding $\mathbf{u}_i, \mathbf{v}_i$ as:
$$
\mathbf{X} \approx \widehat{\mathbf{X}} = \lambda_1 \mathbf{u}_1 \mathbf{v}_1' + \lambda_2 \mathbf{u}_2 \mathbf{v}_2' \; .
$$
The variance of $\mathbf{X}$ accounted for by each term is $\lambda_i^2$.
A biplot is then obtained by overlaying two scatterplots that share a common set of axes and have a between-set scalar
product interpretation. Typically, the observations (rows of $\mathbf{X}$) are represented as points
and the variables (columns of $\mathbf{X}$) are represented as vectors from the origin.
The `scale` factor, $\alpha$, allows the variances of the components to be apportioned between the
row points and column vectors, with different interpretations, by representing the approximation
$\widehat{\mathbf{X}}$ as the product of two matrices,
$$
\widehat{\mathbf{X}} = (\mathbf{U} \mathbf{\Lambda}^\alpha) (\mathbf{\Lambda}^{1-\alpha} \mathbf{V}') = \mathbf{A} \mathbf{B}'
$$
This notation uses a little math trick involving a power, $0 \le \alpha \le 1$:
When $\alpha = 1$, $\mathbf{\Lambda}^\alpha = \mathbf{\Lambda}^1 =\mathbf{\Lambda}$,
and $\mathbf{\Lambda}^{1-\alpha} = \mathbf{\Lambda}^0 =\mathbf{I}$.
$\alpha = 1/2$ gives the diagonal matrix $\mathbf{\Lambda}^{1/2}$ whose elements are the square roots of the singular values.
The choice $\alpha = 1$ assigns the singular values totally to the left factor;
then, the angle between two variable vectors, reflecting the inner product
$\mathbf{x}_j^\textsf{T} \mathbf{x}_{j'}$, approximates their correlation or covariance,
and the distance between the points approximates their Mahalanobis distances.
$\alpha = 0$ gives a distance interpretation to the column display.
$\alpha = 1/2$ gives a symmetrically scaled biplot.
**TODO**: Explain this better.
When the singular values are assigned totally to the left or to the right factor, the resultant
coordinates are called _principal coordinates_, and the sum of squared coordinates
on each dimension equals the corresponding singular value.
The other matrix, to which no part of the singular
values is assigned, contains the so-called _standard coordinates_, whose sums of squared
values equal 1.0.
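To make this concrete, here is a minimal sketch of how biplot coordinates could be computed directly from the SVD. It illustrates the equations above (it is not the internals of any biplot package) and assumes the `crime` data as before.

```{r svd-biplot-sketch}
X <- scale(crime[, 2:8])                 # centered and standardized data matrix
udv <- svd(X)

alpha <- 1                               # assign singular values to the row (observation) factor
A <- udv$u %*% diag(udv$d^alpha)         # observation coordinates (principal coordinates)
B <- udv$v %*% diag(udv$d^(1 - alpha))   # variable coordinates (standard coordinates)

# rank-2 approximation underlying a 2D biplot
X_hat <- A[, 1:2] %*% t(B[, 1:2])

# proportion of the total variance captured by the first two dimensions
sum(udv$d[1:2]^2) / sum(udv$d^2)
```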
### Biplots in R
There are a large number of R packages providing biplots. The most basic, `stats::biplot()`, provides methods for `"prcomp"` and `"princomp"` objects. Among other packages, the **factoextra** package [@R-factoextra], an extension of
**FactoMineR** [@R-FactoMineR], is perhaps the most comprehensive and provides `ggplot2` graphics.
In addition to biplot methods for
quantitative data using PCA (`fviz_pca()`), it offers biplots for categorical data using correspondence analysis
(`fviz_ca()`) and multiple correspondence analysis (`fviz_mca()`); factor analysis with mixed quantitative and categorical variables (`fviz_famd()`) and cluster analysis (`fviz_cluster()`).
**TODO**: Also mention **adegraphics** package
Here, I use the **ggbiplot** package, which aims to provide a simple interface to biplots within the `ggplot2` framework. I also use some convenient utility functions from **factoextra**.
### Example: Crime data
A basic biplot of the `crime` data, using standardized principal components and labeling the observations by their state abbreviation, is shown in @fig-crime-biplot1.
The correlation circle indicates that these components are uncorrelated and have
equal variance in the display.
```{r}
#| label: fig-crime-biplot1
#| out-width: "80%"
#| fig-cap: "Basic biplot of the crime data. State abbreviations are shown at their standardized scores on the first two dimensions. The variable vectors reflect the correlations of the variables with the biplot dimensions."
crime.pca <- reflect(crime.pca) # reflect the axes
ggbiplot(crime.pca,
obs.scale = 1, var.scale = 1,
labels = crime$st ,
circle = TRUE,
varname.size = 4,
varname.color = "brown") +
theme_minimal(base_size = 14)
```
In this dataset the states are grouped by region and we saw some differences among regions in the plot (@fig-crime-scores-plot12) of component scores.
`ggbiplot()` provides options to include a `groups =` variable, used to
color the observation points and also to draw their data ellipses, facilitating interpretation.
```{r}
#| label: fig-crime-biplot2
#| out-width: "80%"
#| fig-cap: "Enhanced biplot of the crime data, grouping the states by region and adding data ellipses."
ggbiplot(crime.pca,
obs.scale = 1, var.scale = 1,
groups = crime$region,
labels = crime$st,
labels.size = 4,
var.factor = 1.4,
ellipse = TRUE,
ellipse.prob = 0.5, ellipse.alpha = 0.1,
circle = TRUE,
varname.size = 4,
varname.color = "black") +
labs(fill = "Region", color = "Region") +
theme_minimal(base_size = 14) +
theme(legend.direction = 'horizontal', legend.position = 'top')
```
This plot provides what is necessary to interpret the nature of the components and also the variation of the states in relation to these. In this, the data ellipses for the regions
provide a visual summary that aids interpretation.
* From the variable vectors, it seems that PC1, having all positive and nearly equal loadings, reflects a total or overall index of crimes. Nevada, California, New York and Florida are highest on this, while North Dakota, South Dakota and West Virginia are lowest.
* The second component, PC2, shows a contrast between crimes against persons (murder, assault, rape) at the top and property crimes (auto theft, larceny) at the bottom. Nearly all the Southern states are high on personal crimes; states in the North East are generally higher
on property crimes.
* Western states tend to be somewhat higher on overall crime rate, while North Central are lower on average. In these states there is not much variation in the relative proportions of personal vs. property crimes.
Moreover, in this biplot you can interpret the value for a particular state on a given crime by considering its projection on the variable vector: the origin corresponds to the mean, positions along the direction of the vector have greater than average values on that crime, and positions in the opposite direction have lower than average values. For example, Massachusetts has the highest value on auto theft, but a value less than the mean on murder. Louisiana and South Carolina, on the other hand, are highest in the rate of murder and slightly less than average on auto theft.
These 2D plots account for only 76.5% of the total variance of crimes, so it is useful to also examine the third principal component, which accounts for an additional 10.4%.
The `choices =` option controls which dimensions are plotted.
```{r}
#| label: fig-crime-biplot3
#| out-width: "80%"
#| fig-cap: "Biplot of dimensions 1 & 3 of the crime data, with data ellipses for the regions."
ggbiplot(crime.pca,
choices = c(1,3),
obs.scale = 1, var.scale = 1,
groups = crime$region,
labels = crime$st,
labels.size = 4,
var.factor = 2,
ellipse = TRUE,
ellipse.prob = 0.5, ellipse.alpha = 0.1,
circle = TRUE,
varname.size = 4,
varname.color = "black") +
labs(fill = "Region", color = "Region") +
theme_minimal(base_size = 14) +
theme(legend.direction = 'horizontal', legend.position = 'top')
```
Dimension 3 in @fig-crime-biplot3 is more subtle. One interpretation is a contrast between
larceny, which is simple theft, and robbery, which involves stealing something from a person
and is considered a more serious crime with an element of possible violence.
In this plot, murder has a relatively short variable vector, so does not contribute
very much to differences among the states.
### Biplot contributions and quality
To better understand how much each variable contributes to the biplot dimensions,
it is helpful to see information about the variance of variables along each dimension.
Graphically, this is nothing more than a measure of the lengths of projections
of the variables on each of the dimensions.
`factoextra::get_pca_var()` calculates a number of tables from a `"prcomp"` or similar object.
```{r var-info}
var_info <- factoextra::get_pca_var(crime.pca)
names(var_info)
```
The component `cor` gives correlations of the variables with the dimensions and
`contrib` gives their variance contributions as percents, where each row and column sums to 100.
```{r contrib}
contrib <- var_info$contrib
cbind(contrib, Total = rowSums(contrib)) |>
rbind(Total = c(colSums(contrib), NA)) |>
round(digits=2)
```
These contributions can be visualized as sorted barcharts for a given axis
using `factoextra::fviz_contrib()`. The dashed horizontal lines are at the average value for each dimension.
```{r}
#| label: fig-fviz-contrib
#| out-width: "100%"
#| fig-cap: "Contributions of the crime variables to dimensions 1 (left) & 2 (right) of the PCA solution"
p1 <- fviz_contrib(crime.pca, choice = "var", axes = 1,
fill = "lightgreen", color = "black")
p2 <- fviz_contrib(crime.pca, choice = "var", axes = 2,
fill = "lightgreen", color = "black")
p1 + p2
```
A simple rubric for interpreting the dimensions in terms of the variable contributions is to mention those that
are largest or above average on each dimension. So, burglary and rape contribute most to the first dimension, while murder and auto theft contribute most to the second.
Another useful measure is called `cos2`, the _quality_ of representation, meaning how much of a variable is represented in a given component. The columns sum to the eigenvalue for each dimension.
The rows each sum to 1.0, meaning each variable is completely
represented on all components, but we can find the quality of a $k$-D solution by summing the values in the
first $k$ columns.
These can be plotted
in a style similar to @fig-fviz-contrib using `factoextra::fviz_cos2()`.
```{r quality}
quality <- var_info$cos2
rowSums(quality)
colSums(quality)
cbind(quality[, 1:2],
Total = rowSums(quality[, 1:2])) |>
round(digits = 2)
```
In two dimensions, murder and burglary are best represented; robbery and larceny are the worst, but as
we saw above (@fig-crime-biplot3), these crimes are implicated in the third dimension.
### Supplementary variables {#sec-supp-vars}
An important feature of biplot methodology is that once you have a reduced-rank display of the
relations among a set of variables, you can use other available data to help interpret what
is shown in the biplot. In a sense, this is what I did above in @fig-crime-biplot2 and @fig-crime-biplot3
using `region` as a grouping variable and summarizing the variability in the scores for states with their data ellipses by region.
When we have other quantitative variables on the same observations, these can be represented as supplementary
variables in the same space, by what amounts to regressions of these new variables on the scores for the principal component dimensions. For example, the left panel of @fig-supp-regession depicts the vector geometry of a regression
of a variable $\mathbf{y}$ on two predictors, $\mathbf{x}_1$ and $\mathbf{x}_2$. The fitted vector,
$\widehat{\mathbf{y}}$, is the perpendicular projection of $\mathbf{y}$ onto the plane of $\mathbf{x}_1$ and $\mathbf{x}_2$. In the same way, in the right panel the supplementary variable is projected into the plane
of two principal component axes. The fitted vector shows how that additional variable relates to the biplot dimensions.
```{r}
#| label: fig-supp-regession
#| echo: false
#| out-width: "90%"
#| fig-cap: "Fitting supplementary variables in a biplot is analogous to regression on the principal component dimensions. _Source_: @Aluja-etal-2018, Figure 2.11"
knitr::include_graphics("images/pca4ds-figure-2-11.png")
```
For this example, it happens that some suitable supplementary
variables to aid interpretation of crime rates
are available in the dataset `datasets::state.x77`, which was obtained from the U.S. Bureau of the Census
_Statistical Abstract of the United States_ for 1977.
I select a few of these below and make the state name a column variable so it can be merged with the `crime` data.
```{r supp-data}
supp_data <- state.x77 |>
as.data.frame() |>
tibble::rownames_to_column(var = "state") |>
rename(Life_Exp = `Life Exp`,
HS_Grad = `HS Grad`) |>
select(state, Income:Life_Exp, HS_Grad)
head(supp_data)
```
Then, we can merge the `crime` data with the `supp_data` dataset to produce something suitable
for analysis using `FactoMineR::PCA()`.
```{r crime-joined}
crime_joined <-
dplyr::left_join(crime[, 1:8], supp_data, by = "state")
names(crime_joined)
```
`PCA()` can only get the labels for the observations from the `row.names()` of the dataset,
so I assign them explicitly. The supplementary variables are specified by the argument `quanti.sup`
as the indices of the columns in what is passed as the data argument.
```{r crime-joined-PCA}
row.names(crime_joined) <- crime$st
crime.PCA_sup <- PCA(crime_joined[,c(2:8, 9:12)],
quanti.sup = 8:11,
scale.unit=TRUE,
ncp=3,
graph = FALSE)
```
The essential difference between the result of `prcomp()` used earlier to get the `crime.pca` object
and the result of `PCA()` with supplementary variables is that the `crime.PCA_sup` object now contains
a `quanti.sup` component containing the coordinates for the supplementary variables in PCA space.
These can be calculated directly as the coefficients of a multivariate regression of the standardized supplementary variables on the PCA scores for the dimensions, with no intercept, to force the fitted vectors to go through the origin. For example, in the plot below (@fig-crime-factominer), the vector for Income has
coordinates (0.192, -0.530) on the first two PCA dimensions.
```{r reg-data}
reg.data <- cbind(scale(supp_data[, -1]),
crime.PCA_sup$ind$coord) |>
as.data.frame()
sup.mod <- lm(cbind(Income, Illiteracy, Life_Exp, HS_Grad) ~
0 + Dim.1 + Dim.2 + Dim.3,
data = reg.data )
(coefs <- t(coef(sup.mod)))
```
Note that these coefficients are the same as the correlations between the supplementary variables
and the scores on the principal components, up to a scaling factor for each dimension. This provides a
general way to relate dimensions found in other methods to the original data variables using vectors
as in biplot techniques.
```{r}
cor(reg.data[, 1:4], reg.data[, 5:7]) |>
print() -> R
R / coefs
```
The `PCA()` result can be plotted using `FactoMineR::plot()`
or various `factoextra` functions like `fviz_pca_var()` for a plot of the variable vectors or
`fviz_pca_biplot()` for a biplot. When a `quanti.sup` component is present, supplementary variables are
also shown in the displays.
For simplicity I use `FactoMineR::plot()` here and only show the variable vectors. For consistency with earlier
plots, I first reflect the orientation of the 2nd PCA dimension so that crimes of personal violence are at the top, as in @fig-crime-vectors.
```{r}
#| label: fig-crime-factominer
#| out-width: "90%"
#| fig-cap: "PCA plot of variables for the crime data, with vectors for the supplementary variables showing their association with the principal component dimensions."
# reverse coordinates of Dim 2
crime.PCA_sup <- ggbiplot::reflect(crime.PCA_sup, columns = 2)
# also reverse the orientation of coordinates for supplementary vars on Dim 2
# crime.PCA_sup$quanti.sup$coord[, 2] <- -crime.PCA_sup$quanti.sup$coord[, 2]
plot(crime.PCA_sup, choix = "var")
```
Recall that from earlier analyses, I interpreted the dominant PC1 dimension as reflecting the overall
rate of crime. The contributions to this dimension, which are the projections of the variable vectors
on the horizontal axis in @fig-crime-vectors and @fig-crime-biplot2 were shown graphically by
barcharts in the left panel of @fig-fviz-contrib.
But now in @fig-crime-factominer, with the addition of variable vectors for the supplementary variables, you can see how income, rate of illiteracy, life expectancy and proportion of high school graduates are related
to the variation in rates of crimes for the U.S. states.
On dimension 1, what stands out is that
life expectancy is associated with lower overall crime, while the other supplementary variables have
positive associations.
On dimension 2, crimes against persons (murder, assault, rape) are associated with greater rates of
illiteracy among the states, which as we earlier saw (@fig-crime-biplot2) were more often Southern states.
Crimes against property (auto theft, larceny) at the bottom of this dimension are associated with
higher levels of income and high school graduates.
### Example: Diabetes data
As another example, consider the data from @ReavenMiller:79 on measures of insulin and glucose
shown in @fig-diabetes1, which led to the discovery of two distinct types of development of Type 2 diabetes (@sec-discoveries).
This dataset is available as `heplots::Diabetes`. The three groups are `Normal`, `Chemical_Diabetic` and
`Overt_Diabetic`, and the (numerical) diagnostic variables are:
* `relwt`: relative weight, the ratio of actual to expected weight, given the person's height,
* `glufast`: fasting blood plasma glucose level
* `glutest`: test blood plasma glucose level, a measure of glucose intolerance
* `instest`: plasma insulin during test, a measure of insulin response to oral glucose
* `sspg`: steady state plasma glucose, a measure of insulin resistance