Principal component analysis
From Wikipedia, the free encyclopedia
Conversion of observations of possibly-correlated variables into values of
fewer, uncorrelated variables
[Figure: PCA of a multivariate Gaussian distribution centered at (1,3) with a standard deviation of 3 in roughly the (0.866, 0.5) direction and of 1 in the orthogonal direction. The vectors shown are the eigenvectors of the covariance matrix scaled by the square root of the corresponding eigenvalue, and shifted so their tails are at the mean.]
The principal components of a collection of points in a real coordinate space are a sequence of p unit vectors, where the i-th vector is the direction of a line that best fits the data while being orthogonal to the first i − 1 vectors. Here, a best-fitting line is defined as one that minimizes the average squared distance from the points to the line. These directions constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.
PCA is used in exploratory data analysis and for making predictive models. It
is commonly used for dimensionality reduction by projecting each data point
onto only the first few principal components to obtain lower-dimensional data
while preserving as much of the data's variation as possible. The first
principal component can equivalently be defined as a direction that maximizes
the variance of the projected data. The i-th principal component can be taken as a direction orthogonal to the first i − 1 principal components that maximizes the variance of the projected data.
From either objective, it can be shown that the principal components are
eigenvectors of the data's covariance matrix. Thus, the principal components
are often computed by eigendecomposition of the data covariance matrix or
singular value decomposition of the data matrix. PCA is the simplest of the
true eigenvector-based multivariate analyses and is closely related to factor
analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix. PCA is also related to canonical correlation
analysis (CCA). CCA defines coordinate systems that optimally describe the
cross-covariance between two datasets while PCA defines a new orthogonal
coordinate system that optimally describes variance in a single dataset.^[1]^
[2]^[3]^[4] Robust and L1-norm-based variants of standard PCA have also been
proposed.^[5]^[6]^[4]
Contents
• 1 History
• 2 Intuition
• 3 Details
□ 3.1 First component
□ 3.2 Further components
□ 3.3 Covariances
□ 3.4 Dimensionality reduction
□ 3.5 Singular value decomposition
• 4 Further considerations
• 5 Table of symbols and abbreviations
• 6 Properties and limitations of PCA
□ 6.1 Properties
□ 6.2 Limitations
□ 6.3 PCA and information theory
• 7 Computing PCA using the covariance method
• 8 Derivation of PCA using the covariance method
• 9 Covariance-free computation
□ 9.1 Iterative computation
□ 9.2 The NIPALS method
□ 9.3 Online/sequential estimation
• 10 PCA and qualitative variables
• 11 Applications
□ 11.1 Quantitative finance
□ 11.2 Neuroscience
• 12 Relation with other methods
□ 12.1 Correspondence analysis
□ 12.2 Factor analysis
□ 12.3 K-means clustering
□ 12.4 Non-negative matrix factorization
□ 12.5 Iconography of correlations
• 13 Generalizations
□ 13.1 Sparse PCA
□ 13.2 Nonlinear PCA
□ 13.3 Robust PCA
• 14 Similar techniques
□ 14.1 Independent component analysis
□ 14.2 Network component analysis
□ 14.3 Discriminant analysis of principal components
• 15 Software/source code
• 16 See also
• 17 References
• 18 Further reading
• 19 External links
History
PCA was invented in 1901 by Karl Pearson,^[7] as an analogue of the principal
axis theorem in mechanics; it was later independently developed and named by
Harold Hotelling in the 1930s.^[8] Depending on the field of application, it is
also named the discrete Karhunen–Loève transform (KLT) in signal processing,
the Hotelling transform in multivariate quality control, proper orthogonal
decomposition (POD) in mechanical engineering, singular value decomposition
(SVD) of X (invented in the last quarter of the 19th century^[9]), eigenvalue
decomposition (EVD) of X^TX in linear algebra, factor analysis (for a
discussion of the differences between PCA and factor analysis see Ch. 7 of
Jolliffe's Principal Component Analysis),^[10] Eckart–Young theorem (Harman,
1960), or empirical orthogonal functions (EOF) in meteorological science,
empirical eigenfunction decomposition (Sirovich, 1987), empirical component
analysis (Lorenz, 1956), quasiharmonic modes (Brooks et al., 1988), spectral
decomposition in noise and vibration, and empirical modal analysis in
structural dynamics.
Intuition
PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where
each axis of the ellipsoid represents a principal component. If some axis of
the ellipsoid is small, then the variance along that axis is also small.
To find the axes of the ellipsoid, we must first center the values of each
variable in the dataset on 0 by subtracting the mean of the variable's observed
values from each of those values. These transformed values are used instead of
the original observed values for each of the variables. Then, we compute the
covariance matrix of the data and calculate the eigenvalues and corresponding
eigenvectors of this covariance matrix. Then we must normalize each of the
orthogonal eigenvectors to turn them into unit vectors. Once this is done, each
of the mutually orthogonal unit eigenvectors can be interpreted as an axis of
the ellipsoid fitted to the data. This choice of basis will transform our
covariance matrix into a diagonalised form with the diagonal elements
representing the variance of each axis. The proportion of the variance that
each eigenvector represents can be calculated by dividing the eigenvalue
corresponding to that eigenvector by the sum of all eigenvalues.
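As a rough illustration of this recipe, the following NumPy sketch (the random data and variable names are purely illustrative, not from the original text) centers the data, computes the covariance matrix, extracts unit eigenvectors, and reports the proportion of variance along each axis:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # 200 observations of 3 variables (illustrative data)

Xc = X - X.mean(axis=0)                # center each variable on 0
C = np.cov(Xc, rowvar=False)           # covariance matrix of the data
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns unit-length eigenvectors of a symmetric matrix

order = np.argsort(eigvals)[::-1]      # sort the axes by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()    # proportion of variance represented by each eigenvector
print(explained)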
Biplots and scree plots (which show the degree of explained variance) are used to interpret the findings of the PCA.
Details
PCA is defined as an orthogonal linear transformation that transforms the data
to a new coordinate system such that the greatest variance by some scalar
projection of the data comes to lie on the first coordinate (called the first
principal component), the second greatest variance on the second coordinate,
and so on.^[10]^[page needed]
Consider an n × p data matrix, X, with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the n rows represents a different repetition of the experiment, and each of the p columns gives a particular kind of feature (say, the results from a particular sensor).
Mathematically, the transformation is defined by a set of size l of p-dimensional vectors of weights or coefficients w_(k) = (w_1, …, w_p)_(k) that map each row vector x_(i) of X to a new vector of principal component scores t_(i) = (t_1, …, t_l)_(i), given by

    t_{k(i)} = \mathbf{x}_{(i)} \cdot \mathbf{w}_{(k)} \qquad \text{for} \qquad i = 1, \dots, n \qquad k = 1, \dots, l

in such a way that the individual variables t_1, …, t_l of t considered over the data set successively inherit the maximum possible variance from X, with each coefficient vector w constrained to be a unit vector (where l is usually selected to be strictly less than p to reduce dimensionality).
First component
In order to maximize variance, the first weight vector w[(1)] thus has to satisfy

    \mathbf{w}_{(1)} = \arg\max_{\|\mathbf{w}\|=1} \left\{ \sum_{i}(t_{1})_{(i)}^{2} \right\} = \arg\max_{\|\mathbf{w}\|=1} \left\{ \sum_{i}\left(\mathbf{x}_{(i)}\cdot \mathbf{w}\right)^{2} \right\}
Equivalently, writing this in matrix form gives

    \mathbf{w}_{(1)} = \arg\max_{\|\mathbf{w}\|=1} \left\{ \|\mathbf{X}\mathbf{w}\|^{2} \right\} = \arg\max_{\|\mathbf{w}\|=1} \left\{ \mathbf{w}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{w} \right\}
Since w[(1)] has been defined to be a unit vector, it equivalently also satisfies

    \mathbf{w}_{(1)} = \arg\max \left\{ \frac{\mathbf{w}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{w}}{\mathbf{w}^{\mathsf{T}}\mathbf{w}} \right\}
The quantity to be maximised can be recognised as a Rayleigh quotient. A
standard result for a positive semidefinite matrix such as X^TX is that the
quotient's maximum possible value is the largest eigenvalue of the matrix,
which occurs when w is the corresponding eigenvector.
With w[(1)] found, the first principal component of a data vector x[(i)] can
then be given as a score t[1(i)] = x[(i)] ⋅ w[(1)] in the transformed
co-ordinates, or as the corresponding vector in the original variables, {x[(i)]
⋅ w[(1)]} w[(1)].
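A minimal numerical sketch of this result, assuming a centered data matrix X (the random data below is illustrative, not from the original text): the first weight vector is the unit eigenvector of X^TX with the largest eigenvalue, and the maximised sum of squared scores equals that eigenvalue.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X = X - X.mean(axis=0)                      # column-wise zero empirical mean

eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # X^T X is symmetric positive semidefinite
w1 = eigvecs[:, np.argmax(eigvals)]         # unit eigenvector for the largest eigenvalue: w_(1)

t1 = X @ w1                                 # scores of the first principal component
print(np.sum(t1**2), eigvals.max())         # the maximised quantity equals the largest eigenvalue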
Further components
The k-th component can be found by subtracting the first k − 1 principal components from X:

    \mathbf{\hat{X}}_{k} = \mathbf{X} - \sum_{s=1}^{k-1}\mathbf{X}\mathbf{w}_{(s)}\mathbf{w}_{(s)}^{\mathsf{T}}
and then finding the weight vector which extracts the maximum variance from this new data matrix

    \mathbf{w}_{(k)} = \arg\max_{\|\mathbf{w}\|=1} \left\{ \|\mathbf{\hat{X}}_{k}\mathbf{w}\|^{2} \right\} = \arg\max \left\{ \frac{\mathbf{w}^{\mathsf{T}}\mathbf{\hat{X}}_{k}^{\mathsf{T}}\mathbf{\hat{X}}_{k}\mathbf{w}}{\mathbf{w}^{\mathsf{T}}\mathbf{w}} \right\}
It turns out that this gives the remaining eigenvectors of X^TX, with the
maximum values for the quantity in brackets given by their corresponding
eigenvalues. Thus the weight vectors are eigenvectors of X^TX.
The k-th principal component of a data vector x[(i)] can therefore be given as
a score t[k(i)] = x[(i)] ⋅ w[(k)] in the transformed coordinates, or as the
corresponding vector in the space of the original variables, {x[(i)] ⋅ w[(k)]}
w[(k)], where w[(k)] is the kth eigenvector of X^TX.
The full principal components decomposition of X can therefore be given as

    \mathbf{T} = \mathbf{X}\mathbf{W}
where W is a p-by-p matrix of weights whose columns are the eigenvectors of X^T
X. The transpose of W is sometimes called the whitening or sphering
transformation. Columns of W multiplied by the square root of corresponding
eigenvalues, that is, eigenvectors scaled up by the variances, are called
loadings in PCA or in Factor analysis.
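To make the decomposition concrete, here is a short sketch (illustrative data and names; the loadings are formed with the eigenvalues of X^TX, matching the W and Λ used in this section):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
X = X - X.mean(axis=0)

eigvals, W = np.linalg.eigh(X.T @ X)        # columns of W are eigenvectors of X^T X
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], W[:, order]    # sorted by decreasing eigenvalue

T = X @ W                                   # full principal components decomposition: T = X W
loadings = W * np.sqrt(eigvals)             # columns of W scaled by square roots of the eigenvalues
print(T.shape, loadings.shape)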
Covariances
X^TX itself can be recognized as proportional to the empirical sample
covariance matrix of the dataset X^T.^[10]^: 30–31
The sample covariance Q between two of the different principal components over
the dataset is given by:
    Q(\mathrm{PC}_{(j)}, \mathrm{PC}_{(k)}) \propto (\mathbf{X}\mathbf{w}_{(j)})^{\mathsf{T}}(\mathbf{X}\mathbf{w}_{(k)})
                                            = \mathbf{w}_{(j)}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{X}\mathbf{w}_{(k)}
                                            = \mathbf{w}_{(j)}^{\mathsf{T}}\lambda_{(k)}\mathbf{w}_{(k)}
                                            = \lambda_{(k)}\mathbf{w}_{(j)}^{\mathsf{T}}\mathbf{w}_{(k)}
where the eigenvalue property of w[(k)] has been used to move from line 2 to
line 3. However, eigenvectors w[(j)] and w[(k)] corresponding to eigenvalues of
a symmetric matrix are orthogonal (if the eigenvalues are different), or can be
orthogonalised (if the vectors happen to share an equal repeated value). The
product in the final line is therefore zero; there is no sample covariance
between different principal components over the dataset.
Another way to characterise the principal components transformation is
therefore as the transformation to coordinates which diagonalise the empirical
sample covariance matrix.
In matrix form, the empirical covariance matrix for the original variables can
be written
    \mathbf{Q} \propto \mathbf{X}^{\mathsf{T}}\mathbf{X} = \mathbf{W}\mathbf{\Lambda}\mathbf{W}^{\mathsf{T}}
The empirical covariance matrix between the principal components becomes
    \mathbf{W}^{\mathsf{T}}\mathbf{Q}\mathbf{W} \propto \mathbf{W}^{\mathsf{T}}\mathbf{W}\,\mathbf{\Lambda}\,\mathbf{W}^{\mathsf{T}}\mathbf{W} = \mathbf{\Lambda}
where Λ is the diagonal matrix of eigenvalues λ[(k)] of X^TX. λ[(k)] is equal
to the sum of the squares over the dataset associated with each component k,
that is, λ[(k)] = Σ[i] t[k]^2[(i)] = Σ[i] (x[(i)] ⋅ w[(k)])^2.
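The absence of sample covariance between different components is easy to check numerically. In this sketch (illustrative data, not from the original text), T^TT, which is proportional to the empirical covariance of the scores, comes out diagonal with the eigenvalues on its diagonal:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))
X = X - X.mean(axis=0)

eigvals, W = np.linalg.eigh(X.T @ X)
T = X @ W                                   # principal component scores

Q = T.T @ T                                 # proportional to the covariance between components
print(np.allclose(Q, np.diag(eigvals)))     # True: off-diagonal covariances vanish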
Dimensionality reduction
The transformation T = X W maps a data vector x[(i)] from an original space of p variables to a new space of p variables which are uncorrelated over the dataset. However, not all the principal components need to be kept. Keeping only the first L principal components, produced by using only the first L eigenvectors, gives the truncated transformation

    \mathbf{T}_{L} = \mathbf{X}\mathbf{W}_{L}
where the matrix T[L] now has n rows but only L columns. In other words, PCA learns a linear transformation t = W[L]^T x, x ∈ R^p, t ∈ R^L, where the columns of the p × L matrix W[L] form an orthogonal basis for the L features (the components of representation t) that are decorrelated.^[11] By construction, of all the transformed data matrices with only L columns, this score matrix maximises the variance in the original data that has been preserved, while minimising the total squared reconstruction error ‖TW^T − T[L]W[L]^T‖_2^2 or ‖X − X[L]‖_2^2.
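For instance, keeping only the first L eigenvectors and reconstructing from the truncated scores might look as follows (a sketch with made-up data; L = 2 is an arbitrary choice for illustration):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))   # correlated columns, for illustration
X = X - X.mean(axis=0)

eigvals, W = np.linalg.eigh(X.T @ X)
W = W[:, np.argsort(eigvals)[::-1]]         # eigenvectors in order of decreasing eigenvalue

L = 2
W_L = W[:, :L]                              # keep only the first L eigenvectors
T_L = X @ W_L                               # truncated n x L score matrix
X_L = T_L @ W_L.T                           # rank-L reconstruction in the original space

print(np.linalg.norm(X - X_L)**2)           # total squared reconstruction error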
[Figure: A principal components analysis scatterplot of Y-STR haplotypes calculated from repeat-count values for 37 Y-chromosomal STR markers from 354 individuals. PCA has successfully found linear combinations of the markers that separate out different clusters corresponding to different lines of individuals' Y-chromosomal genetic descent.]
Such dimensionality reduction can be a very useful step for visualising and
processing high-dimensional datasets, while still retaining as much of the
variance in the dataset as possible. For example, selecting L = 2 and keeping
only the first two principal components finds the two-dimensional plane through
the high-dimensional dataset in which the data is most spread out, so if the
data contains clusters, these too may be most spread out, and therefore most
visible when plotted in a two-dimensional diagram; whereas if two
directions through the data (or two of the original variables) are chosen at
random, the clusters may be much less spread apart from each other, and may in
fact be much more likely to substantially overlay each other, making them
indistinguishable.
Similarly, in regression analysis, the larger the number of explanatory
variables allowed, the greater is the chance of overfitting the model,
producing conclusions that fail to generalise to other datasets. One approach,
especially when there are strong correlations between different possible
explanatory variables, is to reduce them to a few principal components and then
run the regression against them, a method called principal component regression.
Dimensionality reduction may also be appropriate when the variables in a
dataset are noisy. If each column of the dataset contains independent
identically distributed Gaussian noise, then the columns of T will also contain
similarly identically distributed Gaussian noise (such a distribution is
invariant under the effects of the matrix W, which can be thought of as a
high-dimensional rotation of the co-ordinate axes). However, with more of the
total variance concentrated in the first few principal components compared to
the same noise variance, the proportionate effect of the noise is less—the
first few components achieve a higher signal-to-noise ratio. PCA thus can have
the effect of concentrating much of the signal into the first few principal
components, which can usefully be captured by dimensionality reduction; while
the later principal components may be dominated by noise, and so disposed of
without great loss. If the dataset is not too large, the significance of the
principal components can be tested using parametric bootstrap, as an aid in
determining how many principal components to retain.^[12]
Singular value decomposition
Main article: Singular value decomposition
The principal components transformation can also be associated with another matrix factorization, the singular value decomposition (SVD) of X,

    \mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{W}^{\mathsf{T}}
Here Σ is an n-by-p rectangular diagonal matrix of positive numbers σ[(k)],
called the singular values of X; U is an n-by-n matrix, the columns of which
are orthogonal unit vectors of length n called the left singular vectors of X;
and W is a p-by-p matrix whose columns are orthogonal unit vectors of length p,
called the right singular vectors of X.
In terms of this factorization, the matrix X^TX can be written
    \mathbf{X}^{\mathsf{T}}\mathbf{X} = \mathbf{W}\mathbf{\Sigma}^{\mathsf{T}}\mathbf{U}^{\mathsf{T}}\mathbf{U}\mathbf{\Sigma}\mathbf{W}^{\mathsf{T}}
                                      = \mathbf{W}\mathbf{\Sigma}^{\mathsf{T}}\mathbf{\Sigma}\mathbf{W}^{\mathsf{T}}
                                      = \mathbf{W}\mathbf{\hat{\Sigma}}^{2}\mathbf{W}^{\mathsf{T}}
where \hat{\Sigma} is the square diagonal matrix with the singular values of X and the excess zeros chopped off, satisfying \hat{\Sigma}^{2} = \mathbf{\Sigma}^{\mathsf{T}}\mathbf{\Sigma}. Comparison with the eigenvector factorization of X^TX establishes that the right singular vectors W of X are equivalent to the eigenvectors of X^TX, while the singular values σ[(k)] of X are equal to the square roots of the eigenvalues λ[(k)] of X^TX.
Using the singular value decomposition the score matrix T can be written
    \mathbf{T} = \mathbf{X}\mathbf{W} = \mathbf{U}\mathbf{\Sigma}\mathbf{W}^{\mathsf{T}}\mathbf{W} = \mathbf{U}\mathbf{\Sigma}
so each column of T is given by one of the left singular vectors of X
multiplied by the corresponding singular value. This form is also the polar
decomposition of T.
Efficient algorithms exist to calculate the SVD of X without having to form the
matrix X^TX, so computing the SVD is now the standard way to calculate a
principal components analysis from a data matrix^[citation needed], unless only
a handful of components are required.
As with the eigen-decomposition, a truncated n × L score matrix T[L] can be
obtained by considering only the first L largest singular values and their
singular vectors:
    \mathbf{T}_{L} = \mathbf{U}_{L}\mathbf{\Sigma}_{L} = \mathbf{X}\mathbf{W}_{L}
The truncation of a matrix M or T using a truncated singular value
decomposition in this way produces a truncated matrix that is the nearest
possible matrix of rank L to the original matrix, in the sense of the
difference between the two having the smallest possible Frobenius norm, a
result known as the Eckart–Young theorem [1936].
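These relations are easy to confirm numerically. The following sketch (illustrative data, not from the original text) computes the scores both ways and checks that the squared singular values match the eigenvalues of X^TX:

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6))
X = X - X.mean(axis=0)

U, sigma, Wt = np.linalg.svd(X, full_matrices=False)    # X = U diag(sigma) W^T
T_svd = U * sigma                                        # scores via the SVD: T = U Sigma

eigvals, _ = np.linalg.eigh(X.T @ X)
print(np.allclose(np.sort(sigma**2), np.sort(eigvals)))  # squared singular values = eigenvalues of X^T X
print(np.allclose(T_svd, X @ Wt.T))                      # same scores as T = X W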
Further considerations
Given a set of points in Euclidean space, the first principal component
corresponds to a line that passes through the multidimensional mean and
minimizes the sum of squares of the distances of the points from the line^[
disputed – discuss]. The second principal component corresponds to the same
concept after all correlation with the first principal component has been
subtracted from the points. The singular values (in Σ) are the square roots of
the eigenvalues of the matrix X^TX. Each eigenvalue is proportional to the
portion of the "variance" (more correctly of the sum of the squared distances
of the points from their multidimensional mean) that is associated with each
eigenvector. The sum of all the eigenvalues is equal to the sum of the squared
distances of the points from their multidimensional mean. PCA essentially
rotates the set of points around their mean in order to align with the
principal components. This moves as much of the variance as possible (using an
orthogonal transformation) into the first few dimensions. The values in the
remaining dimensions, therefore, tend to be small and may be dropped with
minimal loss of information (see below). PCA is often used in this manner for
dimensionality reduction. PCA has the distinction of being the optimal
orthogonal transformation for keeping the subspace that has largest "variance"
(as defined above). This advantage, however, comes at the price of greater computational requirements when compared, for example, and where applicable, to the discrete cosine transform, and in particular to the DCT-II, which is simply known as the "DCT". Nonlinear dimensionality reduction techniques tend to be
more computationally demanding than PCA.
PCA is sensitive to the scaling of the variables. If we have just two variables
and they have the same sample variance and are positively correlated, then the
PCA will entail a rotation by 45° and the "weights" (they are the cosines of
rotation) for the two variables with respect to the principal component will be
equal. But if we multiply all values of the first variable by 100, then the
first principal component will be almost the same as that variable, with a
small contribution from the other variable, whereas the second component will
be almost aligned with the second original variable. This means that whenever
the different variables have different units (like temperature and mass), PCA
is a somewhat arbitrary method of analysis. (Different results would be
obtained if one used Fahrenheit rather than Celsius for example.) Pearson's
original paper was entitled "On Lines and Planes of Closest Fit to Systems of
Points in Space" – "in space" implies physical Euclidean space where such
concerns do not arise. One way of making the PCA less arbitrary is to use
variables scaled so as to have unit variance, by standardizing the data and
hence use the autocorrelation matrix instead of the autocovariance matrix as a
basis for PCA. However, this compresses (or expands) the fluctuations in all
dimensions of the signal space to unit variance.
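This sensitivity can be demonstrated directly. In the sketch below (synthetic data chosen only for illustration), two positively correlated variables with equal variance give weights near (0.707, 0.707) up to sign, while multiplying the first variable by 100 makes the first component almost coincide with that variable:

import numpy as np

rng = np.random.default_rng(6)
z = rng.normal(size=500)
X = np.column_stack([z + 0.5 * rng.normal(size=500),
                     z + 0.5 * rng.normal(size=500)])   # equal variance, positively correlated
X = X - X.mean(axis=0)

def first_weight_vector(M):
    vals, vecs = np.linalg.eigh(M.T @ M)
    return vecs[:, np.argmax(vals)]

print(first_weight_vector(X))                   # roughly (0.707, 0.707) up to sign: a 45 degree rotation
print(first_weight_vector(X * [100.0, 1.0]))    # now almost entirely the rescaled first variable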
Mean subtraction (a.k.a. "mean centering") is necessary for performing
classical PCA to ensure that the first principal component describes the
direction of maximum variance. If mean subtraction is not performed, the first
principal component might instead correspond more or less to the mean of the
data. A mean of zero is needed for finding a basis that minimizes the mean
square error of the approximation of the data.^[13]
Mean-centering is unnecessary if performing a principal components analysis on
a correlation matrix, as the data are already centered after calculating
correlations. Correlations are derived from the cross-product of two standard
scores (Z-scores) or statistical moments (hence the name: Pearson
Product-Moment Correlation). Also see the article by Kromrey & Foster-Johnson
(1998) on "Mean-centering in Moderated Regression: Much Ado About Nothing".
PCA is a popular primary technique in pattern recognition. It is not, however,
optimized for class separability.^[14] However, it has been used to quantify
the distance between two or more classes by calculating center of mass for each
class in principal component space and reporting Euclidean distance between
center of mass of two or more classes.^[15] Linear discriminant analysis is an
alternative which is optimized for class separability.
Table of symbols and abbreviations
• X = {X_ij}: data matrix, consisting of the set of all data vectors, one vector per row; dimensions n × p; indices i = 1…n, j = 1…p
• n: the number of row vectors in the data set; scalar (1 × 1)
• p: the number of elements in each row vector (dimension); scalar (1 × 1)
• L: the number of dimensions in the dimensionally reduced subspace, 1 ≤ L ≤ p; scalar (1 × 1)
• u = {u_j}: vector of empirical means, one mean for each column j of the data matrix; dimensions p × 1; index j = 1…p
• s = {s_j}: vector of empirical standard deviations, one standard deviation for each column j of the data matrix; dimensions p × 1; index j = 1…p
• h = {h_i}: vector of all 1's; dimensions 1 × n; index i = 1…n
• B = {B_ij}: deviations from the mean of each column j of the data matrix; dimensions n × p; indices i = 1…n, j = 1…p
• Z = {Z_ij}: z-scores, computed using the mean and standard deviation for each column j of the data matrix; dimensions n × p; indices i = 1…n, j = 1…p
• C = {C_jj′}: covariance matrix; dimensions p × p; indices j = 1…p, j′ = 1…p
• R = {R_jj′}: correlation matrix; dimensions p × p; indices j = 1…p, j′ = 1…p
• V = {V_jj′}: matrix consisting of the set of all eigenvectors of C, one eigenvector per column; dimensions p × p; indices j = 1…p, j′ = 1…p
• D = {D_jj′}: diagonal matrix consisting of the set of all eigenvalues of C along its principal diagonal, and 0 for all other elements (note the Λ used above); dimensions p × p; indices j = 1…p, j′ = 1…p
• W = {W_jl}: matrix of basis vectors, one vector per column, where each basis vector is one of the eigenvectors of C, and where the vectors in W are a subset of those in V; dimensions p × L; indices j = 1…p, l = 1…L
• T = {T_il}: matrix consisting of n row vectors, where each vector is the projection of the corresponding data vector from matrix X onto the basis vectors contained in the columns of matrix W; dimensions n × L; indices i = 1…n, l = 1…L
Properties and limitations of PCA
Properties
Some properties of PCA include:^[10]^[page needed]
Property 1: For any integer q, 1 ≤ q ≤ p, consider the orthogonal linear transformation

    y = \mathbf{B}'x

where y is a q-element vector and \mathbf{B}' is a (q × p) matrix, and let \mathbf{\Sigma}_{y} = \mathbf{B}'\mathbf{\Sigma}\mathbf{B} be the variance-covariance matrix for y. Then the trace of \mathbf{\Sigma}_{y}, denoted \operatorname{tr}(\mathbf{\Sigma}_{y}), is maximized by taking \mathbf{B} = \mathbf{A}_{q}, where \mathbf{A}_{q} consists of the first q columns of \mathbf{A} (\mathbf{B}' is the transposition of \mathbf{B}).
Property 2: Consider again the orthonormal transformation

    y = \mathbf{B}'x

with x, \mathbf{B}, \mathbf{A} and \mathbf{\Sigma}_{y} defined as before. Then \operatorname{tr}(\mathbf{\Sigma}_{y}) is minimized by taking \mathbf{B} = \mathbf{A}_{q}^{*}, where \mathbf{A}_{q}^{*} consists of the last q columns of \mathbf{A}.
The statistical implication of this property is that the last few PCs are not
simply unstructured left-overs after removing the important PCs. Because these
last PCs have variances as small as possible they are useful in their own
right. They can help to detect unsuspected near-constant linear relationships
between the elements of x, and they may also be useful in regression, in
selecting a subset of variables from x, and in outlier detection.
Property 3: (Spectral decomposition of Σ)

    \mathbf{\Sigma} = \lambda_{1}\alpha_{1}\alpha_{1}' + \cdots + \lambda_{p}\alpha_{p}\alpha_{p}'

Before we look at its usage, we first look at the diagonal elements,

    \operatorname{Var}(x_{j}) = \sum_{k=1}^{p} \lambda_{k}\alpha_{kj}^{2}

Then, perhaps the main statistical implication of the result is that not only can we decompose the combined variances of all the elements of x into decreasing contributions due to each PC, but we can also decompose the whole covariance matrix into contributions \lambda_{k}\alpha_{k}\alpha_{k}' from each PC. Although not strictly decreasing, the elements of \lambda_{k}\alpha_{k}\alpha_{k}' will tend to become smaller as k increases, as \lambda_{k} is nonincreasing for increasing k, whereas the elements of \alpha_{k} tend to stay about the same size because of the normalization constraints: \alpha_{k}'\alpha_{k} = 1, k = 1, \dots, p.
Limitations
As noted above, the results of PCA depend on the scaling of the variables. This
can be cured by scaling each feature by its standard deviation, so that one
ends up with dimensionless features with unit variance.^[16]
The applicability of PCA as described above is limited by certain (tacit)
assumptions^[17] made in its derivation. In particular, PCA can capture linear
correlations between the features but fails when this assumption is violated
(see Figure 6a in the reference). In some cases, coordinate transformations can
restore the linearity assumption and PCA can then be applied (see kernel PCA).
Another limitation is the mean-removal process before constructing the
covariance matrix for PCA. In fields such as astronomy, all the signals are
non-negative, and the mean-removal process will force the mean of some
astrophysical exposures to be zero, which consequently creates unphysical
negative fluxes,^[18] and forward modeling has to be performed to recover the
true magnitude of the signals.^[19] As an alternative method, non-negative matrix factorization focuses only on the non-negative elements in the matrices, which is well-suited for astrophysical observations.^[20]^[21]^[22]
See more at Relation between PCA and Non-negative Matrix Factorization.
PCA is at a disadvantage if the data has not been standardized before applying
the algorithm to it. PCA transforms original data into data that is relevant to
the principal components of that data, which means that the new data variables
cannot be interpreted in the same ways that the originals were. They are linear
interpretations of the original variables. Also, if PCA is not performed
properly, there is a high likelihood of information loss.^[23]
PCA relies on a linear model. If a dataset has a pattern hidden inside it that
is nonlinear, then PCA can actually steer the analysis in the complete opposite
direction of progress.^[24]^[page needed] Researchers at Kansas State
University discovered that the sampling error in their experiments impacted the
bias of PCA results. "If the number of subjects or blocks is smaller than 30,
and/or the researcher is interested in PC's beyond the first, it may be better
to first correct for the serial correlation, before PCA is conducted".^[25] The
researchers at Kansas State also found that PCA could be "seriously biased if
the autocorrelation structure of the data is not correctly handled".^[25]
PCA and information theory
Dimensionality reduction results in a loss of information, in general.
PCA-based dimensionality reduction tends to minimize that information loss,
under certain signal and noise models.
Under the assumption that
    \mathbf{x} = \mathbf{s} + \mathbf{n},
that is, that the data vector x is the sum of the desired information-bearing signal s and a noise signal n, one can show that PCA can be optimal for dimensionality reduction, from an information-theoretic point-of-view.
In particular, Linsker showed that if s is Gaussian and n is Gaussian noise with a covariance matrix proportional to the identity matrix, the PCA maximizes the mutual information I(y; s) between the desired information s and the dimensionality-reduced output y = W[L]^T x.^[26]
If the noise is still Gaussian and has a covariance matrix proportional to the identity matrix (that is, the components of the vector n are iid), but the information-bearing signal s is non-Gaussian (which is a common scenario), PCA at least minimizes an upper bound on the information loss, which is defined as^[27]^[28]

    I(\mathbf{x}; \mathbf{s}) - I(\mathbf{y}; \mathbf{s}).
The optimality of PCA is also preserved if the noise n is iid and at least more Gaussian (in terms of the Kullback–Leibler divergence) than the information-bearing signal s.^[29] In general, even if the above signal model holds, PCA loses its information-theoretic optimality as soon as the noise n becomes dependent.
Computing PCA using the covariance method
The following is a detailed description of PCA using the covariance method, as opposed to the correlation method.^[30]
The goal is to transform a given data set X of dimension p to an alternative data set Y of smaller dimension L. Equivalently, we are seeking to find the matrix Y, where Y is the Karhunen–Loève transform (KLT) of matrix X:

    \mathbf{Y} = \mathbb{KLT}\{\mathbf{X}\}
Organize the data set
Suppose you have data comprising a set of observations of p variables, and you want to reduce the data so that each observation can be described with only L variables, L < p. Suppose further that the data are arranged as a set of n data vectors x_1 … x_n with each x_i representing a single grouped observation of the p variables.
• Write x_1 … x_n as row vectors, each with p elements.
• Place the row vectors into a single matrix X of dimensions n × p.
Calculate the empirical mean
• Find the empirical mean along each column j = 1, ..., p.
• Place the calculated mean values into an empirical mean vector u of
dimensions p × 1.
    u_{j} = \frac{1}{n}\sum_{i=1}^{n} X_{ij}
Calculate the deviations from the mean
Mean subtraction is an integral part of the solution towards finding a
principal component basis that minimizes the mean square error of approximating
the data.^[31] Hence we proceed by centering the data as follows:
• Subtract the empirical mean vector u^T from each row of the data matrix X.
• Store mean-subtracted data in the n × p matrix B:

    \mathbf{B} = \mathbf{X} - \mathbf{h}\mathbf{u}^{\mathsf{T}}

where h is an n × 1 column vector of all 1s:

    h_{i} = 1 \qquad \text{for } i = 1, \ldots, n
In some applications, each variable (column of B) may also be scaled to have a
variance equal to 1 (see Z-score).^[32] This step affects the calculated
principal components, but makes them independent of the units used to measure
the different variables.
Find the covariance matrix
• Find the p × p empirical covariance matrix C from matrix B:
    \mathbf{C} = \frac{1}{n-1}\mathbf{B}^{*}\mathbf{B}

where * is the conjugate transpose operator. If B consists entirely of real numbers, which is the case in many applications, the "conjugate transpose" is the same as the regular transpose.
• The reasoning behind using n − 1 instead of n to calculate the covariance
is Bessel's correction.
Find the eigenvectors and eigenvalues of the covariance matrix
• Compute the matrix V of eigenvectors which diagonalizes the covariance
matrix C:
    \mathbf{V}^{-1}\mathbf{C}\mathbf{V} = \mathbf{D}
where D is the diagonal matrix of eigenvalues of C. This step will
typically involve the use of a computer-based algorithm for computing
eigenvectors and eigenvalues. These algorithms are readily available as
sub-components of most matrix algebra systems, such as SAS,^[33] R, MATLAB,
^[34]^[35] Mathematica,^[36] SciPy, IDL (Interactive Data Language), or GNU
Octave as well as OpenCV.
• Matrix D will take the form of a p × p diagonal matrix, where

    D_{k\ell} = \lambda_{k} \qquad \text{for } k = \ell

is the kth eigenvalue of the covariance matrix C, and

    D_{k\ell} = 0 \qquad \text{for } k \neq \ell.
• Matrix V, also of dimension p × p, contains p column vectors, each of
length p, which represent the p eigenvectors of the covariance matrix C.
• The eigenvalues and eigenvectors are ordered and paired. The jth eigenvalue
corresponds to the jth eigenvector.
• Matrix V denotes the matrix of right eigenvectors (as opposed to left
eigenvectors). In general, the matrix of right eigenvectors need not be the
(conjugate) transpose of the matrix of left eigenvectors.
Rearrange the eigenvectors and eigenvalues
• Sort the columns of the eigenvector matrix V and eigenvalue matrix D in
order of decreasing eigenvalue.
• Make sure to maintain the correct pairings between the columns in each
matrix.
Compute the cumulative energy content for each eigenvector
• The eigenvalues represent the distribution of the source data's energy^[
clarification needed] among each of the eigenvectors, where the
eigenvectors form a basis for the data. The cumulative energy content g for
the jth eigenvector is the sum of the energy content across all of the
eigenvalues from 1 through j:
    g_{j} = \sum_{k=1}^{j} D_{kk} \qquad \text{for } j = 1, \dots, p^[citation needed]
Select a subset of the eigenvectors as basis vectors
• Save the first L columns of V as the p × L matrix W:
    W_{k\ell} = V_{k\ell} \qquad \text{for } k = 1, \dots, p \qquad \ell = 1, \dots, L

where

    1 \leq L \leq p.
• Use the vector g as a guide in choosing an appropriate value for L. The
goal is to choose a value of L as small as possible while achieving a
reasonably high value of g on a percentage basis. For example, you may want
to choose L so that the cumulative energy g is above a certain threshold,
like 90 percent. In this case, choose the smallest value of L such that
    \frac{g_{L}}{g_{p}} \geq 0.9
Project the data onto the new basis
• The projected data points are the rows of the matrix
    \mathbf{T} = \mathbf{B} \cdot \mathbf{W}
That is, the first column of T {\displaystyle \mathbf {T} } \mathbf {T} is the
projection of the data points onto the first principal component, the second
column is the projection onto the second principal component, etc.
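Putting the steps above together, one possible end-to-end sketch in NumPy (the synthetic data and the 90 percent energy threshold are assumptions made only for this example):

import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # n = 100 observations of p = 5 variables

u = X.mean(axis=0)                       # empirical mean of each column
B = X - u                                # deviations from the mean
C = (B.conj().T @ B) / (X.shape[0] - 1)  # covariance matrix with Bessel's correction

eigvals, V = np.linalg.eigh(C)           # eigenvalues and eigenvectors of C
order = np.argsort(eigvals)[::-1]        # rearrange in order of decreasing eigenvalue
eigvals, V = eigvals[order], V[:, order]

g = np.cumsum(eigvals)                   # cumulative energy content
L = np.searchsorted(g / g[-1], 0.9) + 1  # smallest L with at least 90% of the energy
W = V[:, :L]                             # basis vectors: first L eigenvectors

T = B @ W                                # project the data onto the new basis
print(L, T.shape)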
Derivation of PCA using the covariance method
Let X be a d-dimensional random vector expressed as column vector. Without loss
of generality, assume X has zero mean.
We want to find (∗) a d × d orthonormal transformation matrix P so that PX has a diagonal covariance matrix (that is, PX is a random vector with all its distinct components pairwise uncorrelated).
A quick computation assuming P were unitary yields:

    \operatorname{cov}(PX) = \operatorname{E}[PX~(PX)^{*}] = \operatorname{E}[PX~X^{*}P^{*}] = P\operatorname{E}[XX^{*}]P^{*} = P\operatorname{cov}(X)P^{-1}
Hence (∗) holds if and only if cov(X) is diagonalisable by P.
This is very constructive, as cov(X) is guaranteed to be a non-negative
definite matrix and thus is guaranteed to be diagonalisable by some unitary
matrix.
Covariance-free computation
In practical implementations, especially with high dimensional data (large p),
the naive covariance method is rarely used because it is not efficient due to
high computational and memory costs of explicitly determining the covariance
matrix. The covariance-free approach avoids the np^2 operations of explicitly
calculating and storing the covariance matrix X^TX, instead utilizing one of the matrix-free methods, for example, based on the function evaluating the product X^T(X r) at the cost of 2np operations.
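A tiny sketch of the idea (illustrative function and variable names): the map r ↦ X^T(X r) is evaluated as two matrix-vector products of about np operations each, without ever forming the p × p matrix X^TX:

import numpy as np

def cov_free_matvec(X, r):
    # Evaluate (X^T X) r as X^T (X r): two O(np) matrix-vector products,
    # never forming the p x p matrix X^T X explicitly.
    return X.T @ (X @ r)

rng = np.random.default_rng(8)
X = rng.normal(size=(1000, 50))
X = X - X.mean(axis=0)
r = rng.normal(size=50)
print(np.allclose(cov_free_matvec(X, r), (X.T @ X) @ r))   # same result, different cost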
Iterative computation
One way to compute the first principal component efficiently^[37] is shown in
the following pseudo-code, for a data matrix X with zero mean, without ever
computing its covariance matrix.
r = a random vector of length p
r = r / norm(r)
do c times:
      s = 0 (a vector of length p)
      for each row x in X
            s = s + (x ⋅ r) x
      λ = r^T s         // λ is the eigenvalue
      error = |λ ⋅ r − s|
      r = s / norm(s)
      exit if error < tolerance
return λ, r
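A direct NumPy transcription of this pseudo-code might look as follows (a sketch; the default iteration count and the convergence tolerance are illustrative choices, not prescribed by the text):

import numpy as np

def first_principal_component(X, c=100, tolerance=1e-9):
    # Power iteration on X^T X without forming the covariance matrix,
    # following the pseudo-code above. X is assumed to have zero column means.
    rng = np.random.default_rng(9)
    r = rng.normal(size=X.shape[1])
    r /= np.linalg.norm(r)
    for _ in range(c):
        s = X.T @ (X @ r)                 # accumulates the sum over rows of (x . r) x
        lam = r @ s                       # current eigenvalue estimate
        error = np.linalg.norm(lam * r - s)
        r = s / np.linalg.norm(s)
        if error < tolerance:
            break
    return lam, r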