---
title: "PCA analysis of NYCABS datase"
author: "Katerina Dimitrova, Jose Romero, Sergi Munoz"
date: "March 18, 2018"
output: pdf_document
---
```{r,include=FALSE}
#knitr::opts_chunk$set(echo = TRUE)
```
```{r,include=FALSE}
#setwd("C://Users/Sergi/Desktop/Sergi/ADEI") #Change
setwd("D:/ADEI/ADEI.git/trunk") #Change
```
# Previous work
## Load required packages
```{r,include=FALSE}
rm(list=ls())
requiredPackages <- c("effects","FactoMineR","car", "factoextra","RColorBrewer","ggplot2","missMDA","mvoutlier","dplyr","ggmap","ggthemes","knitr","MVA")
missingPackages <- requiredPackages[!(requiredPackages %in% installed.packages()[,"Package"])]
if(length(missingPackages)) install.packages(missingPackages)
lapply(requiredPackages, require, character.only = TRUE)
# Useful function: quartiles plus mild/severe outlier fences (Tukey's rule)
calcQ <- function(x) {
  s.x <- summary(x)
  iqr <- s.x[5] - s.x[2]  # IQR = Q3 - Q1
  list(souti = s.x[2] - 3 * iqr,   # severe lower fence
       mouti = s.x[2] - 1.5 * iqr, # mild lower fence
       min = s.x[1], q1 = s.x[2], q2 = s.x[3], q3 = s.x[5], max = s.x[6],
       mouts = s.x[5] + 1.5 * iqr, # mild upper fence
       souts = s.x[5] + 3 * iqr)   # severe upper fence
}
```
# PCA analysis
Two common rules guide how many components to retain. The Kaiser rule drops all components with eigenvalues under 1.0 in a normalized PCA. The elbow rule uses the scree plot: when the drop in eigenvalues ceases and the curve bends toward a less steep decline, we drop all components after the one starting the elbow. A minimal sketch of both rules follows.
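The sketch below assumes `res.pca` is a fitted `FactoMineR::PCA` result (created in the next chunk); `res.pca$eig` holds the eigenvalues in its first column.
```{r, eval=FALSE}
# Kaiser rule: keep components whose eigenvalue is at least 1 (normalized PCA)
eig <- res.pca$eig[, 1]
kaiser_keep <- which(eig >= 1)
# Elbow rule: plot the eigenvalues and look for the bend toward a flat decline
plot(eig, type = "b", xlab = "Component", ylab = "Eigenvalue")
abline(h = 1, lty = 2)  # Kaiser threshold, for reference
```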
## I. Eigenvalues and axes
For the PCA we take all numerical variables as active, with Total_amount as a supplementary quantitative variable and AnyTip as a supplementary qualitative variable.
```{r}
load("Taxi5000_raw_DataDefinitivev2.RData")
library(FactoMineR)
names(df)
vars_con_pca<-c(6,7,8,9,10,11,12,13,14,15,16,17,18,22,23,24,25,26)
#From the plot we see that Trip_distance, Trip_length, Travel_time and Fare_amount are strongly correlated with Total_amount, whereas the pickup and dropoff latitudes and longitudes are correlated with each other but have no relation to Total_amount. The remaining variables, such as Passenger_count and the different taxes and fees, are also independent.
res.pca<-PCA(df[,vars_con_pca], quanti.sup = 13, quali.sup = 14, ncp = 6 ) # TotalAmount and AnyTip
fviz_eig(res.pca, addlabels = TRUE)
fviz_eig(res.pca, choice = "eigenvalue",addlabels = TRUE)
```
## II. Individuals' point of view
## Look at individuals that are too contributive
```{r}
summary(res.pca, dig = 2, nbelements = 17, nbind=3, ncp=4)
#The summary confirms the correlations between the variables that we already interpreted from the earlier plots.
#The plot shows that individuals who had to pay more tend to leave a tip.
plot.PCA(res.pca, choix=c("ind"),cex=0.8,col.ind="grey80",select="contrib15",axes=c(1,2))
#DIMENSION 1
#Since the multivariate outlier detection did not find outliers well enough, we obtain them from the principal components instead, using the first 4 dimensions.
#Extreme outliers in dimension 1 (severe upper fence: Q3 + 3*IQR)
summary(res.pca$ind$coord[,1])
iqrvar<-IQR(res.pca$ind$coord[,1])
quantil3<-quantile(res.pca$ind$coord[,1], .75);quantil3 #get 3rd quartile
outliers<-which(res.pca$ind$coord[,1]>(iqrvar*3)+quantil3);length(outliers)
df$f.outlierPCAd1<-0
df[outliers,"f.outlierPCAd1"]<-1
df$f.outlierPCAd1<-factor(df$f.outlierPCAd1,labels=c("NoOutDim1", "YesOutDim1"))
summary(df$f.outlierPCAd1)
names(df)
#catdes(,names(df)[c(22)])
#DIMENSION 2
#Extreme outliers in dimension 2
summary(res.pca$ind$coord[,2])
iqrvar<-IQR(res.pca$ind$coord[,2])
quantil3<-quantile(res.pca$ind$coord[,2], .75);quantil3 #get 3rd quartile
outliers2<-which(res.pca$ind$coord[,2]>(iqrvar*3)+quantil3);length(outliers2)
df$f.outlierPCAd2<-0
df[outliers2,"f.outlierPCAd2"]<-1
df$f.outlierPCAd2<-factor(df$f.outlierPCAd2,labels=c("NoOutDim2", "YesOutDim2"))
summary(df$f.outlierPCAd2)
names(df)
#DIMENSION 3
#Extreme outliers in dimension 3
summary(res.pca$ind$coord[,3])
iqrvar<-IQR(res.pca$ind$coord[,3])
quantil3<-quantile(res.pca$ind$coord[,3], .75);quantil3 #get 3rd quartile
outliers3<-which(res.pca$ind$coord[,3]>(iqrvar*3)+quantil3);length(outliers3)
df$f.outlierPCAd3<-0
#outliers3 are not flagged here (presumably none exceed the fence), so the factor has a single level
df$f.outlierPCAd3<-factor(df$f.outlierPCAd3,labels=c("NoOutDim3"))
summary(df$f.outlierPCAd3)
names(df)
#DIMENSION 4
#Extreme outliers in dimension 4
summary(res.pca$ind$coord[,4])
iqrvar<-IQR(res.pca$ind$coord[,4])
quantil3<-quantile(res.pca$ind$coord[,4], .75);quantil3 #get 3rd quartile
outliers4<-which(res.pca$ind$coord[,4]>(iqrvar*3)+quantil3);length(outliers4)
df$f.outlierPCAd4<-0
#outliers4 are not flagged here (presumably none exceed the fence), so the factor has a single level
df$f.outlierPCAd4<-factor(df$f.outlierPCAd4,labels=c("NoOutDim4"))
summary(df$f.outlierPCAd4)
names(df)
#Combining the four dimensions, we obtain 121 extreme outliers in total.
llvout<- c(outliers, outliers2, outliers3, outliers4);length(llvout)
df$f.outlierPCA<-0
df[llvout,"f.outlierPCA"]<-1
df$f.outlierPCA<-factor(df$f.outlierPCA,labels=c("NoOut", "YesOut"))
summary(df$f.outlierPCA)
names(df)
catdes_var <-c(1,4:48)
# The variables most strongly linked to the outliers are the pickup and dropoff latitudes and longitudes. All rides with a small MTA_tax and improvement surcharge are outliers, as are all rides with RateCodeID = special rate or a dispatched trip type. So some categories are clearly different.
catdes(df[,catdes_var], num.var = 46)
#We run PCA again with the overly contributive individuals as supplementary
contr_ind = c(outliers, outliers2, outliers3, outliers4)
res.pca<-PCA(df[,vars_con_pca],ind.sup=contr_ind, quanti.sup = 13, quali.sup = 14, ncp = 6 ) # TotalAmount and AnyTip
barplot(res.pca$eig[,1], main="Eigenvalues", names.arg = paste("dim", 1:nrow(res.pca$eig)))
# After the PCA transformation, PC1 covers 31.6% of the variance, PC2 14.8%, PC3 9.9%, PC4 8.2%, and the remaining components around 5-6% or less.
kaiser <- res.pca$eig[1:7,1] #keep only eigenvalues >= 1 -> first 7
#Under the Kaiser rule we keep all eigenvalues greater than 1, which means keeping the first 7 dimensions and 84.2% of the variance.
#factoextra scree plots
fviz_eig(res.pca, addlabels = TRUE)
fviz_eig(res.pca, choice = "eigenvalue",addlabels = TRUE)
#According to the elbow rule we keep the first 4 dimensions, as the slope of the plot shows. We follow this rule, even though the choice makes little difference here; 64.5% of the variance is retained.
elbow <- res.pca$eig[1:4,1]
```
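The four outlier blocks above repeat the same fence computation. As a minimal sketch (not part of the original analysis), the rule can be factored into a helper:
```{r, eval=FALSE}
# Reusable severe-outlier detector for one PCA dimension,
# using the same upper fence (Q3 + 3*IQR) as the chunks above
pcaOutliers <- function(res.pca, dim) {
  coords <- res.pca$ind$coord[, dim]
  fence <- quantile(coords, .75) + 3 * IQR(coords)
  which(coords > fence)
}
# Usage: outlier indices on each of the first four principal components
lapply(1:4, function(d) pcaOutliers(res.pca, d))
```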
## III. Interpreting the axes
```{r}
# Inferential criteria
dimdesc (res.pca, axes=1:4)
#The first dimension is best described by the quantitative variables Total_amount, Trip_distance and Fare_amount; the second by the coordinates of the start and end of the trip; the third by MTA_tax and Improvement_surcharge; and the fourth by the longitudes of the pickup and dropoff points.
plot(res.pca,choix="var", cex = 0.75)
plot(res.pca,choix="var", cex = 0.75, axes = (3:4)) # 3rd and 4th principal components
#modern factoextra plot
# Variables with a higher cos^2 are more important for interpreting both active and supplementary observations. As the graph below shows, the most important ones are Trip_length, Trip_distance and Fare_amount; less important are MTA_tax, Tolls_amount, Improvement_surcharge, etc.
fviz_pca_var(res.pca,col.var="cos2", repel=TRUE)+scale_color_gradient2(low="green", mid="blue", high="red",midpoint=0.5)+theme_bw()
```
# Leveling
For a better understanding of the results we rename the levels of the factor variables.
```{r}
levels(df$f.fare_amount)
levels(df$f.passenger) <- c("onePassenger","multiPassenger")
levels(df$f.pickup_longitude) <- c("p.Y1","p.Y2","p.Y3","p.Y4") #pickup->p., longitude->Y
levels(df$f.pickup_latitude) <- c("p.X1","p.X2","p.X3","p.X4") #pickup->p., latitude->X
levels(df$f.distance) <- c("Dist1","Dist2", "Dist3", "Dist4")#distance 1-4, 1 shortest
levels(df$f.dropoff_longitude) <- c("d.Y1","d.Y2", "d.Y3", "d.Y4")
levels(df$f.dropoff_latitude) <- c("d.X1","d.X2", "d.X3", "d.X4")
levels(df$f.fare_amount) <- c("FAmount1","FAmount2", "FAmount3", "FAmount4")
levels(df$f.extra) <- c("smallExtra","highExtra")
levels(df$f.MTA_tax) <- c("smallMTA","highMTA")
levels(df$f.Improvement_surcharge) <- c("smallSurcharge","highSurcharge")
levels(df$f.tip_amount) <- c("smallTip","highTip")
levels(df$f.toll) <- c("smallToll","highToll")
levels(df$f.total) <- c("CheapestTrip","CheapTrip","MediumTrip","ExpensiveTrip")
levels(df$f.ttime) <- c("Time1","Time2", "Time3", "Time4")
levels(df$f.espeed) <- c("Speed1","Speed2", "Speed3","Speed4")
levels(df$f.outlierPCAd1) <- c("Normald1", "Outlierd1")
levels(df$f.outlierPCAd2) <- c("Normald2", "Outlierd2")
levels(df$f.outlierPCAd3) <- c("Normald3", "Outlierd3")
levels(df$f.outlierPCAd4) <- c("Normald4", "Outlierd4")
levels(df$f.outlierPCA) <- c("Normal", "Outlier")
```
## IV. PCA with supplementary individuals
```{r}
vars_con_pca <- names(df)[c(6:16,18,23:26)]
# We run the PCA again, adding the factor variables (fare amount, total, pickup period, etc.) as supplementary qualitative variables in order to interpret the results even better. The detected outliers are used as supplementary individuals.
res.pca<-PCA(df[,c(vars_con_pca, "f.fare_amount", "f.total", "pick_up_period", "f.passenger", "f.pickup_longitude", "f.pickup_latitude", "f.dropoff_longitude", "f.dropoff_latitude", "f.Improvement_surcharge", "f.MTA_tax")], ncp=6, quanti.sup=which(vars_con_pca=="Total_amount"), ind.sup = contr_ind, quali.sup = 17:26, graph= FALSE)
plot(res.pca,choix="var", cex = 0.75)
plot(res.pca,choix="var", cex = 0.75, axes = (3:4)) # 3rd and 4th principal components
fviz_pca_var(res.pca,col.var="cos2", repel=TRUE)+scale_color_gradient2(low="green", mid="blue", high="red",midpoint=0.5)+theme_bw()
#We can see that trips in the afternoon tend to be longer, and thus more expensive, than those at night and in the morning.
#The qualitative variables lie close to the center, so they are not very important for the principal components.
plot.PCA(res.pca, choix=c("ind"),cex=0.8,col.ind="grey80",select="contrib15",axes=c(1,2))
```
# Hierarchical clustering
We generate 6 clusters with the hierarchical method, taking the projection obtained by the PCA as the source dataset.
The resulting table shows how the individuals are distributed across the clusters (excluding the multivariate outliers). A cross-check sketch follows the chunk.
```{r}
library(FactoMineR)
res.hcpc <-HCPC(res.pca, nb.clust = 6, order=TRUE)
table (res.hcpc$data.clust$clust)
```
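As an optional cross-check (a sketch, not part of the original workflow), a similar partition can be built with base R's `hclust` on the retained PCA coordinates. HCPC also uses the Ward criterion, plus a k-means consolidation step by default, so the two partitions may differ slightly.
```{r, eval=FALSE}
# Ward clustering on the first 6 principal components (active individuals only)
d <- dist(res.pca$ind$coord[, 1:6])
hc <- hclust(d, method = "ward.D2")
table(cutree(hc, k = 6))  # compare with the HCPC cluster sizes above
```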
## Variable description
Below is the categorical description of each cluster, followed by our interpretation.
```{r}
#Block A: description by variables
res.hcpc$desc.var
```
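To inspect just the chi-square independence tests discussed next, the test table can be pulled out directly (a sketch, assuming the standard catdes structure of the HCPC output):
```{r, eval=FALSE}
# p-values of the chi-square tests linking each qualitative variable to the partition
res.hcpc$desc.var$test.chi2
```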
First, the chi-square test lets us reject the null hypothesis of independence for every qualitative variable considered. This means all of them contribute in some way to the clustering distances and splits, as expected (we took them from the axis interpretation analysis, so we already knew they were significant for the PCA projections).
Digging into each category, we can determine the following characterization:
### Category 1
It is defined by individuals whose coordinates fall in these ranges:
- pickup_latitude=(40.8,40.91]
- pickup_longitude=(-73.95,-73.92]
- dropoff_latitude=(40.8,40.91]
- dropoff_longitude=(-73.95,-73.91]
It also has a significant representation (Cla/Mod of almost 48%) of rows whose total amount falls in the (-1,7.8] range.
### Category 2
Very similar to the previous case, also defined by coordinate values, but this time the ranges are:
- f.pickup_latitude=(40.75,40.8]
- f.pickup_longitude=(-73.92,-73.79]
- f.dropoff_latitude=(40.75,40.8]
- f.dropoff_longitude=(-73.91,-73.75]
### Category 3
As with categories 1 and 2, it is characterized by the coordinates where the client was picked up and dropped off. This time the ranges are:
- f.pickup_latitude=(40.58,40.69]
- f.pickup_longitude=(-74.03,-73.96]
- f.dropoff_longitude=(-74.03,-73.97]
- f.dropoff_latitude=(40.58,40.69]
Note that the coordinate ranges defining each cluster do not overlap across clusters, which makes total sense.
Furthermore, this category also has a significant representation (almost 41%) of rows whose total amount falls in the (11,16.6] range.
### Category 4
This category is determined by the passenger variable, with a huge gap between its first two v.test values. Concretely, 100% of its individuals fall in the (1,6] range of f.passenger.
### Category 5
This category gathers all rows with MTA_tax = 0; 90% of the individuals whose improvement surcharge is 0 are also in this category.
So MTA_tax and Improvement_surcharge explain the behaviour of the category 5 rows, and 63% of the individuals in this category have a total_amount in the (16.6,46] range.
### Category 6
Category 6 is defined by rows whose total_amount is in the (16.6,46] range and whose fare_amount is in (14,42.5] (the most expensive one).
We can see that 100% of these individuals have:
- MTA_tax = 0.5
- Improvement surcharge != 0
### $quanti section
All the peculiarities noted in the cluster analysis are confirmed by the quantitative-variables output:
- For clusters 1, 2 and 3 the most significant quantitative variables are the latitudes and longitudes.
- For cluster 4, passenger_count is at the top.
- Cluster 5 has a large negative value for MTA_tax (consistent with the description above).
- In cluster 6, Total_amount, Trip_distance and Fare_amount stand out as the most correlated.
## Axes description
At this point this output does not add further detail to our clusters. But it confirms that the axis interpretation made in an earlier section matches the characterization of the clusters: the variables that best explain each dimension are also the ones distinguished in the clusters with the highest v.test values for that dimension.
```{r}
#Block B: description by axes
res.hcpc$desc.axes
```
## Individual analysis
Again, this command helps confirm the conclusions drawn so far.
As an example, we look at a paragon of C6: its total_amount sits near the middle of the last range (16.6,46] for this variable (total_amount = 26.3). A distinguished C6 row, in turn, has one of the highest possible values of total_amount (44.8).
Tracking the rest of the clusters in the same way, the conclusions above remain concordant.
```{r}
#Block C: individuals
res.hcpc$desc.ind
#paragon of C6
df["678042",]
#distinguished individual of C6
df["285458",]
```
## Assigning cluster groups
Now we assign to each row the cluster decided by the HCPC method, and we treat the outliers (which have not been taken into account until now) as group 7.
```{r}
#Give all the multidimensional outliers (supplementary individuals) the last class
df$claHP<-7
df[row.names(res.hcpc$data.clust),"claHP"]<-res.hcpc$data.clust$clust
table(df$claHP)
```
# K-means classification
We run the kmeans command with 6 clusters in order to get the same number of groups as in the hierarchical process. A quick sanity check of this choice of k is sketched after the chunk.
```{r}
ppcc<-res.pca$ind$coord[,1:6]
dim(ppcc)
kc<-kmeans(ppcc,6,iter.max = 30, trace=T)
table(kc$cluster)
```
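As a sanity check on k = 6 (a sketch, not part of the original analysis), the total within-cluster sum of squares can be plotted over a range of k and inspected for an elbow:
```{r, eval=FALSE}
set.seed(1234)  # kmeans uses random starts; fix the seed for reproducibility
wss <- sapply(2:10, function(k) kmeans(ppcc, k, iter.max = 30)$tot.withinss)
plot(2:10, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```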
## Assigning cluster groups
As with HCPC before, we assign the clusters so that group 7 is reserved for the outliers.
```{r}
df$claKM<-7
df[names(kc$cluster),"claKM"]<-kc$cluster
kc$betweenss/kc$totss
table(df$claKM)
```
## Characterization of the k-means clustering
Since we did not manage to execute the catdes command, we got a bit creative and searched the internet. We found the $centers component interesting and decided to use it to get a notion of whether the k-means clustering is similar to the HCPC one. A possible fix for catdes is sketched after the chunk below.
```{r}
kc$centers
#catdes(df,47)
#check whether the two clusterings agree
table(df$claHP,df$claKM)
```
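A likely reason catdes failed is that claKM was created as a numeric column, while catdes needs the class variable to be a factor. A minimal sketch under that assumption:
```{r, eval=FALSE}
# Convert the k-means labels to a factor before describing the partition
df$claKM.f <- factor(df$claKM)
# num.var is the column position of the class factor inside the data frame
catdes(df, num.var = which(names(df) == "claKM.f"))
```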
The output is a bit messy, but we can clearly see that cluster 1 is centred on Dimension 1 (the axis that grows with total_amount) and that cluster 4 is strongly correlated with Dimension 2.
These two clusters, as we can observe in the next section, correspond precisely to clusters 6 and 4 (in that order) of the HCPC labelling. So, even though we have not been able to interpret the categorical description, we can infer that both methods (k-means and hierarchical) actually generate very similar groups of individuals.
### Re-labeling
Finally, we relabel the k-means groups so that equal cluster numbers refer to the same group, avoiding further confusion.
To finish, we compute the sum of the diagonal of the resulting contingency table, i.e. the fraction of individuals on which the two labellings agree, to gauge how much k-means deviates from HCPC. A sketch for deriving the relabeling automatically follows the chunk.
```{r}
df$claHP<-factor(df$claHP,labels=paste("kHP-",1:7))
df$claKM<-factor(df$claKM,levels=c(3,2,5,1,4,6,7),labels=c("kKM-3","kKM-2","kKM-5","kKM-1","kKM-4","kKM-6","kKM-7"))
tt<-table(df$claHP,df$claKM)
tt
sum(diag(tt)/sum(tt))
```
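The level order c(3,2,5,1,4,6,7) above was read off the contingency table by hand. As a sketch (to be run before the factor() relabeling, and assuming each HCPC cluster has a unique best k-means match), the mapping can be derived automatically:
```{r, eval=FALSE}
raw <- table(df$claHP, df$claKM)  # contingency table with both labelings still numeric
best <- apply(raw, 1, which.max)  # k-means label best matching each HCPC cluster
best                              # compare with the hand-derived order c(3,2,5,1,4,6,7)
```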