---
title: "PCA analysis of NYCABS datase"
author: "Katerina Dimitrova, Jose Romero, Sergi Munoz"
date: "March 18, 2018"
output: pdf_document
---
```{r,include=FALSE}
#knitr::opts_chunk$set(echo = TRUE)
```
```{r,include=FALSE}
#setwd("C://Users/Sergi/Desktop/Sergi/ADEI") #Change
setwd("D:/ADEI/ADEI.git/trunk") #Change
```
# Previous work
## Load required packages
```{r,include=FALSE}
rm(list=ls())
requiredPackages <- c("effects","FactoMineR","car", "factoextra","RColorBrewer","ggplot2","missMDA","mvoutlier","dplyr","ggmap","ggthemes","knitr","MVA")
missingPackages <- requiredPackages[!(requiredPackages %in% installed.packages()[,"Package"])]
if(length(missingPackages)) install.packages(missingPackages)
lapply(requiredPackages, require, character.only = TRUE)
# Useful function: quartiles plus mild/severe outlier fences (Tukey's rule)
calcQ <- function(x) {
  s.x <- summary(x)
  iqr <- s.x[5] - s.x[2]  # IQR = Q3 - Q1
  list(souti = s.x[2] - 3 * iqr,   # severe lower fence
       mouti = s.x[2] - 1.5 * iqr, # mild lower fence
       min = s.x[1], q1 = s.x[2], q2 = s.x[3], q3 = s.x[5], max = s.x[6],
       mouts = s.x[5] + 1.5 * iqr, # mild upper fence
       souts = s.x[5] + 3 * iqr)   # severe upper fence
}
```
# PCA analysis
Two common rules guide how many components to retain. The Kaiser rule drops all components with eigenvalues under 1.0 in a normalized PCA. The elbow rule uses the scree plot: when the drop in eigenvalues ceases and the curve bends toward a less steep decline, we drop all components after the one starting the elbow. A minimal sketch of both rules follows.
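The sketch below assumes `res.pca` is a fitted `FactoMineR::PCA` result (created in the next chunk); `res.pca$eig` holds the eigenvalues in its first column.
```{r, eval=FALSE}
# Kaiser rule: keep components whose eigenvalue is at least 1 (normalized PCA)
eig <- res.pca$eig[, 1]
kaiser_keep <- which(eig >= 1)
# Elbow rule: plot the eigenvalues and look for the bend toward a flat decline
plot(eig, type = "b", xlab = "Component", ylab = "Eigenvalue")
abline(h = 1, lty = 2)  # Kaiser threshold, for reference
```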
## I. Eigenvalues and axes
For the PCA we take all numerical variables as active, with Total_amount as a supplementary quantitative variable and AnyTip as a supplementary qualitative variable.
```{r}
load("Taxi5000_raw_DataDefinitivev2.RData")
library(FactoMineR)
names(df)
vars_con_pca<-c(6,7,8,9,10,11,12,13,14,15,16,17,18,22,23,24,25,26)
#From the plot we see that Trip_distance, Trip_length, Travel_time and Fare_amount are strongly correlated with Total_amount, whereas the pickup and dropoff latitudes and longitudes are correlated with each other but have no relation to Total_amount. The remaining variables, such as Passenger_count and the different taxes and fees, are also independent.
res.pca<-PCA(df[,vars_con_pca], quanti.sup = 13, quali.sup = 14, ncp = 6 ) # TotalAmount and AnyTip
fviz_eig(res.pca, addlabels = TRUE)
fviz_eig(res.pca, choice = "eigenvalue",addlabels = TRUE)
```
## II. Individuals' point of view
## Look at individuals that are too contributive
```{r}
summary(res.pca, dig = 2, nbelements = 17, nbind=3, ncp=4)
#The summary confirms the correlations between the variables that we already interpreted from the earlier plots.
#The plot shows that individuals who had to pay more tend to leave a tip.
plot.PCA(res.pca, choix=c("ind"),cex=0.8,col.ind="grey80",select="contrib15",axes=c(1,2))
#DIMENSION 1
#Since the multivariate outlier detection did not find outliers well enough, we obtain them from the principal components instead, using the first 4 dimensions.
#Extreme outliers in dimension 1 (severe upper fence: Q3 + 3*IQR)
summary(res.pca$ind$coord[,1])
iqrvar<-IQR(res.pca$ind$coord[,1])
quantil3<-quantile(res.pca$ind$coord[,1], .75);quantil3 #get 3rd quartile
outliers<-which(res.pca$ind$coord[,1]>(iqrvar*3)+quantil3);length(outliers)
df$f.outlierPCAd1<-0
df[outliers,"f.outlierPCAd1"]<-1
df$f.outlierPCAd1<-factor(df$f.outlierPCAd1,labels=c("NoOutDim1", "YesOutDim1"))
summary(df$f.outlierPCAd1)
names(df)
#catdes(,names(df)[c(22)])
#DIMENSION 2
#Extreme outliers in dimension 2
summary(res.pca$ind$coord[,2])
iqrvar<-IQR(res.pca$ind$coord[,2])
quantil3<-quantile(res.pca$ind$coord[,2], .75);quantil3 #get 3rd quartile
outliers2<-which(res.pca$ind$coord[,2]>(iqrvar*3)+quantil3);length(outliers2)
df$f.outlierPCAd2<-0
df[outliers2,"f.outlierPCAd2"]<-1
df$f.outlierPCAd2<-factor(df$f.outlierPCAd2,labels=c("NoOutDim2", "YesOutDim2"))
summary(df$f.outlierPCAd2)
names(df)
#DIMENSION 3
#Extreme outliers in dimension 3
summary(res.pca$ind$coord[,3])
iqrvar<-IQR(res.pca$ind$coord[,3])
quantil3<-quantile(res.pca$ind$coord[,3], .75);quantil3 #get 3rd quartile
outliers3<-which(res.pca$ind$coord[,3]>(iqrvar*3)+quantil3);length(outliers3)
df$f.outlierPCAd3<-0
#outliers3 are not flagged here (presumably none exceed the fence), so the factor has a single level
df$f.outlierPCAd3<-factor(df$f.outlierPCAd3,labels=c("NoOutDim3"))
summary(df$f.outlierPCAd3)
names(df)
#DIMENSION 4
#Extreme outliers in dimension 4
summary(res.pca$ind$coord[,4])
iqrvar<-IQR(res.pca$ind$coord[,4])
quantil3<-quantile(res.pca$ind$coord[,4], .75);quantil3 #get 3rd quartile
outliers4<-which(res.pca$ind$coord[,4]>(iqrvar*3)+quantil3);length(outliers4)
df$f.outlierPCAd4<-0
#outliers4 are not flagged here (presumably none exceed the fence), so the factor has a single level
df$f.outlierPCAd4<-factor(df$f.outlierPCAd4,labels=c("NoOutDim4"))
summary(df$f.outlierPCAd4)
names(df)
#Combining the four dimensions, we obtain 121 extreme outliers in total.
llvout<- c(outliers, outliers2, outliers3, outliers4);length(llvout)
df$f.outlierPCA<-0
df[llvout,"f.outlierPCA"]<-1
df$f.outlierPCA<-factor(df$f.outlierPCA,labels=c("NoOut", "YesOut"))
summary(df$f.outlierPCA)
names(df)
catdes_var <-c(1,4:48)
# The variables most strongly linked to the outliers are the pickup and dropoff latitudes and longitudes. All rides with a small MTA_tax and improvement surcharge are outliers, as are all rides with RateCodeID = special rate or a dispatched trip type. So some categories are clearly different.
catdes(df[,catdes_var], num.var = 46)
#We run PCA again with the overly contributive individuals as supplementary
contr_ind = c(outliers, outliers2, outliers3, outliers4)
res.pca<-PCA(df[,vars_con_pca],ind.sup=contr_ind, quanti.sup = 13, quali.sup = 14, ncp = 6 ) # TotalAmount and AnyTip
barplot(res.pca$eig[,1], main="Eigenvalues", names.arg = paste("dim", 1:nrow(res.pca$eig)))
# After the PCA transformation, PC1 covers 31.6% of the variance, PC2 14.8%, PC3 9.9%, PC4 8.2%, and the remaining components around 5-6% or less.
kaiser <- res.pca$eig[1:7,1] #keep only eigenvalues >= 1 -> first 7
#Under the Kaiser rule we keep all eigenvalues greater than 1, which means keeping the first 7 dimensions and 84.2% of the variance.
#factoextra scree plots
fviz_eig(res.pca, addlabels = TRUE)
fviz_eig(res.pca, choice = "eigenvalue",addlabels = TRUE)
#According to the elbow rule we keep the first 4 dimensions, as the slope of the plot shows. We follow this rule, even though the choice makes little difference here; 64.5% of the variance is retained.
elbow <- res.pca$eig[1:4,1]
```
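The four outlier blocks above repeat the same fence computation. As a minimal sketch (not part of the original analysis), the rule can be factored into a helper:
```{r, eval=FALSE}
# Reusable severe-outlier detector for one PCA dimension,
# using the same upper fence (Q3 + 3*IQR) as the chunks above
pcaOutliers <- function(res.pca, dim) {
  coords <- res.pca$ind$coord[, dim]
  fence <- quantile(coords, .75) + 3 * IQR(coords)
  which(coords > fence)
}
# Usage: outlier indices on each of the first four principal components
lapply(1:4, function(d) pcaOutliers(res.pca, d))
```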
## III. Interpreting the axes
```{r}
# Inferential criteria
dimdesc (res.pca, axes=1:4)
#The first dimension is best described by the quantitative variables Total_amount, Trip_distance and Fare_amount; the second by the coordinates of the start and end of the trip; the third by MTA_tax and Improvement_surcharge; and the fourth by the longitudes of the pickup and dropoff points.
plot(res.pca,choix="var", cex = 0.75)
plot(res.pca,choix="var", cex = 0.75, axes = (3:4)) # 3rd and 4th principal components
#modern factoextra plot
# Variables with a higher cos^2 are more important for interpreting both active and supplementary observations. As the graph below shows, the most important ones are Trip_length, Trip_distance and Fare_amount; less important are MTA_tax, Tolls_amount, Improvement_surcharge, etc.
fviz_pca_var(res.pca,col.var="cos2", repel=TRUE)+scale_color_gradient2(low="green", mid="blue", high="red",midpoint=0.5)+theme_bw()
```
# Leveling
For a better understanding of the results we rename the levels of the factor variables.
```{r}
levels(df$f.fare_amount)
levels(df$f.passenger) <- c("onePassenger","multiPassenger")
levels(df$f.pickup_longitude) <- c("p.Y1","p.Y2","p.Y3","p.Y4") #pickup->p., longitude->Y
levels(df$f.pickup_latitude) <- c("p.X1","p.X2","p.X3","p.X4") #pickup->p., latitude->X
levels(df$f.distance) <- c("Dist1","Dist2", "Dist3", "Dist4")#distance 1-4, 1 shortest
levels(df$f.dropoff_longitude) <- c("d.Y1","d.Y2", "d.Y3", "d.Y4")
levels(df$f.dropoff_latitude) <- c("d.X1","d.X2", "d.X3", "d.X4")
levels(df$f.fare_amount) <- c("FAmount1","FAmount2", "FAmount3", "FAmount4")
levels(df$f.extra) <- c("smallExtra","highExtra")
levels(df$f.MTA_tax) <- c("smallMTA","highMTA")
levels(df$f.Improvement_surcharge) <- c("smallSurcharge","highSurcharge")
levels(df$f.tip_amount) <- c("smallTip","highTip")
levels(df$f.toll) <- c("smallToll","highToll")
levels(df$f.total) <- c("CheapestTrip","CheapTrip","MediumTrip","ExpensiveTrip")
levels(df$f.ttime) <- c("Time1","Time2", "Time3", "Time4")
levels(df$f.espeed) <- c("Speed1","Speed2", "Speed3","Speed4")
levels(df$f.outlierPCAd1) <- c("Normald1", "Outlierd1")
levels(df$f.outlierPCAd2) <- c("Normald2", "Outlierd2")
levels(df$f.outlierPCAd3) <- c("Normald3", "Outlierd3")
levels(df$f.outlierPCAd4) <- c("Normald4", "Outlierd4")
levels(df$f.outlierPCA) <- c("Normal", "Outlier")
```
## IV. PCA with supplementary individuals
```{r}
vars_con_pca <- names(df)[c(6:16,18,23:26)]
# We run the PCA again, adding the factor variables (fare amount, total, pickup period, etc.) as supplementary qualitative variables in order to interpret the results even better. The detected outliers are used as supplementary individuals.
res.pca<-PCA(df[,c(vars_con_pca, "f.fare_amount", "f.total", "pick_up_period", "f.passenger", "f.pickup_longitude", "f.pickup_latitude", "f.dropoff_longitude", "f.dropoff_latitude", "f.Improvement_surcharge", "f.MTA_tax")], ncp=6, quanti.sup=which(vars_con_pca=="Total_amount"), ind.sup = contr_ind, quali.sup = 17:26, graph= FALSE)
plot(res.pca,choix="var", cex = 0.75)
plot(res.pca,choix="var", cex = 0.75, axes = (3:4)) # 3rd and 4th principal components
fviz_pca_var(res.pca,col.var="cos2", repel=TRUE)+scale_color_gradient2(low="green", mid="blue", high="red",midpoint=0.5)+theme_bw()
#We can see that trips in the afternoon tend to be longer, and thus more expensive, than those at night and in the morning.
#The qualitative variables lie close to the center, so they are not very important for the principal components.
plot.PCA(res.pca, choix=c("ind"),cex=0.8,col.ind="grey80",select="contrib15",axes=c(1,2))
```
# Hierarchical clustering
We generate 6 clusters with the hierarchical method, taking the projection obtained by the PCA as the source dataset.
The resulting table shows how the individuals are distributed across the clusters (excluding the multivariate outliers). A cross-check sketch follows the chunk.
```{r}
library(FactoMineR)
res.hcpc <-HCPC(res.pca, nb.clust = 6, order=TRUE)
table (res.hcpc$data.clust$clust)
```
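As an optional cross-check (a sketch, not part of the original workflow), a similar partition can be built with base R's `hclust` on the retained PCA coordinates. HCPC also uses the Ward criterion, plus a k-means consolidation step by default, so the two partitions may differ slightly.
```{r, eval=FALSE}
# Ward clustering on the first 6 principal components (active individuals only)
d <- dist(res.pca$ind$coord[, 1:6])
hc <- hclust(d, method = "ward.D2")
table(cutree(hc, k = 6))  # compare with the HCPC cluster sizes above
```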
## Variable description
Below is the categorical description of each cluster, followed by our interpretation.
```{r}
#Block A: description by variables
res.hcpc$desc.var
```
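To inspect just the chi-square independence tests discussed next, the test table can be pulled out directly (a sketch, assuming the standard catdes structure of the HCPC output):
```{r, eval=FALSE}
# p-values of the chi-square tests linking each qualitative variable to the partition
res.hcpc$desc.var$test.chi2
```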
First, the chi-square test lets us reject the null hypothesis of independence for every qualitative variable considered. This means all of them contribute in some way to the clustering distances and splits, as expected (we took them from the axis interpretation analysis, so we already knew they were significant for the PCA projections).
Digging into each category, we can determine the following characterization:
### Category 1
It is defined by individuals whose coordinates fall in these ranges:
- pickup_latitude=(40.8,40.91]
- pickup_longitude=(-73.95,-73.92]
- dropoff_latitude=(40.8,40.91]
- dropoff_longitude=(-73.95,-73.91]
It also has a significant representation (Cla/Mod of almost 48%) of rows whose total amount falls in the (-1,7.8] range.
### Category 2
Very similar to the previous case, also defined by coordinate values, but this time the ranges are:
- f.pickup_latitude=(40.75,40.8]
- f.pickup_longitude=(-73.92,-73.79]
- f.dropoff_latitude=(40.75,40.8]
- f.dropoff_longitude=(-73.91,-73.75]
### Category 3
As with categories 1 and 2, it is characterized by the coordinates where the client was picked up and dropped off. This time the ranges are:
- f.pickup_latitude=(40.58,40.69]
- f.pickup_longitude=(-74.03,-73.96]
- f.dropoff_longitude=(-74.03,-73.97]
- f.dropoff_latitude=(40.58,40.69]
Note that the coordinate ranges defining each cluster do not overlap across clusters, which makes total sense.
Furthermore, this category also has a significant representation (almost 41%) of rows whose total amount falls in the (11,16.6] range.
### Category 4
This category is determined by the passenger variable, with a huge gap between its first two v.test values. Concretely, 100% of its individuals fall in the (1,6] range of f.passenger.
### Category 5
This category gathers all rows with MTA_tax = 0; 90% of the individuals whose improvement surcharge is 0 are also in this category.
So MTA_tax and Improvement_surcharge explain the behaviour of the category 5 rows, and 63% of the individuals in this category have a total_amount in the (16.6,46] range.
### Category 6
Category 6 is defined by rows whose total_amount is in the (16.6,46] range and whose fare_amount is in (14,42.5] (the most expensive one).
We can see that 100% of these individuals have:
- MTA_tax = 0.5
- Improvement surcharge != 0
### $quanti section
All the peculiarities noted in the cluster analysis are confirmed by the quantitative-variables output:
- For clusters 1, 2 and 3 the most significant quantitative variables are the latitudes and longitudes.
- For cluster 4, passenger_count is at the top.
- Cluster 5 has a large negative value for MTA_tax (consistent with the description above).
- In cluster 6, Total_amount, Trip_distance and Fare_amount stand out as the most correlated.
## Axes description
At this point this output does not add further detail to our clusters. But it confirms that the axis interpretation made in an earlier section matches the characterization of the clusters: the variables that best explain each dimension are also the ones distinguished in the clusters with the highest v.test values for that dimension.
```{r}
#Block B: description by axes
res.hcpc$desc.axes
```
## Individual analysis
Again, this command helps confirm the conclusions drawn so far.
As an example, we look at a paragon of C6: its total_amount sits near the middle of the last range (16.6,46] for this variable (total_amount = 26.3). A distinguished C6 row, in turn, has one of the highest possible values of total_amount (44.8).
Tracking the rest of the clusters in the same way, the conclusions above remain concordant.
```{r}
#Block C: individuals
res.hcpc$desc.ind
#paragon of C6
df["678042",]
#distinguished individual of C6
df["285458",]
```
## Assigning cluster groups
Now we assign to each row the cluster decided by the HCPC method, and we treat the outliers (which have not been taken into account until now) as group 7.
```{r}
#Give all the multidimensional outliers (supplementary individuals) the last class
df$claHP<-7
df[row.names(res.hcpc$data.clust),"claHP"]<-res.hcpc$data.clust$clust
table(df$claHP)
```
# K-means classification
We run the kmeans command with 6 clusters in order to get the same number of groups as in the hierarchical process. A quick sanity check of this choice of k is sketched after the chunk.
```{r}
ppcc<-res.pca$ind$coord[,1:6]
dim(ppcc)
kc<-kmeans(ppcc,6,iter.max = 30, trace=T)
table(kc$cluster)
```
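As a sanity check on k = 6 (a sketch, not part of the original analysis), the total within-cluster sum of squares can be plotted over a range of k and inspected for an elbow:
```{r, eval=FALSE}
set.seed(1234)  # kmeans uses random starts; fix the seed for reproducibility
wss <- sapply(2:10, function(k) kmeans(ppcc, k, iter.max = 30)$tot.withinss)
plot(2:10, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```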
## Assigning cluster groups
As with HCPC before, we assign the clusters so that group 7 is reserved for the outliers.
```{r}
df$claKM<-7
df[names(kc$cluster),"claKM"]<-kc$cluster
kc$betweenss/kc$totss
table(df$claKM)
```
## Characterization of the k-means clustering
Since we did not manage to execute the catdes command, we got a bit creative and searched the internet. We found the $centers component interesting and decided to use it to get a notion of whether the k-means clustering is similar to the HCPC one. A possible fix for catdes is sketched after the chunk below.
```{r}
kc$centers
#catdes(df,47)
#check whether the two clusterings agree
table(df$claHP,df$claKM)
```
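A likely reason catdes failed is that claKM was created as a numeric column, while catdes needs the class variable to be a factor. A minimal sketch under that assumption:
```{r, eval=FALSE}
# Convert the k-means labels to a factor before describing the partition
df$claKM.f <- factor(df$claKM)
# num.var is the column position of the class factor inside the data frame
catdes(df, num.var = which(names(df) == "claKM.f"))
```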
The output is a bit messy, but we can clearly see that cluster 1 is centred on Dimension 1 (the axis that grows with total_amount) and that cluster 4 is strongly correlated with Dimension 2.
These two clusters, as we can observe in the next section, correspond precisely to clusters 6 and 4 (in that order) of the HCPC labelling. So, even though we have not been able to interpret the categorical description, we can infer that both methods (k-means and hierarchical) actually generate very similar groups of individuals.
### Re-labeling
Finally, we relabel the k-means groups so that equal cluster numbers refer to the same group, avoiding further confusion.
To finish, we compute the sum of the diagonal of the resulting contingency table, i.e. the fraction of individuals on which the two labellings agree, to gauge how much k-means deviates from HCPC. A sketch for deriving the relabeling automatically follows the chunk.
```{r}
df$claHP<-factor(df$claHP,labels=paste("kHP-",1:7))
df$claKM<-factor(df$claKM,levels=c(3,2,5,1,4,6,7),labels=c("kKM-3","kKM-2","kKM-5","kKM-1","kKM-4","kKM-6","kKM-7"))
tt<-table(df$claHP,df$claKM)
tt
sum(diag(tt)/sum(tt))
```
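The level order c(3,2,5,1,4,6,7) above was read off the contingency table by hand. As a sketch (to be run before the factor() relabeling, and assuming each HCPC cluster has a unique best k-means match), the mapping can be derived automatically:
```{r, eval=FALSE}
raw <- table(df$claHP, df$claKM)  # contingency table with both labelings still numeric
best <- apply(raw, 1, which.max)  # k-means label best matching each HCPC cluster
best                              # compare with the hand-derived order c(3,2,5,1,4,6,7)
```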