-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathLab25Maig2018_StatModellingY_GLMz.Rmd
159 lines (118 loc) · 5.19 KB
/
Lab25Maig2018_StatModellingY_GLMz.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
---
title: "Session 9: NY Cabs Data - AnyTip Modelling - GLMz"
author: "Lidia Montero"
date: "May,25th 2018"
output:
html_document:
toc: true
toc_depth: 3
number_sections: true
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
# Introduction
## Data Description
Description http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Data Dictionary - SHL Trip Records -This data dictionary describes SHL trip data in visit http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml:
- VendorID A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
- lpep_pickup_datetime The date and time when the meter was engaged.
- lpep_dropoff_datetime The date and time when the meter was disengaged.
- Passenger_count The number of passengers in the vehicle. This is a driver-entered value.
- Trip_distance The elapsed trip distance in miles reported by the taximeter.
- Pickup_longitude Longitude where the meter was engaged.
- Pickup_latitude Latitude where the meter was engaged.
- RateCodeID The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride
- Store_and_fwd_flag This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server: Y= store and forward trip N= not a store and forward trip
- Dropoff_longitude Longitude where the meter was timed off.
- Dropoff_ latitude Latitude where the meter was timed off.
- Payment_type A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip
- Fare_amount The time-and-distance fare calculated by the meter.
- Extra Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
- MTA_tax $0.50 MTA tax that is automatically triggered dfd on the metered rate in use. - Improvement_surcharge $0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015.
- Tip_amount This field is automatically populated for credit card tips. Cash tips are not included.
- Tolls_amount Total amount of all tolls paid in trip.
- Total_amount The total amount charged to passengers. Does not include cash tips.
- Trip_type A code indicating whether the trip was a street-hail or a dispatch that is automatically assigned dfd on the metered rate in use but can be altered by the driver.
1= Street-hail 2= Dispatch
## Load Required Packages
```{r, echo=FALSE, include=FALSE}
rm(list=ls())
# Load Required Packages: to be increased over the course
requiredPackages <- c("effects","FactoMineR","car", "factoextra","RColorBrewer","ggplot2","missMDA","mvoutlier","dplyr","ggmap","ggthemes","knitr","MVA")
missingPackages <- requiredPackages[!(requiredPackages %in% installed.packages()[,"Package"])]
if(length(missingPackages)) install.packages(missingPackages)
lapply(requiredPackages, require, character.only = TRUE)
```
# Statistical Modelling
## Load your sample after data cleaning and validation (Deliverable 1)
```{r}
#setwd("E:/DOCENCIA/FIB-ADEI/PRACTICA/NYCABS/LABS") # Choose your own directory
#file_path <- "E:/DOCENCIA/FIB-ADEI/PRACTICA/NYCABS/LABS/"
#load(paste0(file_path,"MyTaxi5000_Modeling.RData"))
names(df)
summary(df)
vars_con
vars_dis
vars_res
m1<-glm(AnyTip~poly(trip_distance_km,2),data=df,family=binomial)
maux<-glm(AnyTip~Total_amount+Dropoff_longitude+,data=df,family=binomial)
m1b<-glm(AnyTip~trip_distance_km+I(trip_distance_km^2),data=df,family=binomial)
summary(m1b)
Anova(m1,test="Wald")
marginalModelPlot(m1)
# Quines?
names(df)
catdes(df,num.var=22)
# Feature Selection
table(df$AnyTip,df$Payment_type)
```
# Generalized Linear Regression issues
*Target binary factor AnyTip*
```{r}
# Split your sample: work and test
ll<-which(df$Payment_type=="Cash");length(ll)
df<-df[-ll,]
set.seed(1234)
llwork<-sample(1:nrow(df),0.70*nrow(df),replace=FALSE)
llwork<-sort(llwork);length(llwork)
dfwork<-df[llwork,]
dftest<-df[-llwork,]
table(df$Extra)
table(df$MTA_tax)
df$f.extra<-0
df$f.extra[df$Extra>0]<-1
df$f.extra[df$Extra>0.5]<-2
df$f.extra<-factor(df$f.extra,labels=c("Extra-No","Extra-0.5","Extra-1"))
df$f.tax<-ifelse(df$MTA_tax>0,1,0)
df$f.tax<-factor(df$f.tax,labels=c("TAX-No","TAX-Yes"))
vars_con
catdes(df[,c(vars_con,"AnyTip")],num.var=18)
names(df)
vars_cexp<-vars_con[c(1,3,9,12,8)]
#vars_cexp<-df[c(18,8,6,24)]
```
# Modelling AnyTip using numeric explanatory variables
```{r}
m50<-glm(AnyTip~.,family="binomial",data=dfwork[,c("AnyTip",vars_cexp)])
summary(m50)
Anova(m50,test="Wald")
vif(m50)
vars_con
vars_cexp<-vars_con[c(1,3,9,12,8,15:17)]
m50<-glm(AnyTip~.,family="binomial",data=dffwork[,c("AnyTip",vars_cexp)])
summary(m50)
Anova(m50,test="Wald")
vif(m50)
#Después de molta feina... m73
m73<-m51
```
#Diagnostics
```{r}
#Boxplot dels residus
Boxplot(rstudent(m73), id.n=15)
llout<-which(abs(rstudent(m73))>2);length(llout)
dfwork[llout,]
```