-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.Rmd
571 lines (429 loc) · 27 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
---
title: "Supplemental Materials - A Review of Data Analytic Applications in Road Traffic Safety. Part 1: Descriptive and Predictive Modeling"
author:
- Amir Mehdizadeh, Ph.D.^[Department of Industrial and Systems Engineering, Auburn University. Email address [[email protected]](mailto:[email protected])]
- Miao Cai, Ph.D.^[Department of Epidemiology, School of Public Health, Sun Yat-sen University. [[email protected]](mailto:[email protected])]
- Qiong Hu, Ph.D.^[Department of Industrial and Systems Engineering, Auburn University. [[email protected]](mailto:[email protected])]
- Mohammad Ali Alamdar Yazdi, Ph.D.^[Carey Business School, Johns Hopkins University[[email protected]](mailto:[email protected])]
- Nasrin Mohabbati Kalejahi, Ph.D.^[Department of Industrial and Systems Engineering, Auburn University.[[email protected]]([email protected])]
- Aleksandr Vinel, Ph.D.^[Department of Industrial and Systems Engineering, Auburn University.[[email protected]](mailto:[email protected])]
- Steven E. Rigdon, Ph.D.^[Department of Epidemiology and Biostatistics, Saint Louis University.[[email protected]]([email protected])]
- Karen C. Davis, Ph.D.^[Department of Computer Science and Software Engineering, Miami University.[[email protected]]([email protected])]
- Fadel M. Megahed, Ph.D.^[Farmer School of Business, Miami University. Email address [[email protected]](mailto:[email protected]).]
date: "`r Sys.Date()`"
output:
rmdformats::readthedown:
code_folding: hide
self_contained: true
thumbnails: false
lightbox: false
number_sections: yes
fig_caption: yes
includes:
# after_body: footer.html
link-citations: yes
linkcolor: blue
bibliography: ref.bib
---
<style type="text/css">
body{ /* Normal */
font-size: 16px;
}
td { /* Table */
font-size: 12px;
}
h1.title {
font-size: 30px;
}
h1 { /* Header 1 */
font-size: 20px;
margin: 10px;
padding: 0;
}
h2 { /* Header 2 */
font-size: 16px;
margin: 7px;
padding: 0;
}
h3 { /* Header 3 */
font-size: 16px;
margin: 7px;
padding: 0;
}
h4 { /* Header 4 */
font-size: 16px;
margin: 10px;
padding: 5;
}
h5 { /* Header 5 */
font-size: 16px;
margin: 0px;
padding: 0;
}
code.r{ /* Code block */
font-size: 14px;
}
pre { /* Code block - determines code spacing between lines */
font-size: 14px;
}
a:link {color:#B61E2E; text-decoration:underline; font-size:100%}
a:visited {color:#B61E2E;text-decoration:underline;}
a:hover {font-size:150%;color:#B61E2E;text-decoration:underline;}
a:active {color:#B61E2E;}
</style>
```{js echo=FALSE}
$(document).ready(function(){
var $content = $(".content").hide();
$(".toggle").on("click", function(e){
$(this).toggleClass("expanded");
$content.slideToggle();
});
});
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE, cache = TRUE, message = FALSE,
fig.width = 8, fig.asp = 0.618, fig.align = "center")
#devtools::install_github("jayjacobs/ggcal")
pacman::p_load(
data.table, tidyverse, foreign, darksky, DT,
ClusterR, RColorBrewer, rmdformats, ggcal, gtools)
```
In this review paper, we attempt to provide a comprehensive review on transportation research and optimization models. This website serves as the supplementary materials to create reproducible examples in the review paper:
> Mehdizadeh, A.; Cai, M.; Hu, Q.; Alamdar Yazdi, M.A.; Mohabbati-Kalejahi, N.; Vinel, A.; Rigdon, S.E.; Davis, K.C.; Megahed, F.M. [A Review of Data Analytic Applications in Road Traffic Safety. Part 1: Descriptive and Predictive Modeling](https://www.mdpi.com/1424-8220/20/4/1107). Sensors 2020, 20, 1107.
This vignette includes examples on the following five aspects:
1. **Bibliographic analysis**: Bibliometric Summary for Transportation Safety, Bibliographic Network Matrices - Journal Names, keyword co-occurrences network, a Conceptual Structure Map
2. **Extracting online transportation safety data**: Crash-related data, traffic flow data, and weather data.
3. **An example of clustering**
4. **Statistical methods**: logistic regression and Poisson regression
To maximize the readability of this website, we hided all R codes by default, but readers can look into any code by clicking the `code` button. The data folder is not uploaded to the GitHub repository since the files exceed the limited size, but all the data are open access and interested readers can download all the files at the link given in each section.
Bibliographic analysis{.tabset .tabset-fade}
======================
***
To perform a quick bibliometric analysis on [Clarivate Analytics Web of Science (WoS)](https://clarivate.com/products/web-of-science/), we searched transportation safety publications in WoS Core Collection using the following combination of words without limiting the document type, years, language:
```
("hazmat transportation" OR "real-time crash prediction" OR ("vehicle routing" AND safety))
```
This resulted in a downloadable plain .txt file (we named it 'savedrecs.text' under data folder), containing the full records with cited references for 992 results (conducted at 7/30/2018 - 11:08 am ET). We then used the R package [`bibliometrix`](http://www.bibliometrix.org/) to conduct a bibliometric analysis on this topic.
## keyword co-occurrences
```{r warning=FALSE, eval=FALSE}
pacman::p_load(bibliometrix,tidyverse,data.table,rvest)
df.trans.safety <- readFiles("savedrecs.txt")
M <-convert2df(df.trans.safety)
NetMatrix <- biblioNetwork(
M, analysis = "co-occurrences", network = "keywords", sep = ";")
net=networkPlot(
NetMatrix, normalize="salton", weighted=NULL, n = 60,
Title = "Keyword Co-Occurrences", type = "auto",
size=15, size.cex=T, remove.multiple=TRUE,labelsize=0.75,
label.n=60,label.cex=F, cluster="optimal",edges.min = 5,
label.color = TRUE, halo = TRUE)
netstat <- networkStat(NetMatrix)
out <- capture.output(summary(netstat, k=60))
```
<center>
![A keyword co-occurrence network of the literature, depicting the 60 most used keywords.](Figures/keywords-co-occ-60words.png){width=500px}
</center>
## A conceptual structure map
Creating a Conceptual Structure Map from the Titles Using the MCA Method with terms mentioned at least 25 times in the title
```{r eval=FALSE}
CS <- conceptualStructure(
M,field="ID_TM", method="MCA", minDegree=20,
k.max=5, labelsize=15, documents = 856)
```
<center>
![A data-driven conceptual structure map based on Keywords Plus](Figures/cs.png){width=500px}
</center>
Extracting online transportation safety data
============================================
***
This section provides sources for online open-access transportation data, including both historic and real-time crash-related, traffic flow, and weathaer data. We also provided R codes to read different formats of data and convert them to compatible comma separated value (.csv) files.
## Crash-related data{.tabset .tabset-fade}
### Historical data (yearly)
Historical crash-related variables such as location, time, weather, road surface conditions, vehicles/ pedestrians involved in a crash, crash outcomes and/or root-causes can be extracted from these two databases.
- Federal Motor Carrier Safety Administration (**FMCSA**) | all truck crashes
- National Highway Traffic Safety Administration (**NHTSA**) | all fatal crashes
#### Example data{- .tabset .tabset-fade}
##### FMCSA 1{-}
The Motor Carrier Crash Information can be downloaded on Federal Motor Carrier Safety Administration ([FMCSA](https://ai.fmcsa.dot.gov/SMS/Tools/Downloads.aspx)). This database contains information about the location, vehicle and severity of truck crashes (description about each feature can be found [here](data/Crash_Readme.txt)). We downloaded the FMCSA Motor Carrier Crash data on January 15, 2019 and they were collected between December 2016 and December 2018. The Following table and graph show the truck crashes information in Missouri, 2017.
```{r fig.width=6,fig.cap='The number of motor carrier crashes by year and month in Missouri 2017',fig.align='center'}
fmcsa = data.table::fread("data/Crash_2018Dec/2018Dec_crash.txt")
fmcsa[, REPORT_DATE := lubridate::dmy(REPORT_DATE)]
fmcsa[, ymonth:= paste0(substr(REPORT_DATE, 1, 4), substr(REPORT_DATE, 6, 7))]
fmcsamo = fmcsa[REPORT_STATE == "MO"& year(REPORT_DATE) == 2017]
fmcsamo %>%
ggplot(aes(x = ymonth)) + geom_bar(fill = "lightblue") +
xlab("Year and month") + ylab("The number of motor carrier crashes") +
theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
##### FMCSA 2{-}
```{r}
datatable(
head(fmcsamo),
caption = 'All truck crashes in Missouri, 2017',
class = 'cell-border stripe',
extensions = c('Buttons', 'ColReorder'),
options = list(
dom = 'Bfrtip', colReorder = TRUE, scrollX = TRUE,
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
)) %>%
formatStyle( 0, target= 'row', lineHeight='50%')
```
For the mentioned time period, there were `r dim(fmcsa)[1]` observations and `r dim(fmcsa)[2]` variables in the data. Further, the original format of this database is text (.txt). For more convenience, one can convert the format of this database to comma-separated values(.csv).
```{r}
data.table::fwrite(fmcsa, "data/fmcsa.csv")
```
##### NHTSA 1{-}
All fatal crashes from 1975 to 2017 (FARS database) can be downloaded at National Highway Traffic Safety Administration (NHTSA) [official site](ftp://ftp.nhtsa.dot.gov/fars/). FARS dataset provides information about fatal crashes in three aspects: crashes, vehicles and people. The names of all available .csv files are listed below.
```{r}
csvpath = './data/FARS2017NationalCSV/'
csvname = list.files(path = csvpath, pattern = "*.csv")
csvpathname = paste0(csvpath, csvname)
nhtsa = purrr::map(csvpathname, fread) %>%
set_names(gsub(".csv", "", csvname))
csvname
```
##### NHTSA 2{-}
There is an exclusive database from 2000 to 2010, for trucks involved in fatal crashes (TIFA) which is built on FARS database and can be downloaded [here](http://www.cmisst.org/tools-resources/database-resources/dataset-collection/). FARS has provided data in csv format However, TIFA has published data in sas7bdat format. For more convienece, the provided code converts TIFA database to csv format files.
```{r}
tifa2010 = haven::read_sas("data/TIFA_2010/TIFA_2010.sas7bdat")
data.table::fwrite(tifa2010, "data/TIFA_2010/tifa2010.csv")
datatable(
head(tifa2010),
caption = 'Fatal truck crashes provided by the TIFA, 2010',
class = 'cell-border stripe',
extensions = c('Buttons', 'ColReorder'),
options = list(
scrollX = TRUE, dom = 'Bfrtip', colReorder = TRUE,
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
)) %>%
formatStyle( 0, target= 'row', lineHeight='50%')
```
### Real-time data (<= 1 hour)
Most of the states in the USA have activated 511 system. Basically, this system provides real-time travel information such as traffic, road and weather conditions. Currently this system provides accident/incident cases in a visual format. The following figure displays a screenshot of [Pennsylvania 511](https://www.511pa.com/website).
[![Pennsylvania 511 system, providing real-time information about accidents/incidents.](Figures/511pa.png)](https://www.511pa.com/website)
## Traffic flow data{.tabset .tabset-fade}
### Historical data (yearly){-}
[FHWA](https://www.fhwa.dot.gov/policyinformation/hpms/shapefiles.cfm) has provided Annual Average Daily Traffic (AADT) from 2011 to 2017. As an illustration, the following code chunk displays the first five observations of AADT data for Missouri 2017.
```{r FHWA}
# Using Missouri as an example
fhwai = foreign::read.dbf("data/missouri2017/Missouri2017.dbf")
datatable(
head(fhwai),
caption = 'Historical traffic data in Missouri, 2017',
class = 'cell-border stripe',
extensions = c('Buttons', 'ColReorder'),
options = list(
dom = 'Bfrtip', colReorder = TRUE, scrollX = TRUE,
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
))
```
The downloaded “shape files” can be converted to different data formats (e.g., .csv) using the following R code.
```{r}
data.table::fwrite(fhwai, "data/missouri2017/fhwai.csv")
```
### Real-time data (<= 5 minutes){-}
There are several sources for getting real-time traffic data. Some of the states in the USA are equipped with loop detectors and video cameras. Departments of Transportation (DoT) can provide this data.
Further, [HERE](https://www.here.com/) website also provides near real-time traffic data with the limitation of 250,000 APIs per month for free. The [HERE Traffic API ](https://developer.here.com/documentation/traffic/topics/what-is.html) provides traffic flow and incidents information. It also allows the users to request traffic map tiles. The HERE Traffic API provides four types of traffic data:
- Traffic Incident Data: the type and location of each traffic incident, status, start and end time, and other relevant data
- Traffic Map Tile Overlays (Traffic Tiles): pre-rendered map tile overlays with traffic information
- Traffic Flow Data: real-time traffic flow data, including speed, congestion, geometry of the road segments
- Traffic Flow Availability: traffic flow information, excluding incidents in an area
The HERE API do not provide a package/interface in R. Queries data from HERE API in R may need messy and highly customized code, so we do not provide R code here.
## Weather data {.tabset .tabset-fade}
### Real-time (minutes){.tabset .tabset-fade}
#### The DarkSky API {-}
In this part, we attempt show getting both historical and real-time weather data using the [DarkSky API](https://darksky.net/dev/docs/libraries). It can be used in both [Python](https://github.com/bitpixdigital/forecastiopy3) and [R](https://github.com/hrbrmstr/darksky). Before using the DarkSky API to get weather data, you need to register for a API key on [its official website](https://darksky.net/dev/register). The first 1000 API requests you make each day are free, but each API request over the 1000 daily limit will cost you $0.0001, which means a million extra API requests will cost you 100 USD.
To get weather data from the DarkSky API, you need to provide 1) latitude, 2) longitude, 3) date and time. Then you can pass these three parameters to the `get_forecast_for()` function in [`darksky`](https://github.com/hrbrmstr/darksky) package in R. For each observation (a combination of latitude, longitude, date and time), the darksky API returns a list of 3 data.frames:
1. hourly weather. 24 hourly observations for each 15 weather variables in that day.
2. daily weathe. 1 observations for each 34 weather variables in that day.
3. current weather. 1 observations for each 15 weather variables at the assigned time point.
The variables include: apparent (feels-like) temperature, atmospheric pressure, dew point, humidity, liquid precipitation rate, moon phase, nearest storm distance, nearest storm direction, ozone, precipitation type, snowfall, sun rise/set, temperature, text summaries, uv index, wind gust, wind speed, wind direction
#### Sample data{-}
```{r}
source("private/DarkSkyAPIkey.R")
Sys.setenv(DARKSKY_API_KEY = myDarkSkyAPIkey) # you need to use your own "myDarkSkyAPIkey"
dat = structure(list(
latitude = c(41.3473127, 41.8189037, 32.8258477, 40.6776808, 40.2366043),
longitude = c(-74.2850908, -73.0835104, -97.0306677, -75.1450753, -76.9367494),
time = structure(c(1453101738, 1437508088, 1436195038, 1435243088, 1454270680),
class = c("POSIXct", "POSIXt"), tzone = "UTC")),
row.names = c(NA, -5L), class = "data.frame"
)
weather_dat <- pmap(
list(dat$latitude, dat$longitude, dat$time),
get_forecast_for)
datatable(
dat,
caption = 'Sample data to demonstrate the DarkSky API',
class = 'cell-border stripe',
extensions = c('Buttons', 'ColReorder'),
options = list(
dom = 'Bfrtip', colReorder = TRUE, scrollX = TRUE,
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
))%>%
formatStyle( 0, target= 'row', lineHeight='80%')
```
#### currently{-}
```{r}
datatable(
head(weather_dat[[1]]$currently),
caption = 'Current weather provided by the DarkSky API',
class = 'cell-border stripe',
extensions = c('Buttons', 'ColReorder'),
options = list(
dom = 'Bfrtip', colReorder = TRUE, scrollX = TRUE,
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
))%>%
formatStyle( 0, target= 'row', lineHeight='80%')
```
#### hourly{-}
```{r}
datatable(
head(weather_dat[[1]]$hourly),
caption = 'Hourly weather provided by the DarkSky API',
class = 'cell-border stripe',
extensions = c('Buttons', 'ColReorder'),
options = list(
dom = 'Bfrtip', colReorder = TRUE, scrollX = TRUE,
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
))%>%
formatStyle( 0, target= 'row', lineHeight='80%')
```
#### daily{-}
```{r}
datatable(
head(weather_dat[[1]]$daily),
caption = 'Daily weather provided by the DarkSky API',
class = 'cell-border stripe',
extensions = c('Buttons', 'ColReorder'),
options = list(
dom = 'Bfrtip', colReorder = TRUE, scrollX = TRUE,
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
))%>%
formatStyle( 0, target= 'row', lineHeight='80%')
```
### Historical (daily)
[NOAA](https://www.ncdc.noaa.gov/)
A clustering example {.tabset .tabset-fade}
====================
***
The following codes attempts to replicate the visual clustering approach by @van1999cluster
> Van Wijk, Jarke J., and Edward R. Van Selow. 1999. "Cluster and Calendar Based Visualization of Time Series Data." In Information Visualization, 1999.(Info Vis' 99) Proceedings. 1999 IEEE Symposium on, 4-9. IEEE.
A brief example of applying EDA methods on traffic data is provided here. The goal of this example is to illustrate the efficiency of the mentioned tools in the transportation context. There is no predetermined way to utilize these methods. The efficiency of each method highly depends on the nature of the problem. Hence, the challenge is to choose the right tool which fits the best.
## Collecting Data
Hourly vehicle counts data is used in this example. It provides the number of vehicles which passed along a particular segment of a road in one hour. Data is extracted from the Georgia Department of Transportation (GDoT) (Georgia Department of Transportation, 2015) for 2015 from station 121-5505 which located in Atlanta. GDoT provides data in separate sheets for each month. After extracting and cleaning data, it was combined to one sheet with 365 rows (days) and 24 columns (hours). Data can be downloaded from [GDoT](http://geocounts.com/gdot/).
## Clustering
It is almost impossible to understand raw data and also discover interesting patterns in it by just looking at 8760 (370 * 24) data cells. Hence, K-means clustering method is utilized here to present data in a more understandable format. K-means clustering is a common technique to explore data and discover patterns by grouping similar data to predefined (k) number of clusters. K-means clustering aims to group data into k clusters in a way to minimize the within-cluster sum of squares (WCSS). To find the optimal number of clusters, we have used a method that was suggested by @pham2005selection. According to the following graph two is the best number of clusters to group this data.
```{r}
trafficflow.df = "data/georgia-TFdata-station-121-5505-Yr2015.csv" %>%
readr::read_csv() %>%
mutate(Date = lubridate::dmy(paste0(Date, '-2015')))
opt = Optimal_Clusters_KMeans(
as.data.frame(trafficflow.df[,4:27]), max_clusters = 10,
plot_clusters = T, criterion = 'distortion_fK', fK_threshold = 0.85,
initializer = 'optimal_init', tol_optimal_init = 0.2,
max_iters = 10000)
num_clusters <- which.min(opt) # Based on the results, we should use k=2 clusters in kmeans
km = KMeans_arma(
as.data.frame(trafficflow.df[,4:27]), clusters = num_clusters,
n_iter = 10000, seed_mode = "random_subset", verbose = F, CENTROIDS = NULL)
pr = predict_KMeans(data.frame(trafficflow.df[,4:27]), km)
trafficflow.df$cluster.num <- as.vector(pr) %>% as.factor()
table(trafficflow.df$cluster.num)
```
## Visualization
Now, k-means clustering can be applied. The output of this step is a column which its value is either one or two,indicating that each row of data (day) belongs to cluster one or two. Now data is divided into two groups. However, still we need to transfer data to a visual format to somehow validate and guide the clustering process. Since our data contains temporal information, we have used Cluster Calendar View visualization technique which is introduced by @van1999cluster. In this technique, a calendar represents the temporal information of data and by using color coding, differences between clusters are distinguished. The following graph shows a cluster calendar view for our data. It clearly has found meaningful patterns in the vehicle counts data. Weekends and weekdays have different traffic patterns. Besides, it has captured some of the holidays. For example, the 4th of July (Independence Day) which is a weekday, is colored by light blue. It means that this day has a similar traffic pattern with weekends. In addition, the clustering method has identified other holidays like Martin Luther King Day, Memorial Day, Labor Day, Thanksgiving Day and Christmas Day.
```{r}
Sys.setlocale(locale="English")
col.brewer.pal <- brewer.pal(11, "Paired")
p2 <- ggcal(trafficflow.df$Date, trafficflow.df$cluster.num) +
theme(legend.position="top") +
scale_fill_manual(values = c("1"=col.brewer.pal[1], "2"=col.brewer.pal[2]))
p2
```
Furthermore, a line chart (following graph) is used to show the average hourly traffic data for the two clusters. Results show that each cluster has different peaks and valleys. On the weekdays, 7 AM and 4 PM have the greatest number of vehicles which can be explained by the official working hours. On the other hand, on weekends, the traffic peak is around 1 PM which maybe refers to some people going out for lunch.
```{r}
summary.df <- group_by(trafficflow.df,cluster.num)
summary.df <- summarise_all(summary.df,funs(mean))
plot.df <- subset(summary.df, select = -c(2:4))
plot.df <- melt(plot.df, value.name="Traffic.Flow",
variable.name="Hour",id.vars="cluster.num")
plot.df$cluster.num <- as.factor(plot.df$cluster.num)
p1 <- plot.df %>%
ggplot(aes(x = Hour, y = Traffic.Flow, group=cluster.num, color=cluster.num)) +
geom_line(size=2) + theme_bw() +
theme(legend.position="top", axis.text.x=element_text(angle=90, hjust=1)) +
scale_color_brewer("Paired")
p1
```
To sum up, it seems that K-means clustering method was very efficient here. We applied raw data as inputs to this method and as outputs we could discover patterns (weekdays and weekends traffic patterns) and also with the help of visualization technique we obtained a considerable information about the data.
Statistical models{.tabset .tabset-fade}
==================
***
Logistic regression and Poisson regression are two most commonly used statistical models in crash risk prediction studies. In this section, we provides introductory statistical theories on the two models, and
## Logistic regression {.tabset .tabset-fade}
### Theory {-}
The most common model for crash risk prediction studies is logistic regression, where the response is $Y_i=1$ if a crash occurred in a given segment/time period, and $Y_i=0$ if no crash occurred. In logistic regression we assume that the logit of the probability of a traffic crash is a combination of a linear predictor variables:
\[
Y_i \sim \text{Bernoulli}(p_i)\\
\text{logit}(p_i) = \log \left( \frac{p_i}{1-p_i}\right)
= \textbf{X}^{\prime} \boldsymbol{\beta}
\]
Here
\[
\textbf{X}^{\prime} \boldsymbol{\beta}
= \beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j
\]
where the $x_{ij}$ are covariates. A simple linear regression can be conducted using the following R code.
### R code{-}
```{r }
# simulate data
set.seed(1853)
n = 10000
fatigue = rnorm(n, 3, 1)
precipitation = rbeta(n, 1, 5)
traffic = rgamma(n, 5, 1)
p = gtools::inv.logit(-5 + 0.5*fatigue + 0.2*traffic + 0.5*precipitation)
Y = rbernoulli(n, p)
logit_dat = data.frame(Y, fatigue, traffic, precipitation)
# logistic regression
logit_fit <- glm(
Y ~ fatigue + traffic + precipitation,
family = "binomial", data = logit_dat)
summary(logit_fit)
```
For a detailed explanation on interpreting the results of a logistic regression, the readers can refer to [the example by the UCLA](https://stats.idre.ucla.edu/r/dae/logit-regression/).
## Poisson regression {.tabset .tabset-fade}
### Theory{-}
Another widely used model for crash risk prediction studies is Poisson regression, where the response $Y_i$ is the number of crashes.
\[
Y_i \sim \text{Poisson}(T*\lambda)\\
\log\lambda = \textbf{X}^{\prime} \boldsymbol{\beta}
\]
Here $T$ is the length/time of a trip. It is an exposure variable that indicates the number of crashes could have occurred.
### R code{-}
```{r}
# simulate data
set.seed(123)
n = 500
fatigue = rnorm(n, 3, 1)
precipitation = rbeta(n, 1, 5)
traffic = rgamma(n, 5, 1)
trip = rnorm(n, 6, 1)
explambda = exp(-5 + 0.5*fatigue + 0.2*traffic + 0.5*precipitation)
Y = rpois(n, lambda = explambda*trip)
pois_dat = data.frame(Y, trip, fatigue, traffic, precipitation)
# Poisson regression
poisson_fit <- glm(
Y ~ fatigue + traffic + precipitation,
offset = trip, family = "poisson", data = pois_dat)
summary(poisson_fit)
```
For a detailed explanation on interpreting the results of a Poisson regression, the readers can refer to [the example by the UCLA](https://stats.idre.ucla.edu/r/dae/poisson-regression/).
# Optimization
Since optimization code is mostly Python-based, we do not provide code on the optimization section here. Should the readers have interest in this part, they can contact our optimization collaborators or refer to [the supplementary materials for the part 2 of our review](https://caimiao0714.github.io/optimization_stats_case_study/). The part 2 of the review (optimization section) is:
> Hu, Q.; Cai, M.; Mohabbati-Kalejahi, N.; Mehdizadeh, A.; Alamdar Yazdi, M.A.; Vinel, A.; Rigdon, S.E.; Davis, K.C.; Megahed, F.M. [A Review of Data Analytic Applications in Road Traffic Safety. Part 2: Prescriptive Modeling](https://www.mdpi.com/1424-8220/20/4/1096). Sensors 2020, 20, 1096
# Acknowledgement{-}
This work was supported in part by the National Science Foundation (CMMI-1635927 and CMMI-1634992), the Ohio Supercomputer Center (PMIU0138 and PMIU0162), the American Society of Safety Professionals (ASSP) Foundation, the University of Cincinnati Education and Research Center Pilot Research Project Training Program, and the Transportation Informatics Tier I University Transportation Center (TransInfo). We also thank the DarkSky API for providing us five million free calls to their weather database.
# References {#references .unnumbered}