---
title: "Process_Analysis"
output:
pdf_document: default
html_notebook: default
---
This R Notebook presents a process analysis of 746 completed, fully validated, and archived projects in the HOT Tasking Manager (HOT-TM) over the past two years. After an initial project profiling, the process is analysed from four perspectives:
0) [Project profiling](#projects)
1) [Control flow](#controlflow)
2) [Time](#time)
3) [Organisation](#organisation)
4) [Outcome](#outcome)
Process discovery was performed using bupaR, a suite of open-source R packages for business process data analysis.
```{r}
# Import required libraries
suppressWarnings({
#install.packages("reshape2")
#install.packages("gt")
library(bupaverse)  # bupaR suite: event log objects, metrics, process maps
library(lubridate)  # ymd_hms() for timestamp parsing
library(reshape2)   # melt()/dcast() for pivoting
library(gt)         # formatted tables
library(scales)     # col_numeric() colour scale
library(readr)
library(dplyr)
library(magrittr)
library(ggplot2)
library(Hmisc)      # describe()
library(gamlss)     # GAMLSS regression (BEZI family)
})
```
# Read event data
The log containing only initial tasks ("initial_tasks.csv") is used in the 1) Control flow, 2) Time, and 3) Organisation sections, where a clean view of the process is required.
```{r}
# Read the raw event data and build a bupaR activity log keyed on tasks
event_log_df <- read.csv("initial_tasks.csv", stringsAsFactors = FALSE, sep = ",")
event_log_df <- event_log_df %>%
convert_timestamps(columns = c("start", "complete"), format = ymd_hms) %>%
activitylog(case_id = "taskId", activity_id = "action", resource_id = "actionBy", timestamps = c("start", "complete"))
```
```{r}
head(event_log_df)
```
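As a quick sanity check (a sketch, not part of the original analysis), bupaR's standard accessors report the basic dimensions of the log before any filtering:
```{r}
# Basic dimensions of the activity log (sketch using bupaR accessors)
n_cases(event_log_df)       # number of tasks (cases)
n_activities(event_log_df)  # number of distinct activity types
n_resources(event_log_df)   # number of distinct contributors
```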
# 0. Project profiling {#projects}
Read the list of projects to be analysed
```{r}
projects <- read.csv("projects.csv", stringsAsFactors = FALSE, sep = ",")
head(projects)
```
Frequency of projects according to their level of difficulty
```{r}
difficulty <- projects %>% count(difficulty) %>% mutate(percentage = n/sum(n)*100)
difficulty[order(difficulty$n, decreasing = TRUE),]
```
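As a visual complement (a sketch, not part of the original analysis), the same distribution can be drawn with the already-loaded ggplot2:
```{r}
# Bar chart of project counts per difficulty level (sketch)
ggplot(projects, aes(x = difficulty)) +
  geom_bar() +
  theme_classic()
```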
Frequency of projects according to their regional hub
```{r}
hub <- projects %>% count(region) %>% mutate(percentage = n/sum(n)*100)
hub[order(hub$n, decreasing = TRUE),]
```
# 1. Control flow {#controlflow}
Absolute frequency of activities in the event log.
```{r}
event_log_df %>% activity_frequency("activity")
```
Activity presence shows in what percentage of cases an activity is present.
```{r}
event_log_df %>% activity_presence()
```
The start of cases can be described using the start_activities function.
```{r}
event_log_df %>% start_activities("activity")
```
The end_activities function describes the end of cases.
```{r}
event_log_df %>% end_activities("activity")
```
In the frequency process map, nodes represent the absolute number of activity instance executions and edges represent the absolute number of times the source and target activities were executed directly following each other. To keep the process map legible, the event log was first filtered using filter_trace_frequency(). Setting percentage = 0.95 selects at least 95% of the cases, starting with the traces that have the highest frequency.
```{r}
tmp <- event_log_df %>% filter_trace_frequency(percentage = 0.95)
tmp %>% process_map(frequency("absolute"))
```
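To quantify how much of the log the filter retains (a sketch, not part of the original analysis), the case counts before and after filtering can be compared:
```{r}
# Share of cases kept by the 95% trace-frequency filter (sketch)
n_cases(tmp) / n_cases(event_log_df)
```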
trace_explorer() with the argument n_traces = 3 shows the three most frequent traces in the event log.
```{r}
event_log_df %>% trace_explorer(n_traces = 3, show_labels = FALSE, coverage_labels = c("relative"))
```
# 2. Time {#time}
In the temporal process map, node values represent the median duration of activities in days, and edge values represent the median waiting times between them.
```{r}
tmp %>% process_map(performance(median, "days"))
```
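Beyond the map, edeaR (attached with bupaverse) can summarise throughput time over the whole log directly; a minimal sketch, not part of the original analysis:
```{r}
# Log-level throughput time in days (sketch using edeaR's standard metric)
event_log_df %>% throughput_time(level = "log", units = "days")
```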
Timestamps are converted for time calculations; explicit units ensure all time differences are expressed in seconds.
```{r}
event_log_df <- event_log_df %>% convert_timestamps(columns = c("start", "complete"), format = ymd_hms)
# Use explicit units so that all time differences are expressed in seconds
event_log_df$time_diff <- as.numeric(difftime(event_log_df$complete, event_log_df$start, units = "secs"))
task_duration <- event_log_df %>% group_by(taskId) %>% summarise(min = min(start), max = max(complete))
task_duration$duration <- as.numeric(difftime(task_duration$max, task_duration$min, units = "secs"))
task_duration <- task_duration[, c("taskId", "duration")]
```
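With all differences now in seconds, a quick distribution check (a sketch, not part of the original analysis) converts total task durations to days:
```{r}
# Distribution of total task durations, converted from seconds to days (sketch)
summary(task_duration$duration / 86400)
```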
We calculate the relative time devoted to mapping activities ('LOCKED_FOR_MAPPING', 'AUTO_UNLOCKED_FOR_MAPPING') and validation activities ('LOCKED_FOR_VALIDATION', 'AUTO_UNLOCKED_FOR_VALIDATION') per case, expressed as a percentage of total case duration. The remaining time is considered idle.
```{r}
mapping_duration <- filter(event_log_df, action == 'LOCKED_FOR_MAPPING' | action == 'AUTO_UNLOCKED_FOR_MAPPING') %>% group_by(taskId) %>% summarise(mapping = sum(time_diff))
validation_duration <- filter(event_log_df, action == 'LOCKED_FOR_VALIDATION' | action == 'AUTO_UNLOCKED_FOR_VALIDATION') %>% group_by(taskId) %>% summarise(validation = sum(time_diff))
durations <- merge(x = merge(x = task_duration, y = mapping_duration, by = "taskId", all.x = TRUE), y = validation_duration, by = "taskId", all.x = TRUE)
durations[is.na(durations)] <- 0
durations$mapping_per <- durations$mapping / durations$duration * 100
durations$validation_per <- durations$validation / durations$duration * 100
durations$service <- durations$mapping_per + durations$validation_per
durations$idle_per <- 100 - durations$mapping_per - durations$validation_per
```
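The per-case split between mapping, validation, and idle time can also be inspected as a distribution; a sketch (not part of the original analysis) using the already-loaded reshape2 and ggplot2:
```{r}
# Reshape the per-case percentages to long format and plot them (sketch)
durations_long <- melt(durations[, c("taskId", "mapping_per", "validation_per", "idle_per")],
                       id.vars = "taskId", variable.name = "component", value.name = "percentage")
ggplot(durations_long, aes(x = component, y = percentage)) +
  geom_boxplot() +
  theme_classic()
```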
Median % of idle time per case
```{r}
median(durations$idle_per, na.rm = TRUE)
```
Median % of mapping and validation time per case
```{r}
median(durations$service, na.rm = TRUE)
```
Median % of mapping time per case
```{r}
median(durations$mapping_per, na.rm = TRUE)
```
Median % of validation time per case
```{r}
median(durations$validation_per, na.rm = TRUE)
```
# 3. Organisation {#organisation}
The contributor profile ("contributors.csv") is added to the event log to calculate the relative frequency with which contributors of each mapping level execute each type of activity.
```{r}
contributors <- read.csv("contributors.csv", stringsAsFactors = FALSE, sep = ",")
event_log_df <- merge(event_log_df, contributors, by.x = "actionBy", by.y = "username")
head(event_log_df)
```
Composition of the contributors to the analysed projects according to their mapping level.
```{r}
mappingLevel <- event_log_df %>% group_by(mappingLevel) %>% summarise(count = n_distinct(actionBy))
mappingLevel$percentage <- round(mappingLevel$count/sum(mappingLevel$count)*100,1)
mappingLevel
```
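Because the merged log is a plain data frame at this point, contributor activity can also be summarised directly with dplyr; a sketch, not part of the original analysis:
```{r}
# Top contributors by number of recorded activity instances (sketch)
event_log_df %>% count(actionBy, mappingLevel, sort = TRUE) %>% head(10)
```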
Breakdown of status execution frequency per mapping level.
```{r}
data_pivot <- dcast(event_log_df, action ~ mappingLevel, value.var = "taskId", fun.aggregate = length)
data_pivot$sum <- data_pivot$ADVANCED + data_pivot$BEGINNER + data_pivot$INTERMEDIATE
data_pivot$ADVANCEDper <- round(data_pivot$ADVANCED/data_pivot$sum*100,1)
data_pivot$BEGINNERper <- round(data_pivot$BEGINNER/data_pivot$sum*100,1)
data_pivot$INTERMEDIATEper <- round(data_pivot$INTERMEDIATE/data_pivot$sum*100,1)
data_pivot[c("action","ADVANCEDper","BEGINNERper","INTERMEDIATEper")] %>% gt() %>% data_color(columns = 2:4, colors = col_numeric(palette = c("white","darkgreen"),domain = c(0,100)))
```
# 4. Outcome {#outcome}
The regression.csv file contains the full list of tasks, with columns describing either the absolute frequency of executions of each activity type ("activityname") or a binary flag (1/0) indicating whether or not the activity was executed on that task ("activityname1").
```{r}
regression <- read.csv("regression.csv", stringsAsFactors = FALSE, sep = ",")
regression$projId <- as.character(regression$projId)
regression$percentage_area_covered_by_building <- regression$percentage_area_covered_by_building / 100
# A few records have insignificant negative values attributed to rounding; clamp them to 0
regression$percentage_area_covered_by_building[regression$percentage_area_covered_by_building < 0] <- 0
regression <- na.omit(regression)
```
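To make the column convention concrete, a quick check (not part of the original analysis) confirms that the flag columns are binary and the frequency columns are counts:
```{r}
# Sanity-check a binary flag and a frequency column used in the model below (sketch)
table(regression$splits1)
summary(regression$locked_for_mappings)
```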
Regression coefficients from the [GAMLSS-BEZI](https://www.rdocumentation.org/packages/gamlss.dist/versions/6.1-1/topics/BEZI) model
```{r}
model <- gamlss(percentage_area_covered_by_building ~ splits1 + invalidations1 + locked_for_mappings + badimagery1 + area_sqm + difficulty, family = BEZI, data = regression, trace = F)
summary(model)
```
Testing the removal of single model terms with drop1()
```{r}
drop1(model, parallel = "multicore", ncpus = 4)
```
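Because BEZI models the mean on a logit link by default, exponentiating the mu coefficients gives odds-ratio-style effect sizes; a minimal sketch, not part of the original analysis:
```{r}
# Odds-ratio-style effect sizes for the mean (mu) submodel (sketch);
# BEZI uses a logit link for mu by default
exp(coef(model, what = "mu"))
```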
Describe percentage_area_covered_by_building for all projects
```{r}
# Convert back to a 0-100 scale for descriptive statistics
regression$percentage_area_covered_by_building <- regression$percentage_area_covered_by_building * 100
describe(regression$percentage_area_covered_by_building)
sd(regression$percentage_area_covered_by_building, na.rm = TRUE)
```
Display box-plot of percentage_area_covered_by_building for all projects
```{r}
p <- ggplot(regression, aes(y = percentage_area_covered_by_building)) +
  geom_boxplot()
p + theme_classic()
```
Describe percentage_area_covered_by_building by split category
```{r}
regression %>% group_by(splits1) %>% summarise(n_tasks = n())
tapply(regression$percentage_area_covered_by_building, regression$splits1, summary)
```
Display box-plot of percentage_area_covered_by_building by split category
```{r}
regression$splits1 <- as.character(regression$splits1)
p <- ggplot(regression, aes(x = splits1, y = percentage_area_covered_by_building, color = splits1)) +
  geom_boxplot()
p + theme_classic()
```