-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPaycheck Protection Program.Rmd
386 lines (287 loc) · 10.8 KB
/
Paycheck Protection Program.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
---
title: 'STA1040: Group_Assignment_2'
author: "Group 1"
date: "2024-10-28"
output: html_document
---
```{r setup, include=FALSE}
# Load knitr package to set global chunk options
knitr::opts_chunk$set(warning = FALSE)
```
```{r}
# install.packages("tidyverse")
# load the packages
library(tidyverse)
library(lubridate)
library(gglpot2)
library(dplyr)
```
```{r}
# options(timeout = 300) # Increase timeout to 300 seconds (5 minutes)
# download.file("https://dxl-datasets.s3.amazonaws.com/data/ppp_fl.csv", destfile = "ppp_fl.csv")
```
```{r}
# Load the downloaded CSV file
data <- read_csv("ppp_fl.csv")
```
```{r}
# viewing whole data in a new tab
view(data)
```
```{r}
# Dimension of dataframe
dim(data)
```
```{r}
# The first few rows
head(data)
```
```{r}
# The last few rows
tail(data)
```
```{r}
# To know all of our column names
colnames(data)
```
**Data cleaning**
```{r}
# Checking each variable type
glimpse(data)
```
As you can see from the above the data types of the variables: Date, Amount, AmountForgiven, BusinessAge, BusinessType need some modifications.
```{r}
# Changing the data type of the variable and setting a format
data$Date <- as.Date(data$Date, format = "%m/%d/%Y")
# Checking the change
class(data$Date)
```
```{r}
# Changing data type of the 2 variables to numeric
# But before converting them - we have to remove the $ and , sign
data$Amount <- gsub("[$,]", "", data$Amount)
data$AmountForgiven <- gsub("[$,]", "", data$AmountForgiven)
glimpse(data)
```
```{r}
# Changing the data types
data <- data %>%
mutate(Amount = as.numeric(data$Amount),
AmountForgiven = as.numeric(data$AmountForgiven))
# Renaming the columns to indicate the currency is in dollars
data <- data %>%
rename(Amount_in_Dollar = Amount, AmountForgiven_in_Dollar = AmountForgiven)
# Checking the change
class(data$Amount_in_Dollar)
class(data$AmountForgiven_in_Dollar)
```
For BusinessAge, and BusinessType we have to check whether we want them to be factor data types.
```{r}
# Extracting unique values using unique(), and lapply() to appply the function to all the elements in the list
unique_values <- lapply(list(BusinessAge = data$BusinessAge,
BusinessType = data$BusinessType), unique)
# Print the unique values
print(unique_values)
```
```{r}
# Changing the data type of both variables in the above to factor
data <- data %>%
mutate(BusinessAge = as.factor(BusinessAge),
BusinessType = as.factor(BusinessType))
class(data$BusinessAge)
class(data$BusinessType)
```
```{r}
glimpse(data)
```
```{r}
# checking for missing Values
anyNA(data)
```
```{r}
#Checking for duplicated values
anyDuplicated(data)
```
```{r}
# Returns a logical vector indicating duplicated rows
duplicates <- duplicated(data)
# Show only duplicated rows
data[duplicates, ]
```
**Breaking down the Borrower Address Column**
This allows for analysis of geographic distribution, grouping by state or city, and identifying regional trends
```{r}
# Split BorrowerAddress into street, city, and a combined state/zip column
data <- data %>%
separate(BorrowerAddress, into = c("BorrowerStreet", "BorrowerCity", "BorrowerStateZip"), sep = ",\\s*", extra = "merge", fill = "right") %>%
separate(BorrowerStateZip, into = c("BorrowerState", "BorrowerZip_Code"), sep = "\\s", extra = "merge", fill = "right")
# Preview the resulting dataset with the new address columns
View(data)
```
```{r}
# Ensure ZIP codes are in the correct format (5 digits)
data <- data %>%
mutate(BorrowerZip_Code = substr(BorrowerZip_Code, 1, 5))
```
```{r}
glimpse(data)
```
**Breaking down the Date Column**
This breakdown is important because having separate columns for year, month, and day enables us to analyze data at a more granular level, such as trends over time or seasonal effects.
```{r}
data <- data %>%
mutate(Date = ymd(Date), # Converting the Date to proper date format (MM/DD/YYYY)
Year = year(Date), # Extracting the year
Month = month(Date), # Extracting the month
Day = day(Date)) # Extracting the day
# View the resulting dataset
View(data)
```
**Breaking down the Lender Address Column**
By separating components like street, city, state, and zip code, you can analyze each aspect individually. For example, you can focus on city-wise trends, which might not be apparent when the addresses are combined.
```{r}
# Separate LenderAddress into multiple components and add them as new columns
data <- data %>%
separate(LenderAddress, into = c("LenderStreet", "LenderCity", "LenderStateZip"), sep = ",\\s*", extra = "merge", fill = "right") %>%
separate(LenderStateZip, into = c("LenderState", "LenderZipCode"), sep = "\\s", extra = "merge", fill = "right")
# Preview the resulting dataset with the new address columns
View(data)
```
```{r}
# Ensure ZIP codes are in the correct format (5 digits)
data <- data %>%
mutate(LenderZipCode = substr(LenderZipCode, 1, 5))
# Preview the resulting dataset
View(data)
```
```{r}
# Save the cleaned dataset as a CSV file
write.csv(data, "cleaned_dataset.csv", row.names = FALSE)
```
```{r}
# Import the cleaned dataset
cleaned_data <- read_csv("cleaned_dataset.csv")
```
```{r}
View(cleaned_data)
```
```{r}
glimpse(cleaned_data)
```
```{r}
# Changing the data type of both variables in the above to factor
cleaned_data <- cleaned_data %>%
mutate(BusinessAge = as.factor(BusinessAge),
BusinessType = as.factor(BusinessType))
class(cleaned_data$BusinessAge)
class(cleaned_data$BusinessType)
```
```{r}
glimpse(cleaned_data)
```
### Question 2: Return the total amount of money loaned by month from 2020 to 2021.
```{r}
# First let's add a new "Total_loan_in_dollar" column
cleaned_data <- cleaned_data %>%
mutate(Total_loan_in_dollar = Amount_in_Dollar + AmountForgiven_in_Dollar)
#cleaned_data$Total_loan_in_dollar <- cleaned_data$Amount_in_Dollar + cleaned_data$AmountForgiven_in_Dollar
#view(cleaned_data)
```
```{r}
# Convert numeric months to full month names
cleaned_data <- cleaned_data %>%
mutate(Month = month.name[Month])
```
```{r}
# Grouping by month and returning the total loan
grouped_data <- cleaned_data %>%
filter(Year %in% c(2020, 2021)) %>%
group_by(Month) %>%
summarise(Total_amount_of_money_loaned_from_2020_to_2021 = sum(Total_loan_in_dollar))
# View the grouped data
view(grouped_data)
```
```{r}
# Define the correct order for the months
month_order <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
# Convert the Month column to a factor with the specified levels
grouped_data$Month <- factor(grouped_data$Month, levels = month_order)
# Order the data frame by Month
ordered_data <- grouped_data[order(grouped_data$Month), ]
view(ordered_data)
```
### Question 3: Return the name and total money loaned for Florida's top PPP lender in 2021 ("top" = most money loaned)
```{r}
# Filtering lenders in FL in 2021, summing the loan amounts
florida_top_lender <- cleaned_data %>%
filter(Year == 2021 & LenderState == "FL") %>%
group_by(Lender) %>%
summarise(Total_loan_in_dollar = sum(Total_loan_in_dollar, na.rm = TRUE)) %>%
arrange(desc(Total_loan_in_dollar)) # Ordering from highest to lowest
# View the sorted lender data
view(florida_top_lender)
# Get the lender with the maximum total loan amount
top_lender <- florida_top_lender %>%
slice_max(Total_loan_in_dollar) # Alternative to filter for max value
top_lender
```
### Question 4: what percent of loan money went to borrowers in the state of Miami?
```{r}
# Filter for Miami and calculate total loan amount for Miami
miami_total <- data %>%
filter(BorrowerCity == "Miami") %>%
summarise(TotalMiamiLoan = sum(Amount_in_Dollar, na.rm = TRUE))
# Calculate total loan amount for all borrowers
total_loan_amount <- data %>%
summarise(TotalLoan = sum(Amount_in_Dollar, na.rm = TRUE))
# Calculate the percentage of loan money for Miami
miami_percentage <- (miami_total$TotalMiamiLoan / total_loan_amount$TotalLoan) * 100
# Display the result
miami_percentage
```
### Question 5: Bar plot showing average jobs by business age in certain groups
```{r}
# Categorize Business Age into the four specified categories
data <- data %>%
mutate(BusinessAgeCategory = recode(BusinessAge,
"Startup, Loan Funds will Open Business" = "Startup",
"2 years old or less" = "0-2 years",
"Existing or more than 2 years old" = "3+ years",
.default = "Other/Unknown"
))
# Calculate the average jobs by business age category and drop the groups
agg_tbl <- data %>%
group_by(BusinessAgeCategory) %>%
summarise(Average_Jobs = mean(Jobs), .groups = 'drop')
# Reorder the BusinessAgeCategory factor based on Average_Jobs in descending order
agg_tbl <- agg_tbl %>%
mutate(BusinessAgeCategory = fct_reorder(BusinessAgeCategory, Average_Jobs, .desc = TRUE))
# Plot the average jobs by business age category, now ordered from highest to lowest
ggplot(agg_tbl, aes(x = BusinessAgeCategory, y = Average_Jobs)) +
geom_bar(stat = "identity", fill = "skyblue") +
theme_minimal() +
labs(title = "Average Jobs by Business Age Category", x = "Business Age Category", y = "Average Jobs") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
### Question 6: Horizontal bar plot showing total amount borrowed in each zip code in the city of Coral Gables
```{r}
# Filter for Coral Gables and only 5-digit zip codes
coral_gables_data <- cleaned_data %>%
filter(BorrowerCity == "CORAL GABLES" & nchar(as.character(BorrowerZip_Code)) == 5)
# Summarize the total amount borrowed by zipcode
zipcode_summary <- coral_gables_data %>%
group_by(BorrowerZip_Code) %>%
summarise(Total_Amount_Borrowed = sum(Amount_in_Dollar, na.rm = TRUE)) %>%
arrange(desc(Total_Amount_Borrowed))
# Convert amount to millions for better readability
zipcode_summary$Total_Amount_Borrowed <- zipcode_summary$Total_Amount_Borrowed / 1e6
# Create the horizontal bar plot
ggplot(zipcode_summary, aes(x = reorder(BorrowerZip_Code, Total_Amount_Borrowed), y = Total_Amount_Borrowed)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() + # Flip coordinates for horizontal bars
scale_y_continuous(labels = scales::dollar_format(scale = 1, prefix = "$", suffix = "M")) + # Format y-axis labels
labs(title = "Total Amount Borrowed in Coral Gables by Zip Code",
x = "Zip Code",
y = "Total Amount Borrowed (in Millions)") +
theme_minimal()
```