-
Notifications
You must be signed in to change notification settings - Fork 0
/
01-Build_DS_and_EDA.Rmd
322 lines (242 loc) · 13.4 KB
/
01-Build_DS_and_EDA.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
---
title: "Exploratory Data Analysis on the CO_County_x_2014_2018.xlxs"
author: "David Martinez ([email protected])"
date: "May 7, 2019"
output:
pdf_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Purpose
To perform Exploratory Data Analysis on a dataset containing four years of Colorado county level medical and recreational marijuana sales as well as state revenue (taxes) collected. Two excel workbook sheets are imported to R and merged together, one for sales, the other for taxes. <br>
#### 1. Sales file description (CO_County_Sales_2014_2018.xlsx)<br>
The sales files contains not only county level medical and recreational sales by year, but also population information and location information (State, County, Latitude, Longitude, Region). Additionally, medical and recreational sales for each county were applied against county population to determine an average of sales per county citizen for both medical and recreational sales.<br>
**Dataset fields:**<br>
State - Currently only "COLORADO"<br>
County - Colorado County Name (e.g., "Adams" or "Yuma")<br>
Latitude - Latitude of County center<br>
Longitude - Longitude of County Center<br>
Region - An arbitrary assignment I made to quarter the state into geographic quadrants.<br>
Year - Collection Year<br>
Population - Estimated population between census reporting periods<br>
Med_Sales - County level sales of Medical Marijuana (see value explanation below)<br>
Rec_Sales - County level sales of Recreational Marijuana (see value explanation below)<br>
med_sales_per_citizen - a calculated value determined by dividing the "Med_Sales" value by the "Population" value.<br>
rec_sales_pre_citizen - a calculated value determined by dividing the "Rec_Sales" value by the "Population" value.<br>
Med_Sales, Rec_sales, and the two calculated values have three possible values:<br>
**0** = No Sales of legal Marijuana occurred in that county. The original source material did not include counties that had no sales. This information was added to show a full statewide picture as well as county adoption over time.<br>
**NR** = Not releasable due to confidentiality requirements. The sum of all NR counties ("Not Reported" in the 'County' column) are captured as the last line for each year.<br>
**x** = A positive number representing sales at the dollar level.<br><br>
#### 2. Taxes file description (CO_County_Taxes_2014_2018.xlsx)<br>
The taxes file contains taxes collected per county in three columns: Medical Sales Tax (2.9%), Retail Sales Tax (2.9%), Retail Marijuana Special Sales Tax.
**Dataset fields:**<br>
County - Colorado County Name (e.g., "Adams" or "Yuma")<br>
Year - Collection Year<br>
Medical Sales Tax (2.9%) - Sales tax applied to medical marijuana only. This is the only state tax paid.
Retail Sales Tax (2.9%) - Sales tax applied to retail marijuana. Starting in 2018, this tax was no longer collected.
Retail Marijuana Special Sales Tax - an additional tax on retail marijuana sales.
Medical Sales Tax (2.9%), Retail Sales Tax (2.9%), Retail Marijuana Special Sales Tax have three possible values:<br>
**0** = No taxes from legal Marijuana occurred in that county. The original source material did not include counties that had no tax information. This information was added to show a full statewide picture as well as county adoption over time.<br>
**NR** = Not releasable due to confidentiality requirements. The sum of all NR counties ("Not Reported" in the 'County' column) are captured as the last line for each year.<br>
**x** = A number representing taxes at the dollar level. Negative values indicate previous months overpayment of taxes being returned.<br><br>
## Data Collection and Merging steps
### Dependencies
If needed, these packages can be installed using the install.packages() function
```{r message=FALSE, warning=FALSE}
library("readxl")
library("formattable")
library("tidyverse")
library("tidyr")
library("ggplot2")
library("ggrepel")
```
### Collect and Merge data
The two files are read into tibbles and then merged into a dataframe. Data is subsetted to remove 2018 values. A loop is performed to convert the sales and tax features to numeric and currency.
```{r}
#gather sales and tax data into tibbles
sales_mj <- read_xlsx("CO_County_Sales_2014_2018.xlsx", sheet = "aggregate", range = NULL, col_names = TRUE)
taxes_mj <- read_xlsx("CO_County_Taxes_2014_2018.xlsx", sheet = "aggregate", range = NULL, col_names = TRUE)
#merge the two tibbles - {base} merge returns a dataframe
CCMDs <- merge(sales_mj, taxes_mj)
#remove 2018 values until both spreadsheets are fully populated
CCMDs <- subset(CCMDs, Year < "2018")
#create a list of column names for columns 8 through 14 - these are the columns related to sales and taxes
cashcol <- colnames(CCMDs[8:14])
#loop to convert sales/tax columns to numeric/currency - this will introduce NAs for each of the 7 columns (resulting in a warning for each column)
for (i in cashcol) {
CCMDs[[i]] <- as.numeric(CCMDs[[i]]) #character to numeric
CCMDs[[i]] <- currency(CCMDs[[i]], digits = 0L) #numeric but with currency symbology
}
#clean up unneeded files
rm(cashcol, i)
rm(sales_mj, taxes_mj)
#write dataframe to disk
write_excel_csv(CCMDs, "CCMDs.csv", na = "NA", append = FALSE)
```
## Exploratory Data Analysis
```{r}
summary(CCMDs)
```
<br>
There are 260 observation with 14 variables. There are 346 NA's (~10%) that will need to be addressed.
```{r}
#count the NAs
sum(is.na(CCMDs))
#remove the NAs
CCMDs <- na.omit(CCMDs)
plot(CCMDs[,7:14])
```
<br>Right from the start, the data are redundant. Specifically, the 'med_sales_per_citizen' and 'rec_sales_per_citizen' variables which are calculated by dividing the county sales by the county population. Similarly, the tax data is also a function of the sales data. For now, I'm going to ignore that data.
```{r}
library(corrplot)
#subset out unnecessary columns
plot(CCMDs[,7:9])
#corrplot the value features
m <- cor(CCMDs[, c(7:9)], use = "complete.obs", method = "spearman")
corrplot(m, method="number", type = "lower", order = "hclust", tl.srt = 45)
```
<br>Of course, it makes sense that population would correlate to sales and that med/rec sales would correlate so highly to each other.
```{r}
CCMDs_f1 <- filter(CCMDs, Med_Sales != 0 | Rec_Sales != 0)
CCMDs_f2 <- filter(CCMDs, Med_Sales == 0 & Rec_Sales == 0)
ggplot(data=CCMDs_f1, aes(x=as.factor(Year)))+
geom_boxplot(aes(y=Rec_Sales/100000), color="green", show.legend=TRUE)+
geom_boxplot(aes(y=Med_Sales/100000), color="red", show.legend=TRUE)+
labs(title="Aggregate Retail and Medical Marijuana Sales since 2014 by Year", x= "Year", y= "Sales")+
theme_bw()+
theme(axis.text.x = element_text(angle = 90))
ggplot(data=CCMDs_f1, aes(x=as.factor(County)))+
geom_boxplot(aes(y=Rec_Sales/100000), color="green", show.legend=TRUE)+
geom_boxplot(aes(y=Med_Sales/100000), color="red", show.legend=TRUE)+
labs(title="Aggregate Retail and Medical Marijuana Sales since 2014 by County", x= "County", y= "Sales per $100K")+
theme_bw()+
theme(axis.text.x = element_text(angle = 90))
```
```{r}
#disable scientific notation
options(scipen = 999)
#table(CCMDs[, 1,8:9])
#filter by year
ggplot(data=CCMDs, aes(x = Population))+
geom_point(aes(y = Rec_Sales/100000), color="green", show.legend = TRUE, size = 1.25)+
geom_point(aes(y = Med_Sales/100000), color="red", show.legend = TRUE, size = 1.25)+
labs(title="Medical and Retail Marijuana Sales as a measure of population", x= "Colorado County Population", y= "Marijuana Sales per $100K")+
theme_bw()+
coord_flip()
```
<br> Four clusters seem to be immediately evident (low pop/low sales, med pop/low sales, high pop/low sales, and high pop/higher sales starting around 1000K). Cluster analysis will need to be performed to validate optimal cluster size and groupings.
```{r}
#https://www.r-bloggers.com/finding-optimal-number-of-clusters/
#first scale and sequester the data fields of interest into a matrix.
CCMDs_scale <- scale(CCMDs[,7:9])
#use elbow method to determine optimal number of clusters
set.seed(12345)
#set the max number of clusters
k.max <- 15
#generate within-cluster sum of squares
wss <- sapply(1:k.max, function(k){kmeans(CCMDs_scale, k, nstart=50, iter.max=15)$tot.withinss})
#generate elbow plot
plot(1:k.max, wss, type="b", pch=19, frame=FALSE, xlab="number of clusters K", ylab="Total within-clusters sum of squares")
```
<br> It seems that three clusters was the magic number. I'll also use NbClust to additionally validate the number of clusters. NbClust uses multiple indices to determine the number of clusters and most optimal clustering scheme. Optimal clusters are by determining the amount of variation between the data points.
```{r}
#install.packages("NbClust", dependencies=TRUE)
library(NbClust)
nb <- NbClust(CCMDs_scale, distance="euclidean", min.nc=2, max.nc = 5, method="kmeans", index="all", alphaBeale = 0.1)
hist(nb$Best.nc[1,], breaks = max(na.omit(nb$Best.nc[1,])))
```
<br> three clusters it is.
```{r}
#use kmeans to perform cluster identification
cluster <- kmeans(CCMDs_scale, centers=3, iter.max=50, nstart=5)
#bind the cluster field to to the values from scale
cluster <- cluster$cluster
#convert cluster number to factor
cluster <- as.factor(cluster)
ggplot(data=CCMDs, aes(x = Population/1000, color=cluster))+
geom_point(aes(y = Rec_Sales/100000), show.legend = TRUE, size = 1.25)+
geom_point(aes(y = Med_Sales/100000), show.legend = TRUE, size = 1.25)+
labs(title="Medical and Retail Marijuana Sales as a measure of population", x= "Colorado County Population per 1K", y= "Marijuana Sales per $100K")+
theme_bw()+
coord_flip()
```
<br> It seems that three clusters was the magic number. One question becomes immediately obvious. Why are there areas with a high population (>600K) but low sales? This will need to be further investigated.
```{r}
CCMDs_f3 <- filter(CCMDs, Population > 600000)
ggplot(data=CCMDs, aes(x = Population/1000, color=cluster))+
geom_point(aes(y = Rec_Sales/100000), show.legend = TRUE, size = 1.25)+
stat_ellipse(aes(y = Rec_Sales/100000))+
geom_point(aes(y = Med_Sales/100000), show.legend = TRUE, size = 1.25)+
stat_ellipse(aes(y = Med_Sales/100000))+
labs(title="Medical and Retail Marijuana Sales against county population", x= "Colorado County Population per 1K", y= "Marijuana Sales per $100K")+
theme_bw()+
coord_flip()
ggplot(data=CCMDs, aes(x = Population/1000, color=cluster))+
geom_point(aes(y = Rec_Sales/100000), show.legend = TRUE, size = 1.25)+
stat_ellipse(aes(y = Rec_Sales/100000))+
theme_bw()+
coord_flip()
```
```{r}
```
ref: http://rpubs.com/sinhrks/plot_pca
<br>
<br>
<br>
spare or inwork stuff...
ggplot(data=CCMDs, aes(x = Population))+
geom_point(aes(y = Rec_Sales/100000), color="green", show.legend = TRUE, size = 1.25)+
labs(title="Retail Marijuana Sales as a measure of population", x= "Colorado County Population", y= "Marijuana Sales per $100K")+
theme_bw()+
coord_flip()
ggplot(data=CCMDs, aes(x = Population))+
geom_point(aes(y = Med_Sales/100000), color="red", show.legend = TRUE, size = 1.25)+
labs(title="Medical Marijuana Sales as a measure of population", x= "Colorado County Population", y= "Marijuana Sales per $100K")+
theme_bw()+
coord_flip()
ggplot(data=CCMDs, mapping=aes(x=Year))+
geom_col(mapping=aes(y=Rec_Sales/100000), position=position_dodge(), fill="green")+
geom_col(mapping=aes(y=Med_Sales/100000), position=position_dodge() , fill="red")+
labs(title="Colorado Medical and Retail Marijuana Sales 2014-2018", x= "Years of Recreational Legalization", y= "Sales per $100K")+
theme_bw()
ggplot(data=CCMDs, aes(x=as.factor(CCMDs$County)))+
geom_boxplot(aes(y=Rec_Sales), color="green")+
geom_boxplot(aes(y=Med_Sales), color="red")+
labs(title="Colorado Retail and Medical Marijuana Sales since 2014", x= "Colorado County", y= "Sales")+
theme_bw()+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
ggplot(data=CCMDs, mapping=aes(x=Year))+
geom_jitter(aes(y=Rec_Sales), color="green")+
geom_jitter(aes(y=Med_Sales), color="red")+
labs(title="Colorado Medical and Retail Marijuana Sales 2014-2018", x= "Years of Recreational Legalization", y= "Sales per $100K")+
theme_bw()
CCMDs_pop <- CCMDs[, c(1,2,7)]
CCMDs_pop$County <- as.factor(CCMDs_pop$County)
CCMDs_pop <- CCMDs_pop[order(CCMDs_pop$Population),]
str(CCMDs_pop)
#CCMDs_pop <- CCMDs_pop[order(CCMDs_pop$Population),]
View(CCMDs_pop)
#ggplot(data=CCMDs, mapping=aes(x=Year, y=Population, group=Region))+
# geom_line(color="red")+
# geom_point(color="red")+
# scale_y_continuous(limits = c(min(CCMDs$Population), max(CCMDs$Population)))+
# geom_label_repel(aes(label = Population), nudge_x = 1)+
# labs(title="Colorado Estimated Population Growth since 2010 Census")+
# theme_bw()
#CCMDs_pop <- data.frame(order(CCMDs_pop$Population))
ggplot(data=CCMDs_pop, mapping=aes(x=County, y=Population))+
geom_bar(stat="identity", position="dodge", fill="blue")+
labs(title="Colorado Population by County", x= "County", y= "Population")+
theme_bw()+
theme(axis.text.x = element_text(angle = 90))
ggplot(data=CCMDs_pop, aes(x=as.factor(CCMDs_pop$County)))+
geom_boxplot(aes(y=Population), color="blue")+
labs(title="Colorado Population by County", x= "County", y= "Population")+
theme_bw()+
theme(axis.text.x = element_text(angle = 90))
#corrplot the value features
#m <- cor(CCMDs[, c(8:14)], use = "complete.obs", method = "spearman")
#require("corrplot")
#corrplot(m, type = "upper", order = "hclust", tl.srt = 45)