---
title: "Exam - Introduction to Data Science and Big Data Analytics"
author: "Ken Benoit and Slava Mikhaylov"
output: html_document
---
## Question 1
Using the `Boston` data set, predict the per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.
(a) For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.
```{r}
data(Boston, package = "MASS")
summary(Boston)
# recode chas (a 0/1 dummy for tracts bordering the Charles River) as a factor
Boston$chas <- factor(Boston$chas, labels = c("N", "Y"))
summary(Boston)
attach(Boston)
# regress crim on each predictor in turn; the "yes"/"no" comments flag
# whether the slope is statistically significant at the 5% level
lm.zn <- lm(crim ~ zn)
summary(lm.zn) # yes
lm.indus <- lm(crim ~ indus)
summary(lm.indus) # yes
lm.chas <- lm(crim ~ chas)
summary(lm.chas) # no
lm.nox <- lm(crim ~ nox)
summary(lm.nox) # yes
lm.rm <- lm(crim ~ rm)
summary(lm.rm) # yes
lm.age <- lm(crim ~ age)
summary(lm.age) # yes
lm.dis <- lm(crim ~ dis)
summary(lm.dis) # yes
lm.rad <- lm(crim ~ rad)
summary(lm.rad) # yes
lm.tax <- lm(crim ~ tax)
summary(lm.tax) # yes
lm.ptratio <- lm(crim ~ ptratio)
summary(lm.ptratio) # yes
lm.black <- lm(crim ~ black)
summary(lm.black) # yes
lm.lstat <- lm(crim ~ lstat)
summary(lm.lstat) # yes
lm.medv <- lm(crim ~ medv)
summary(lm.medv) # yes
```
**All predictors except `chas` show a statistically significant association with `crim`. Diagnostic plots for each fit can be produced by calling `plot()` on the fitted model to inspect the residuals.**
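For example, for one of the significant fits (a minimal sketch; any of the models above can be substituted for `lm.medv`):
```{r}
# standard lm diagnostics: residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(lm.medv)
par(mfrow = c(1, 1))
```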
(b) Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis $H_0 : \beta_j = 0$?
```{r}
lm.all <- lm(crim ~ . , data = Boston)
summary(lm.all)
```
**We can reject $H_0$ for `zn`, `dis`, `rad`, `black`, and `medv` (each significant at the 5% level).**
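The same list can be read off programmatically (a small sketch using the fit above):
```{r}
# predictors (excluding the intercept) significant at the 5% level
coefs <- summary(lm.all)$coefficients[-1, ]
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
```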
(c) How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the $x$-axis, and the multiple regression coefficients from (b) on the $y$-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regression model is shown on the $x$-axis, and its coefficient estimate in the multiple linear regression model is shown on the $y$-axis.
```{r}
# collect the 13 univariate slopes and the corresponding
# multiple-regression estimates
simple.fits <- list(lm.zn, lm.indus, lm.chas, lm.nox, lm.rm, lm.age, lm.dis,
                    lm.rad, lm.tax, lm.ptratio, lm.black, lm.lstat, lm.medv)
x <- sapply(simple.fits, function(fit) coefficients(fit)[2])
y <- coefficients(lm.all)[2:14]
plot(x, y)
```
**The coefficient on `nox` changes most dramatically: approximately 31 in the univariate model but approximately -10 in the multiple regression model, a reversal of sign once the other predictors are held fixed.**
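To identify the points directly, the same plot can be drawn with labels (an optional sketch; `x` and `y` come from the chunk above):
```{r}
plot(x, y, xlab = "Univariate coefficient", ylab = "Multiple regression coefficient")
text(x, y, labels = names(x), pos = 4, cex = 0.7)
```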
## Question 2
Using the `Boston` data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression and KNN models using various subsets of the predictors. Describe your findings.
```{r}
# binary response: is the crime rate above the median?  "Above" is the
# second factor level, so glm() will model P(Above)
crime01 <- factor(ifelse(crim > median(crim), "Above", "Below"),
                  levels = c("Below", "Above"))
train <- 1:(nrow(Boston) / 2)
test <- (nrow(Boston) / 2 + 1):nrow(Boston)
Boston.train <- Boston[train, ]
Boston.test <- Boston[test, ]
crime01.test <- crime01[test]
```
```{r}
# logistic regression on all predictors except crim itself
glm.fit <- glm(crime01 ~ . - crim, data = Boston, family = binomial,
               subset = train)
```
```{r}
glm.probs <- predict(glm.fit, Boston.test, type = "response")
glm.pred <- rep("Below", length(glm.probs))
glm.pred[glm.probs > 0.5] <- "Above"
mean(glm.pred != crime01.test)
```
**18.2% test error rate.**
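The corresponding confusion matrix gives a fuller picture than the error rate alone (a quick sketch using the objects above):
```{r}
table(predicted = glm.pred, actual = crime01.test)
```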
```{r}
# drop chas and tax from the model
glm.fit <- glm(crime01 ~ . - crim - chas - tax, data = Boston,
               family = binomial, subset = train)
```
```{r}
glm.probs <- predict(glm.fit, Boston.test, type = "response")
glm.pred <- rep("Below", length(glm.probs))
glm.pred[glm.probs > 0.5] <- "Above"
mean(glm.pred != crime01.test)
```
**18.6% test error rate.**
```{r}
# KNN
library(class)
# cbind() coerces the chas factor to its integer codes (1/2); a constant
# shift per column leaves the KNN distances unchanged
train.X <- cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black,
                 lstat, medv)[train, ]
test.X <- cbind(zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black,
                lstat, medv)[test, ]
train.crime01 <- crime01[train]
set.seed(1)
# KNN(k=1)
knn.pred <- knn(train.X, test.X, train.crime01, k = 1)
mean(knn.pred != crime01.test)
```
**45.8% test error rate.**
```{r}
# KNN(k=10)
knn.pred <- knn(train.X, test.X, train.crime01, k = 10)
mean(knn.pred != crime01.test)
```
**11.1% test error rate.**
```{r}
# KNN(k=100)
knn.pred <- knn(train.X, test.X, train.crime01, k = 100)
mean(knn.pred != crime01.test)
```
**49.0% test error rate.**
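To choose $k$ more systematically, a small loop over candidate values can be used (an illustrative sketch; `knn()` breaks distance ties at random, so exact errors may vary slightly):
```{r}
set.seed(1)
for (k in c(1, 5, 10, 50, 100)) {
    pred <- knn(train.X, test.X, train.crime01, k = k)
    cat("k =", k, " test error =", round(mean(pred != crime01.test), 3), "\n")
}
```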
```{r}
# KNN(k=10) with subset of variables
train.X <- cbind(zn, nox, rm, dis, rad, ptratio, black, medv)[train, ]
test.X <- cbind(zn, nox, rm, dis, rad, ptratio, black, medv)[test, ]
knn.pred <- knn(train.X, test.X, train.crime01, k = 10)
mean(knn.pred != crime01.test)
```
**28.5% test error rate. Overall, KNN with $k = 10$ on the full set of predictors achieves the lowest test error (11.1%), outperforming both logistic regression fits (18.2% and 18.6%).**
## Question 3
This question uses the `Boston` housing data set from the `MASS` library.
```{r}
library(MASS)
summary(Boston)
set.seed(1)
```
a) Bootstrap estimate of the standard error of the median of `crim`:
```{r}
nreplicates <- 1000
bsresult <- numeric(nreplicates) # set up a results variable, length 1000
for (i in 1:nreplicates) {
bsresult[i] <- median(sample(Boston$crim, replace = TRUE))
}
sd(bsresult)
# or:
require(boot)
boot.fn <- function(data, index) return(median(data[index]))
boot(Boston$crim, boot.fn, 1000)
```
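The bootstrap distribution itself can also be inspected (a quick sketch using `bsresult` from the loop above):
```{r}
hist(bsresult, breaks = 30, xlab = "Bootstrap replicate of median(crim)",
     main = "Bootstrap distribution of the median")
```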
b) Estimate a bootstrapped standard error for the coefficient of `medv` in a logistic regression model of the above/below-median crime binary variable from Question 2, with `medv`, `indus`, `age`, `black`, and `ptratio` as predictors. Compare this to the asymptotic standard error from the maximum likelihood estimation (reported by `summary.glm()`).
```{r}
require(boot)
Boston$bicrim <- ifelse(Boston$crim < median(Boston$crim), 0, 1)
glm.bootfit <- function(data, index) {
    glm.fit <- glm(bicrim ~ medv + indus + black + ptratio, data = data,
                   subset = index, family = binomial)
    # the first coefficient is the intercept, so the second is medv
    # (note: age, listed in the question, is not included in this fit)
    return(coef(glm.fit)[2])
}
boot(Boston, glm.bootfit, R = 100)
# compare to the asymptotic ML s.e.
summary(glm(bicrim ~ medv + indus + black + ptratio, data = Boston, family = binomial))
```
**The bootstrapped s.e. is approximately 0.0132, whereas the asymptotic ML s.e. is 0.0171.**
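To put the two numbers side by side, the ML standard error for `medv` can be extracted directly (a small convenience sketch using `bicrim` from the chunk above):
```{r}
ml.fit <- glm(bicrim ~ medv + indus + black + ptratio, data = Boston,
              family = binomial)
summary(ml.fit)$coefficients["medv", "Std. Error"]
```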
## Question 4
Using the `quanteda` package, apply a dictionary of populism terms to the Irish 2010 budget debate corpus, grouped by party, and plot each party's proportion of populist terms.
```{r}
require(quanteda, quietly = TRUE, warn.conflicts = FALSE)
# note: this uses the classic (pre-v2) quanteda API; in later releases the
# corpus was renamed data_corpus_irishbudget2010 (now in quanteda.textmodels)
partyDfm <- dfm(ie2010Corpus, groups = "party", verbose = FALSE)
populismDict <- dictionary(list(populism = c("elit*", "consensus*", "undemocratic",
                                             "referend*", "corrupt*", "propagand*",
                                             "politici*", "*deceit*", "*deceiv*",
                                             "*betray*", "shame*", "scandal*",
                                             "truth*", "dishonest*", "establishm*",
                                             "ruling*")))
partyDfmPop <- dfm(ie2010Corpus, dictionary = populismDict, groups = "party")
dotchart(as.matrix(partyDfmPop / rowSums(partyDfm))[, 1], xlab = "Populism Proportion")
```
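The plotted proportions can also be printed directly, sorted from least to most populist (a small convenience sketch):
```{r}
sort(as.matrix(partyDfmPop / rowSums(partyDfm))[, 1])
```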
## Question 5
Here we use k-means clustering to see whether we can recover the party groupings of the 1984 US House of Representatives from their voting records on 16 votes. The data are the object `HouseVotes84` from the `mlbench` package. Since the votes are stored as a data frame of factors, use the following code to transform them into a numeric form that will work with the `kmeans()` function.
```{r}
data(HouseVotes84, package = "mlbench")
HouseVotes84num <- as.data.frame(lapply(HouseVotes84[, -1], unclass))
HouseVotes84num[is.na(HouseVotes84num)] <- 0
set.seed(2)
```
a. What does each line of that code snippet do, and why was this operation needed? What is the `-1` indexing for?
**`lapply(HouseVotes84[, -1], unclass)` converts each factor column into its integer codes: 1 for `n` (nay) and 2 for `y` (yea). `as.data.frame()` ensures the result is a data.frame rather than a list. The `-1` index drops the first column, `Class` (the party label), so that only the votes remain. The `is.na()` and 0 assignment converts `NA` values (representing abstentions) into zeroes.**
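A quick check of the recoding (illustrative; `V1` is the first vote column in `HouseVotes84`):
```{r}
levels(HouseVotes84$V1)   # "n" "y"
table(HouseVotes84num$V1) # 0 = missing, 1 = n, 2 = y
```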
b. Perform a k-means clustering on the votes-only data, for 2 classes, after setting the seed as above. Construct a table comparing each Congressperson's actual party (you will find this as one of the variables in the `HouseVotes84` data) to the cluster assigned by the k-means procedure. Report the
i. accuracy
ii. precision
iii. recall
```{r}
# a general function to compute precision and recall from a confusion table;
# it treats the top-left cell as the true positives
precrecall <- function(mytable, verbose = TRUE) {
    truePositives <- mytable[1, 1]
    falsePositives <- sum(mytable[1, ]) - truePositives
    falseNegatives <- sum(mytable[, 1]) - truePositives
    precision <- truePositives / (truePositives + falsePositives)
    recall <- truePositives / (truePositives + falseNegatives)
    if (verbose) {
        print(mytable)
        cat("\n precision =", round(precision, 2),
            "\n recall =", round(recall, 2),
            "\n accuracy =", sum(diag(mytable)) / sum(mytable),
            "\n")
    }
    invisible(c(precision, recall))
}

# compare kmeans cluster assignments to the actual party labels
set.seed(2) # reset the seed, as above
(kmt <- table(kmeans(HouseVotes84num, 2)$cluster, HouseVotes84$Class))
precrecall(kmt)
```
c. Repeat b twice more to produce two more confusion matrix tables (three in total), comparing the results. Are they the same? If not, why not?
```{r}
(kmt <- table(kmeans(HouseVotes84num, 2)$cluster, HouseVotes84$Class))
precrecall(kmt)
(kmt <- table(kmeans(HouseVotes84num, 2)$cluster, HouseVotes84$Class))
precrecall(kmt)
```
**The results are not the same because `kmeans()` uses random starting values for the cluster centers. On a repeated run, different starting values can reverse the cluster labels (or occasionally change the assignments themselves). But since this is unsupervised -- `kmeans()` has no idea what the two groups are supposed to represent -- one labelling is as good as the other.**
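One way to make the solution stable, up to the arbitrary label swap, is to run `kmeans()` with many random starts (an illustrative sketch):
```{r}
set.seed(2)
km <- kmeans(HouseVotes84num, 2, nstart = 25)
table(km$cluster, HouseVotes84$Class)
```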