---
title: "Assignment 7 - Machine Learning"
author: "Ken Benoit and Slava Mikhaylov"
output: html_document
---
Assignments for the course focus on practical aspects of the concepts covered in the lectures. Assignments are largely based on the material covered in James et al. (2013). You will start working on the assignments in the lab sessions after the lectures, but may need to finish them after class.
You will be asked to submit your assignments via Moodle by 7pm on the day of the class. We will subsequently open up solutions to the problem sets.
### Exercise summary
This exercise revisits regression as a "machine learning" tool, for both linear and logistic regression, as well as simple machine learning using the Naive Bayes and kNN algorithms.
For this exercise we will use the Stata dataset `dail2002.dta` from the article Kenneth Benoit and Michael Marsh. 2008. "[The Campaign Value of Incumbency: A New Solution to the Puzzle of Less Effective Incumbent Spending.](http://www.kenbenoit.net/pdfs/ajps_348.pdf)" *American Journal of Political Science* 52(4, October): 874-890. This is available directly [here](http://www.kenbenoit.net/files/dail2002.dta). To load this into R, you will need the `read.dta` command from the `foreign` package. (Note that you can load straight from the URL using this command.) Call this data object `dail2002`.
```{r}
require(foreign, quietly = TRUE)
dail2002 <- read.dta("http://www.kenbenoit.net/files/dail2002.dta")
```
1. **Partitioning the dataset into "folds".** For this exercise, we will fit a model to a subset of the data, use the fitted model to predict the outcomes for the "left-out" set, and use those predictions to evaluate root mean squared error (RMSE) or accuracy.
Use the `sample()` command to draw a random sample of one-fifth of the observations in the `dail2002.dta` dataset as a test set, and four-fifths to be a training set.
```{r}
# assign each row independently to "test" or "train" with probabilities .2/.8,
# giving approximately (not exactly) a one-fifth test set
dail2002$tclass <- sample(c("test", "train"), nrow(dail2002), replace = TRUE, prob = c(.2, .8))
```
We will use this indicator variable `tclass` to partition the dataset in the questions below. Be sure you understand how we used the `sample()` command this way!
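As a quick check (a minimal sketch, not part of the original solution), we can tabulate `tclass` to confirm the split is roughly 20/80:
```{r}
# counts and proportions of the test/train partition
table(dail2002$tclass)
prop.table(table(dail2002$tclass))
```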
2. **OLS regression for prediction.**
1. Fit a regression from the dataset to predict `votes1st`. You may use any combination of
regressors that you wish. Save the model object to `reg2_1`.
```{r}
reg2_1 <- lm(votes1st ~ incumb:spend_total + senator + gender + electorate, data=dail2002)
```
2. Predict the `votes1st` from the same sample to which you fitted the regression. What is the
Root Mean Squared Error (RMSE) and how would you interpret this?
```{r}
length(predict(reg2_1))  # shortened to 462 because of the two missing cases
# refit the model with na.action=na.exclude (and with main effects via
# incumb*spend_total) so that predictions are padded to full length with NA
reg2_1 <- lm(votes1st ~ incumb*spend_total + senator + gender + electorate,
             data=dail2002, na.action=na.exclude)
reg2_1pred <- predict(reg2_1)
length(reg2_1pred)  # now the full length, with NA for the missing cases
sum(is.na(reg2_1pred))
# function to compute the RMSE of a fitted model; dividing by the residual
# degrees of freedom (rather than n) makes this the "residual standard error"
# reported by summary()
computeRMSE <- function(model) {
    sqrt(sum(residuals(model)^2, na.rm=TRUE) / df.residual(model))
}
computeRMSE(reg2_1)
# compare to output from summary()
summary(reg2_1)
```
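To interpret this: the quantity computed above is

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i} \hat{u}_i^2}{n - k}},$$

the square root of the sum of squared residuals over the residual degrees of freedom, which matches the "residual standard error" from `summary()`. It is measured in the units of the response, so it is roughly the typical size of a prediction error in first-preference votes.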
3. Drop the incumbency variable -- that you hopefully included in your answer to 2.1! -- and
repeat steps 2.1--2.2. Compute a new RMSE and compare this to the previous one. Which
is a better predictor?
```{r}
reg2_3 <- lm(votes1st ~ spend_total + senator + gender + electorate,
data=dail2002, na.action=na.exclude)
computeRMSE(reg2_3)
summary(reg2_3)$r.squared
```
**The model with incumbency is a better predictor, with a lower RMSE and a higher $R^2$.**
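Both RMSEs above are in-sample. As a minimal sketch (an addition, not part of the original solution), the `tclass` partition from question 1 can be used to compute an out-of-sample RMSE instead:
```{r}
# fit the 2.2 specification on the training portion only
oosModel <- lm(votes1st ~ incumb*spend_total + senator + gender + electorate,
               data = subset(dail2002, tclass == "train"), na.action = na.exclude)
# predict the held-out test portion
oosPred <- predict(oosModel, newdata = subset(dail2002, tclass == "test"))
# out-of-sample RMSE, averaging over the predicted (non-missing) cases
sqrt(mean((subset(dail2002, tclass == "test")$votes1st - oosPred)^2, na.rm = TRUE))
```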
3. **Logistic regression for prediction**.
1. Fit a logistic regression (hint: use `glm()`) to predict the outcome variable `wonseat`. Use
any specification that you think provides a good prediction.
```{r}
reg3_1 <- glm(wonseat ~ incumb:spend_total + senator + gender + electorate,
              data=dail2002, family=binomial, na.action=na.exclude)
```
2. For the full sample, compute:
* a table of actual `wonseat` by predicted `wonseat`
```{r}
# predict() returns log-odds by default, so > 0 means Pr(wonseat) > 0.5
wonseat_predicted <- ifelse(predict(reg3_1) > 0, 1, 0)
(predtable <- table(wonseat_predicted, dail2002$wonseat))
```
* percent correctly predicted
```{r}
sum(diag(prop.table(predtable))) * 100
```
* precision
* recall
```{r}
# define a general function to compute precision & recall; assumes rows are
# predicted classes, columns are actual classes, and the positive class is
# in cell [1,1]
precrecall <- function(mytable, verbose=TRUE) {
    truePositives <- mytable[1,1]
    falsePositives <- sum(mytable[1,]) - truePositives
    falseNegatives <- sum(mytable[,1]) - truePositives
    precision <- truePositives / (truePositives + falsePositives)
    recall <- truePositives / (truePositives + falseNegatives)
    if (verbose) {
        print(mytable)
        cat("\n precision =", round(precision, 2),
            "\n recall =", round(recall, 2), "\n")
    }
    invisible(c(precision, recall))
}
# reorder rows and columns so the "won seat" cell is in position [1,1], then
# compute precision and recall for predicting seats won (not seats lost!)
precrecall(predtable[2:1, 2:1])
```
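With the positive class (a seat won) in cell [1,1] of the reordered table, the function computes

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}.$$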
3. Comparing two models.
* Compute an 8-fold validation, where for 8 different training sets consisting of 7/8 of the observations, you predict the other held-out 1/8 and compare the actual to predicted for the 1/8 test set. Compute an average F1 score for the 8 models.
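The F1 score is the harmonic mean of precision and recall,

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$

which is what `2 * prod(prrec) / sum(prrec)` computes inside the loop below.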
```{r}
# here it helps to know that n=464 and 464 %% 8 == 0, so the 8 folds are equal-sized
# (note this assigns folds in row order; wrapping rep() in sample() would randomize)
foldIndex <- rep(1:8, each = nrow(dail2002)/8)
# initialize vector to store F1 scores
f1result <- c()
# loop over folds
cat("fold number: ")
for (i in 1:max(foldIndex)) {
    cat(i)
    # fit on the 7/8 of observations outside fold i
    glmresult <- glm(wonseat ~ incumb:spend_total + senator + gender + electorate,
                     na.action = na.exclude, family = "binomial",
                     data = subset(dail2002, foldIndex != i))
    # predict the held-out fold i, classifying at Pr = 0.5 (log-odds > 0)
    wonseatPred <- ifelse(predict(glmresult, newdata = subset(dail2002, foldIndex == i)) > 0, 1, 0)
    predtable <- table(wonseatPred, subset(dail2002, foldIndex == i)$wonseat)
    prrec <- precrecall(predtable[2:1, 2:1], verbose = FALSE)
    # F1 = harmonic mean of precision and recall
    f1result[i] <- 2 * prod(prrec) / sum(prrec)
}
cat("\n")
f1result
mean(f1result)
```
* Now drop a variable or two, and repeat the previous step to compare the average F1 score for this model.
```{r}
f1result <- c()
# loop over folds
for (i in 1:8) {
    glmresult <- glm(wonseat ~ spend_total + senator + gender + electorate,
                     na.action = na.exclude, family = "binomial",
                     data = subset(dail2002, foldIndex != i))
    wonseatPred <- ifelse(predict(glmresult, newdata = subset(dail2002, foldIndex == i)) > 0, 1, 0)
    predtable <- table(wonseatPred, subset(dail2002, foldIndex == i)$wonseat)
    prrec <- precrecall(predtable[2:1, 2:1], verbose = FALSE)
    f1result[i] <- 2 * prod(prrec) / sum(prrec)
}
f1result
mean(f1result)
```
* Why is it valuable to use the different folds here, rather than simply comparing the F1 score for the predicted outcome of the entire sample, fit to the entire sample?
**Because a model fit and evaluated on the same data tends to overfit, overstating its predictive performance. To calibrate a predictive model we need to test it out of sample, and cross-validation lets every observation serve for both fitting and testing while keeping the two roles separate, as the check below illustrates.**
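As a concrete check (an addition, not part of the original solution), we can compute the in-sample F1 for the full model fit to the whole dataset and compare it with the cross-validated average above; the in-sample figure will typically be at least as high:
```{r}
# in-sample F1: fit and evaluate on the same (full) data
glmFull <- glm(wonseat ~ incumb:spend_total + senator + gender + electorate,
               na.action = na.exclude, family = "binomial", data = dail2002)
predtableFull <- table(ifelse(predict(glmFull) > 0, 1, 0), dail2002$wonseat)
prrecFull <- precrecall(predtableFull[2:1, 2:1], verbose = FALSE)
2 * prod(prrecFull) / sum(prrecFull)
```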
4. **kNN prediction.**
1. Fit a $k=1$ kNN model using the same specification as in your logistic regression. Compare the
percent correctly predicted. Which model worked better? (Note: you can use the whole sample here, so that the results are comparable to those from 3.1.)
```{r}
# get the variables we want for kNN (note: `senator`, which appeared in the
# logistic specification, is omitted here)
dail2002sub <- subset(dail2002, select = c(spend_total, incumb, gender, electorate))
# convert the factor to numeric 0 or 1
dail2002sub$gender <- unclass(dail2002sub$gender) - 1
# figure out which are complete
completeCasesIndex <- complete.cases(dail2002sub)
# set the training and test datasets to be identical
train <- test <- dail2002sub[completeCasesIndex, ]
require(class)
# make the true class a factor
wonseatf <- factor(dail2002$wonseat[completeCasesIndex], labels = c("No", "Yes"))
knnPred <- knn(train, test, wonseatf, k=1)
precrecall(table(knnPred, wonseatf)[2:1, 2:1])
```
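Two caveats worth noting (additions, not part of the original solution): first, with `train` identical to `test` and $k=1$, each point is its own nearest neighbour, so in-sample accuracy is trivially near-perfect; second, kNN is distance-based, so `electorate`, which is on a much larger scale than the 0/1 dummies, dominates the Euclidean distance. A minimal sketch of standardizing the features first:
```{r}
# standardize each feature to mean 0, sd 1 before computing distances
trainScaled <- testScaled <- scale(train)
precrecall(table(knn(trainScaled, testScaled, wonseatf, k = 1), wonseatf)[2:1, 2:1])
```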
2. Experiment with two more settings of $k$ to see how this affects prediction, reporting the percent
correctly predicted.
```{r}
# k = 2
precrecall(table(knnPred <- knn(train, test, wonseatf, k=2), wonseatf)[2:1, 2:1])
# k = 3
precrecall(table(knnPred <- knn(train, test, wonseatf, k=3), wonseatf)[2:1, 2:1])
# k = 4
precrecall(table(knnPred <- knn(train, test, wonseatf, k=4), wonseatf)[2:1, 2:1])
# k = 5
precrecall(table(knnPred <- knn(train, test, wonseatf, k=5), wonseatf)[2:1, 2:1])
```
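The calls above report precision and recall; since the question asks for the percent correctly predicted, a short sketch (an addition to the original solution) computes it for each $k$:
```{r}
# percent correctly predicted = 100 * sum of the diagonal of the proportion table
pctCorrect <- function(k) {
    predtab <- table(knn(train, test, wonseatf, k = k), wonseatf)
    sum(diag(prop.table(predtab))) * 100
}
sapply(1:5, pctCorrect)
```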
We can also plot precision against recall to evaluate performance at each level of `k` (a precision-recall plot, similar in spirit to an ROC curve). Note that this will not produce "curves" because we are not fitting to different folds or datasets, just once to recover the predictions from the whole training set.
```{r}
maxk <- 10
prresults <- data.frame(precision = NA, recall = NA)
for (k in 1:maxk) {
    prresults[k, ] <- precrecall(table(knn(train, test, wonseatf, k = k), wonseatf)[2:1, 2:1],
                                 verbose = FALSE)
}
with(prresults, plot(recall, precision, type="n", xlim=c(.5, 1), ylim=c(.5, 1)))
with(prresults, text(recall, precision, paste0("k=", 1:maxk)))
```