---
title: "Week 10 - Homework"
author: "STAT 420, Summer 2018, BALAJI SATHYAMURTHY (BALAJIS2)"
date: '07/21/2019'
output:
  html_document:
    toc: yes
  pdf_document: default
urlcolor: cyan
---
***
```{r setup, echo = FALSE, message = FALSE, warning = FALSE}
options(scipen = 1, digits = 4, width = 80)
# fig.align is a knitr chunk option, not an R option
knitr::opts_chunk$set(fig.align = "center")
```
## Exercise 1 (Simulating Wald and Likelihood Ratio Tests)
In this exercise we will investigate the distributions of hypothesis tests for logistic regression. For this exercise, we will use the following predictors.
```{r}
sample_size = 150
set.seed(420)
x1 = rnorm(n = sample_size)
x2 = rnorm(n = sample_size)
x3 = rnorm(n = sample_size)
```
Recall that
$$
p({\bf x}) = P[Y = 1 \mid {\bf X} = {\bf x}]
$$
Consider the true model
$$
\log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1
$$
where
- $\beta_0 = 0.4$
- $\beta_1 = -0.35$
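Solving the logit equation for $p({\bf x})$ gives the inverse-logit form used to compute success probabilities in the simulation below:
$$
p({\bf x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1)}}
$$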
**(a)** To investigate the distributions, simulate from this model 2500 times. To do so, calculate
$$
P[Y = 1 \mid {\bf X} = {\bf x}]
$$
for an observation, and then make a random draw from a Bernoulli distribution with that success probability. (Note that a Bernoulli distribution is a Binomial distribution with parameter $n = 1$. There is no dedicated function in `R` for a Bernoulli distribution.)
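For example, `rbinom()` with `size = 1` produces Bernoulli draws directly:
```{r}
# a Bernoulli draw is a binomial draw with size = 1; one draw per probability here
rbinom(n = 3, size = 1, prob = c(0.1, 0.5, 0.9))
```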
Each time, fit the model:
$$
\log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3
$$
Store the test statistics for two tests:
- The Wald test for $H_0: \beta_2 = 0$, which we say follows a standard normal distribution for "large" samples
- The likelihood ratio test for $H_0: \beta_2 = \beta_3 = 0$, which we say follows a $\chi^2$ distribution (with some degrees of freedom) for "large" samples
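For reference, the likelihood ratio statistic is twice the difference in maximized log-likelihoods, which equals the difference in deviances between the null and full fits; with two restricted parameters here, it is asymptotically $\chi^2$ with $2$ degrees of freedom:
$$
\text{LRT} = 2\left(\hat{\ell}_{\text{full}} - \hat{\ell}_{\text{null}}\right) = D_{\text{null}} - D_{\text{full}}
$$
This is exactly the quantity in the `Deviance` column of `anova(fit_null, fit_full, test = "LRT")`, which the simulation below records.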
```{r}
set.seed(420)
num_sims = 2500
df_test_stats1 = data.frame("WaldTest" = rep(0, num_sims), "LRT" = rep(0, num_sims))
# simulate a response (and fresh predictors) from the true model, where beta_2 = beta_3 = 0
sim_logistic_data = function(sample_size = 150, beta0 = 0.4, beta1 = -0.35, beta2 = 0, beta3 = 0) {
  x1 = rnorm(n = sample_size)
  x2 = rnorm(n = sample_size)
  x3 = rnorm(n = sample_size)
  eta = beta0 + beta1 * x1 + beta2 * x2 + beta3 * x3
  p = 1 / (1 + exp(-eta))
  y = rbinom(n = sample_size, size = 1, prob = p)
  data.frame(y, x1, x2, x3)
}
for (i in 1:num_sims) {
  sim_data = sim_logistic_data()
  fit_glm_all = glm(y ~ x1 + x2 + x3, data = sim_data, family = "binomial")
  fit_glm_small = glm(y ~ x1, data = sim_data, family = "binomial")
  # Wald statistic for H_0: beta_2 = 0
  df_test_stats1[i, "WaldTest"] = summary(fit_glm_all)$coefficients["x2", "z value"]
  # LRT statistic for H_0: beta_2 = beta_3 = 0 (difference in deviances)
  df_test_stats1[i, "LRT"] = anova(fit_glm_small, fit_glm_all, test = "LRT")[2, "Deviance"]
}
```
**(b)** Plot a histogram of the empirical values for the Wald test statistic. Overlay the density of the true distribution assuming a large sample.
```{r}
hist(df_test_stats1$WaldTest, xlab = "Wald Test - z value", col = "orange",
     main = "Density of Wald Test - Test statistic (sample size = 150)", prob = TRUE)
# under H_0, the Wald statistic is approximately standard normal for large samples
curve(dnorm(x), col = "red", add = TRUE, lwd = 3)
```
**(c)** Use the empirical results for the Wald test statistic to estimate the probability of observing a test statistic larger than 1. Also report this probability using the true distribution of the test statistic assuming a large sample.
```{r}
WaldTst1_EmpDist = mean(df_test_stats1$WaldTest > 1)
# true large-sample probability: P(Z > 1) for Z ~ N(0, 1)
WaldTst1_TrueDist = pnorm(1, lower.tail = FALSE)
```
The estimated probability of observing a test statistic larger than 1 is `r WaldTst1_EmpDist` under the empirical distribution of the Wald statistic (sample size 150) and `r WaldTst1_TrueDist` under the true (standard normal) distribution.
**(d)** Plot a histogram of the empirical values for the likelihood ratio test statistic. Overlay the density of the true distribution assuming a large sample.
```{r}
# degrees of freedom = number of restricted parameters (beta_2 and beta_3)
p = 3
q = 1
hist(df_test_stats1$LRT, xlab = "Deviance", col = "orange",
     main = "Density of LRT - Test statistic (sample size = 150)", prob = TRUE)
curve(dchisq(x, df = p - q), col = "red", add = TRUE, lwd = 3)
```
**(e)** Use the empirical results for the likelihood ratio test statistic to estimate the probability of observing a test statistic larger than 5. Also report this probability using the true distribution of the test statistic assuming a large sample.
```{r}
LRTTst1_EmpDist = mean(df_test_stats1$LRT > 5)
# true large-sample probability: P(X > 5) for X ~ chi-square with p - q = 2 df
LRTTst1_TrueDist = pchisq(5, df = p - q, lower.tail = FALSE)
```
The estimated probability of observing a test statistic larger than 5 is `r LRTTst1_EmpDist` under the empirical distribution of the LRT statistic (sample size 150) and `r LRTTst1_TrueDist` under the true ($\chi^2_2$) distribution.
**(f)** Repeat **(a)**-**(e)** but with simulation using a smaller sample size of 10. Based on these results, is this sample size large enough to use the standard normal and $\chi^2$ distributions in this situation? Explain.
```{r}
sample_size = 10
set.seed(420)
x1 = rnorm(n = sample_size)
x2 = rnorm(n = sample_size)
x3 = rnorm(n = sample_size)
df_test_stats2 = data.frame("WaldTest" = rep(0, num_sims), "LRT" = rep(0, num_sims))
```
```{r warning=FALSE}
for (i in 1:num_sims) {
  sim_data = sim_logistic_data(sample_size = 10)
  fit_glm_all = glm(y ~ x1 + x2 + x3, data = sim_data, family = "binomial")
  fit_glm_small = glm(y ~ x1, data = sim_data, family = "binomial")
  df_test_stats2[i, "WaldTest"] = summary(fit_glm_all)$coefficients["x2", "z value"]
  df_test_stats2[i, "LRT"] = anova(fit_glm_small, fit_glm_all, test = "LRT")[2, "Deviance"]
}
hist(df_test_stats2$WaldTest, xlab = "Wald Test - z value", col = "orange",
     main = "Density of Wald Test - Test statistic (sample size = 10)", prob = TRUE)
# overlay the large-sample (standard normal) density for comparison
curve(dnorm(x), col = "red", add = TRUE, lwd = 3)
WaldTst2_EmpDist = mean(df_test_stats2$WaldTest > 1)
# true large-sample probability: P(Z > 1)
WaldTst2_TrueDist = pnorm(1, lower.tail = FALSE)
```
The estimated probability of observing a test statistic larger than 1 is `r WaldTst2_EmpDist` under the empirical distribution of the Wald statistic (sample size 10) and `r WaldTst2_TrueDist` under the true (standard normal) distribution.
```{r warning=FALSE}
p = 3
q = 1
hist(df_test_stats2$LRT, xlab = "Deviance", col = "orange",
     main = "Density of LRT - Test statistic (sample size = 10)", prob = TRUE)
curve(dchisq(x, df = p - q), col = "red", add = TRUE, lwd = 3)
LRTTst2_EmpDist = mean(df_test_stats2$LRT > 5)
# true large-sample probability: P(X > 5) for X ~ chi-square with 2 df
LRTTst2_TrueDist = pchisq(5, df = p - q, lower.tail = FALSE)
```
The estimated probability of observing a test statistic larger than 5 is `r LRTTst2_EmpDist` under the empirical distribution of the LRT statistic (sample size 10) and `r LRTTst2_TrueDist` under the true ($\chi^2_2$) distribution. With a sample size of 10, the histograms and these probabilities deviate noticeably from their large-sample references, so this sample size is not large enough to justify using the standard normal and $\chi^2$ distributions in this situation.
***
## Exercise 2 (Surviving the Titanic)
For this exercise use the `ptitanic` data from the `rpart.plot` package. (The `rpart.plot` package depends on the `rpart` package.) Use `?rpart.plot::ptitanic` to learn about this dataset. We will use logistic regression to help predict which passengers aboard the [Titanic](https://en.wikipedia.org/wiki/RMS_Titanic) will survive based on various attributes.
```{r, message = FALSE, warning = FALSE}
# install.packages("rpart")
# install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
data("ptitanic")
```
For simplicity, we will remove any observations with missing data. Additionally, we will create train and test datasets.
```{r}
ptitanic = na.omit(ptitanic)
set.seed(42)
```
**(a)** Consider the model
$$
\log\left(\frac{p({\bf x})}{1 - p({\bf x})}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_3x_4
$$
where
$$
p({\bf x}) = P[Y = 1 \mid {\bf X} = {\bf x}]
$$
is the probability that a certain passenger survives given their attributes and
- $x_1$ is a dummy variable that takes the value $1$ if a passenger was 2nd class.
- $x_2$ is a dummy variable that takes the value $1$ if a passenger was 3rd class.
- $x_3$ is a dummy variable that takes the value $1$ if a passenger was male.
- $x_4$ is the age in years of a passenger.
Fit this model to the training data and report its deviance.
```{r}
set.seed(42)
# dummy variables matching the model specification above
ptitanic$x1 = ifelse(ptitanic$pclass == "2nd", 1, 0)
ptitanic$x2 = ifelse(ptitanic$pclass == "3rd", 1, 0)
ptitanic$x3 = ifelse(ptitanic$sex == "male", 1, 0)
ptitanic$x4 = ptitanic$age
ptitanic$y = ifelse(ptitanic$survived == "survived", 1, 0)
trn_idx = sample(nrow(ptitanic), 300)
ptitanic_trn = ptitanic[trn_idx, ]
ptitanic_tst = ptitanic[-trn_idx, ]
fit_titanic_glm = glm(y ~ x1 + x2 + x3 + x4 + x3:x4, data = ptitanic_trn, family = "binomial")
dev_value = deviance(fit_titanic_glm)
# Wald statistics and p-values used in parts (b) and (c)
ts_x1 = summary(fit_titanic_glm)$coefficients["x1", "z value"]
ts_x2 = summary(fit_titanic_glm)$coefficients["x2", "z value"]
pvalue_x1 = summary(fit_titanic_glm)$coefficients["x1", "Pr(>|z|)"]
pvalue_x2 = summary(fit_titanic_glm)$coefficients["x2", "Pr(>|z|)"]
ts_x3x4 = summary(fit_titanic_glm)$coefficients["x3:x4", "z value"]
pvalue_x3x4 = summary(fit_titanic_glm)$coefficients["x3:x4", "Pr(>|z|)"]
```
The deviance of the above model is `r dev_value`.
**(b)** Use the model fit in **(a)** and an appropriate statistical test to determine if class played a significant role in surviving on the Titanic. Use $\alpha = 0.01$. Report:
- The null hypothesis of the test
$$
H_0: \beta_1 = \beta_2 = 0
$$
- The test statistic of the test
The test statistics of the two Wald tests are `r ts_x1` (for $x_1$) and `r ts_x2` (for $x_2$).
- The p-value of the test
The p-values of the two Wald tests are `r pvalue_x1` and `r pvalue_x2`.
- A statistical decision
```{r}
pvalue_x1 < 0.01
pvalue_x2 < 0.01
```
Since both p-values are less than $\alpha = 0.01$, we reject the null hypothesis, so class plays a significant role in survival.
- A practical conclusion
Upper-class passengers were quartered on higher decks of the ship than lower-class passengers, which plausibly gave them quicker access to the lifeboats and better odds of survival.
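Since this null hypothesis restricts two coefficients at once, a likelihood ratio test comparing the model from **(a)** to a nested model with the class dummies removed tests it jointly, and is a useful cross-check on the two Wald statistics above. A minimal sketch using the training data already in hand (`fit_titanic_noclass` is a helper fit introduced here):
```{r}
# nested model without the class dummies x1 and x2
fit_titanic_noclass = glm(y ~ x3 + x4 + x3:x4, data = ptitanic_trn, family = "binomial")
# LRT for H_0: beta_1 = beta_2 = 0
anova(fit_titanic_noclass, fit_titanic_glm, test = "LRT")
```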
**(c)** Use the model fit in **(a)** and an appropriate statistical test to determine if an interaction between age and sex played a significant role in surviving on the Titanic. Use $\alpha = 0.01$. Report:
- The null hypothesis of the test
$$
H_0: \beta_5 = 0
$$
- The test statistic of the test
The test statistic of the Wald test for the interaction between age and sex is `r ts_x3x4`.
- The p-value of the test
The p-value of the Wald test for the interaction between age and sex is `r pvalue_x3x4`.
- A statistical decision
```{r}
pvalue_x3x4 < 0.01
```
Since the p-value of `r pvalue_x3x4` is less than $\alpha = 0.01$, we reject the null hypothesis; the interaction between age and sex plays a significant role.
- A practical conclusion
The effect of age on survival differs by sex: among male passengers, the young fared noticeably better than the old, which is consistent with a "women and children first" evacuation.
**(d)** Use the model fit in **(a)** as a classifier that seeks to minimize the misclassification rate. Classify each of the passengers in the test dataset. Report the misclassification rate, the sensitivity, and the specificity of this classifier. (Use survived as the positive class.)
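With survived as the positive class, the quantities requested are
$$
\text{Sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \qquad
\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}
$$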
```{r}
make_conf_mat = function(predicted, actual) {
  table(predicted = predicted, actual = actual)
}
# classify with the usual 0.5 cutoff, which targets minimal misclassification
titanic_test_pred = ifelse(predict(fit_titanic_glm, ptitanic_tst, type = "response") > 0.5, 1, 0)
(conf_mat_titanic = make_conf_mat(predicted = titanic_test_pred, actual = ptitanic_tst$y))
misclass_rate = mean(titanic_test_pred != ptitanic_tst$y)
# survived (y = 1) is the positive class
sensitivity = conf_mat_titanic[2, 2] / sum(conf_mat_titanic[, 2])
specificity = conf_mat_titanic[1, 1] / sum(conf_mat_titanic[, 1])
```
The misclassification rate, sensitivity, and specificity of this classifier are `r misclass_rate`, `r sensitivity`, and `r specificity`, respectively.
***
## Exercise 3 (Breast Cancer Detection)
For this exercise we will use data found in [`wisc-train.csv`](wisc-train.csv) and [`wisc-test.csv`](wisc-test.csv), which contain train and test data, respectively. `wisc.csv` is provided but not used. This is a modification of the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository. Only the first 10 feature variables have been provided. (And these are all you should use.)
- [UCI Page](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))
- [Data Detail](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names)
You should consider coercing the response to be a factor variable if it is not stored as one after importing the data.
**(a)** The response variable `class` has two levels: `M` if a tumor is malignant, and `B` if a tumor is benign. Fit three models to the training data.
- An additive model that uses `radius`, `smoothness`, and `texture` as predictors
- An additive model that uses all available predictors
- A model chosen via backwards selection using AIC. Use a model that considers all available predictors as well as their two-way interactions for the start of the search.
For each, obtain a 5-fold cross-validated misclassification rate using the model as a classifier that seeks to minimize the misclassification rate. Based on this, which model is best? Relative to the best, are the other two underfitting or overfitting? Report the test misclassification rate for the model you picked as the best.
```{r echo=TRUE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(boot)
wisc_train_data = read.csv("wisc-train.csv")
wisc_test_data = read.csv("wisc-test.csv")
wisc_train_data$class = ifelse(wisc_train_data$class == "M", 1, 0)
wisc_test_data$class = ifelse(wisc_test_data$class == "M", 1, 0)
# MODEL 1: small additive model
wisc_mod_add_small = glm(class ~ radius + smoothness + texture, data = wisc_train_data, family = "binomial")
# MODEL 2: additive model with all available predictors
wisc_mod_add_all = glm(class ~ ., data = wisc_train_data, family = "binomial")
# MODEL 3: all predictors plus their two-way interactions, then backward AIC search
wisc_mod_all = glm(class ~ . + (radius + texture + perimeter + area + smoothness + compactness + concavity + concave + symmetry + fractal) ^ 2, data = wisc_train_data, family = "binomial")
fit_mod_aic_step = step(wisc_mod_all, direction = "backward", trace = 0)
# the model chosen by the backward search above
wisc_mod_all_selected = glm(class ~ radius + texture + perimeter + concavity + concave + symmetry + fractal + radius:concavity + concavity:symmetry + concave:symmetry, data = wisc_train_data, family = "binomial")
# cost function so cv.glm reports a misclassification rate, not squared error
misclass_cost = function(actual, prob) mean(abs(actual - prob) > 0.5)
set.seed(42)  # reproducible folds
model1_cv_misc_rate = cv.glm(wisc_train_data, wisc_mod_add_small, cost = misclass_cost, K = 5)$delta[1]
model2_cv_misc_rate = cv.glm(wisc_train_data, wisc_mod_add_all, cost = misclass_cost, K = 5)$delta[1]
model3_cv_misc_rate = cv.glm(wisc_train_data, wisc_mod_all_selected, cost = misclass_cost, K = 5)$delta[1]
# test misclassification rate for the chosen best model (the small additive model)
wisc_test_pred = ifelse(predict(wisc_mod_add_small, wisc_test_data, type = "response") > 0.5, 1, 0)
misclass_rate = mean(wisc_test_pred != wisc_test_data$class)
```
The 5-fold cross-validated misclassification rates for the three models are:

1. Additive model using `radius`, `smoothness`, and `texture`: `r model1_cv_misc_rate`
2. Additive model using all predictors: `r model2_cv_misc_rate`
3. Model chosen via backward selection using AIC: `r model3_cv_misc_rate`

The best model based on the 5-fold cross-validated misclassification rate is the additive model that uses `radius`, `smoothness`, and `texture`. Relative to this model, both the full additive model and the model selected via backward AIC carry more predictors yet cross-validate worse, so both are overfitting.

The test misclassification rate for the model selected as best is `r misclass_rate`.
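As a sanity check on this ranking, the same $0.5$-cutoff classifier can be evaluated on the test set for each of the three fits above; a minimal sketch, where `test_misclass()` is a small helper introduced here:
```{r}
# test misclassification rate at a 0.5 cutoff for a fitted glm
test_misclass = function(fit, data) {
  pred = ifelse(predict(fit, data, type = "response") > 0.5, 1, 0)
  mean(pred != data$class)
}
c(small_additive = test_misclass(wisc_mod_add_small, wisc_test_data),
  full_additive  = test_misclass(wisc_mod_add_all, wisc_test_data),
  backward_aic   = test_misclass(wisc_mod_all_selected, wisc_test_data))
```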
**(b)** In this situation, simply minimizing misclassifications might be a bad goal since false positives and false negatives carry very different consequences. Consider the `M` class as the "positive" label. Consider each of the probabilities stored in `cutoffs` in the creation of a classifier using the **additive** model fit in **(a)**.
That is, consider each of the values stored in `cutoffs` as $c$. Obtain the sensitivity and specificity in the test set for each of these classifiers. Using a single graphic, plot both sensitivity and specificity as a function of the cutoff used to create the classifier. Based on this plot, which cutoff would you use? (0 and 1 have not been considered for coding simplicity. If you like, you can instead consider these two values.)
$$
\hat{C}(\bf x) =
\begin{cases}
1 & \hat{p}({\bf x}) > c \\
0 & \hat{p}({\bf x}) \leq c
\end{cases}
$$
```{r}
make_conf_mat = function(predicted, actual) {
  # fix both levels so the table is always 2 x 2, even at extreme cutoffs
  table(predicted = factor(predicted, levels = c(0, 1)),
        actual = factor(actual, levels = c(0, 1)))
}
cutoffs = seq(0.01, 0.99, by = 0.01)
df_sens_spec = data.frame("cutoff" = rep(0, 99), "sensitivity" = rep(0, 99), "specificity" = rep(0, 99))
# the prompt asks for the additive model from (a); the small additive (best) model is used here
wisc_test_prob = predict(wisc_mod_add_small, wisc_test_data, type = "response")
cnt = 1
for (value in cutoffs) {
  wisc_test_pred = ifelse(wisc_test_prob > value, 1, 0)
  conf_mat_wisc = make_conf_mat(predicted = wisc_test_pred, actual = wisc_test_data$class)
  df_sens_spec[cnt, "cutoff"] = value
  df_sens_spec[cnt, "sensitivity"] = conf_mat_wisc[2, 2] / sum(conf_mat_wisc[, 2])
  df_sens_spec[cnt, "specificity"] = conf_mat_wisc[1, 1] / sum(conf_mat_wisc[, 1])
  cnt = cnt + 1
}
# both rates lie on the same 0-1 scale, so a single axis with a legend is
# clearer than a two-ordinate plot
plot(df_sens_spec$cutoff, df_sens_spec$sensitivity, type = "l", col = "blue", lwd = 2,
     ylim = c(0, 1), xlab = "Cutoff", ylab = "Rate",
     main = "Sensitivity and Specificity for various cutoffs")
lines(df_sens_spec$cutoff, df_sens_spec$specificity, col = "red", lwd = 2)
legend("bottomright", legend = c("Sensitivity", "Specificity"),
       col = c("blue", "red"), lwd = 2)
```
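Based on this plot, a cutoff near where the two curves cross would balance sensitivity and specificity. Because a false negative (calling a malignant tumor benign) is far more costly than a false positive here, a somewhat smaller cutoff that keeps sensitivity high would be a reasonable choice.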