---
title: "Week 8 - Homework"
author: "STAT 420, Summer 2018, BALAJI SATHYAMURTHY (BALAJIS2)"
date: ''
output:
  html_document:
    toc: yes
  pdf_document: default
urlcolor: cyan
---
***
```{r setup, echo = FALSE, message = FALSE, warning = FALSE}
options(scipen = 1, digits = 4, width = 80)
knitr::opts_chunk$set(fig.align = "center")
```
## Exercise 1 (Writing Functions)
**(a)** Write a function named `diagnostics` that takes as input the arguments:
- `model`, an object of class `lm`, that is, a model fit via `lm()`
- `pcol`, for controlling point colors in plots, with a default value of `grey`
- `lcol`, for controlling line colors in plots, with a default value of `dodgerblue`
- `alpha`, the significance level of any test that will be performed inside the function, with a default value of `0.05`
- `plotit`, a logical value for controlling display of plots with default value `TRUE`
- `testit`, a logical value for controlling outputting the results of tests with default value `TRUE`
The function should output:
- A list with two elements when `testit` is `TRUE`:
- `p_val`, the p-value for the Shapiro-Wilk test for assessing normality
- `decision`, the decision made when performing the Shapiro-Wilk test using the `alpha` value input to the function. "Reject" if the null hypothesis is rejected, otherwise "Fail to Reject."
- Two plots, side-by-side, when `plotit` is `TRUE`:
- A fitted versus residuals plot that adds a horizontal line at $y = 0$, and labels the $x$-axis "Fitted" and the $y$-axis "Residuals." The points and line should be colored according to the input arguments. Give the plot a title.
- A Normal Q-Q plot of the residuals that adds the appropriate line using `qqline()`. The points and line should be colored according to the input arguments. Be sure the plot has a title.
Consider using this function to help with the remainder of the assignment as well.
```{r}
diagnostics = function(model, pcol = "grey", lcol = "dodgerblue", alpha = 0.05,
                       plotit = TRUE, testit = TRUE) {
  if (plotit) {
    par(mfrow = c(1, 2))
    # fitted versus residuals plot
    plot(fitted(model), resid(model), col = pcol,
         xlab = "Fitted", ylab = "Residuals", main = "Fitted vs. Residuals Plot")
    abline(h = 0, col = lcol, lwd = 2)
    # normal Q-Q plot of the residuals
    qqnorm(resid(model), col = pcol, main = "Normal Q-Q Plot")
    qqline(resid(model), lty = 2, lwd = 2, col = lcol)
  }
  if (testit) {
    # Shapiro-Wilk test for normality of the residuals
    p_val = shapiro.test(resid(model))$p.value
    decision = ifelse(p_val < alpha, "Reject", "Fail to Reject")
    list(p_val = p_val, decision = decision)
  }
}
```
**(b)** Run the following code.
```{r}
set.seed(420)
data_1 = data.frame(x = runif(n = 30, min = 0, max = 10),
y = rep(x = 0, times = 30))
data_1$y = with(data_1, 2 + 1 * x + rexp(n = 30))
fit_1 = lm(y ~ x, data = data_1)
data_2 = data.frame(x = runif(n = 20, min = 0, max = 10),
y = rep(x = 0, times = 20))
data_2$y = with(data_2, 5 + 2 * x + rnorm(n = 20))
fit_2 = lm(y ~ x, data = data_2)
data_3 = data.frame(x = runif(n = 40, min = 0, max = 10),
y = rep(x = 0, times = 40))
data_3$y = with(data_3, 2 + 1 * x + rnorm(n = 40, sd = x))
fit_3 = lm(y ~ x, data = data_3)
```
```{r fig.height=5, fig.width=10}
diagnostics(fit_1, plotit = FALSE)$p_val
diagnostics(fit_2, plotit = FALSE)$decision
diagnostics(fit_1, testit = FALSE, pcol = "black", lcol = "black")
diagnostics(fit_2, testit = FALSE, pcol = "grey", lcol = "green")
diagnostics(fit_3)
```
***
## Exercise 2 (Prostate Cancer Data)
For this exercise, we will use the `prostate` data, which can be found in the `faraway` package. After loading the `faraway` package, use `?prostate` to learn about this dataset.
```{r, message = FALSE, warning = FALSE}
library(faraway)
```
**(a)** Fit an additive multiple regression model with `lpsa` as the response and the remaining variables in the `prostate` dataset as predictors. Report the $R^2$ value for this model.
```{r}
lpsa_add_model = lm(lpsa~lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45,data = prostate)
lpsa_rsquared = summary(lpsa_add_model)$r.squared
```
The $R^2$ of the additive model is `r lpsa_rsquared`.
**(b)** Check the constant variance assumption for this model. Do you feel it has been violated? Justify your answer.
```{r}
library(lmtest)
bptest_pval = bptest(lpsa_add_model)$p.value
```
The p-value of the Breusch-Pagan test is `r bptest_pval`, which is large. At $\alpha = 0.05$ we fail to reject the null hypothesis of homoscedasticity, so the constant variance assumption does not appear to be violated.
**(c)** Check the normality assumption for this model. Do you feel it has been violated? Justify your answer.
```{r}
shptst_pval = shapiro.test(resid(lpsa_add_model))$p.value
```
The p-value of the Shapiro-Wilk test is `r shptst_pval`, which is large. At $\alpha = 0.05$ we fail to reject the null hypothesis of normality, so the normality assumption does not appear to be violated.
**(d)** Check for any high leverage observations. Report any observations you determine to have high leverage.
```{r}
prostate_data_idx = as.vector( which(hatvalues(lpsa_add_model) > 2 * mean(hatvalues(lpsa_add_model)) ))
prostate[prostate_data_idx,]
```
There are `r length(prostate_data_idx)` observations with high leverage (hat value greater than twice the mean hat value) in the dataset.
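For reference, a quick sketch comparing the leverages of the flagged observations with the $2\bar{h}$ threshold used above:
```{r}
# hat values of the flagged observations versus the 2 * mean(hat value) cutoff
hatvalues(lpsa_add_model)[prostate_data_idx]
2 * mean(hatvalues(lpsa_add_model))
```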
**(e)** Check for any influential observations. Report any observations you determine to be influential.
```{r}
prostate_influential_idx = as.vector( which(cooks.distance(lpsa_add_model) > 4 / length(cooks.distance(lpsa_add_model)) ) )
prostate_influential_idx
```
Observations `r prostate_influential_idx` are influential (Cook's distance greater than $4/n$).
**(f)** Refit the additive multiple regression model without any points you identified as influential. Compare the coefficients of this fitted model to the previously fitted model.
```{r}
prostate_data_no_inf = prostate[-prostate_influential_idx,]
lpsa_add_removed_model = lm(lpsa~lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45,data = prostate_data_no_inf)
coef(lpsa_add_model)
coef(lpsa_add_removed_model)
```
After removing the influential observations, the coefficient estimates change noticeably: the intercept becomes negative, and the remaining coefficients shift by substantial amounts.
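For a more direct comparison, the two sets of estimates can be placed side by side (a minimal sketch; the column names `original` and `no_influential` are just illustrative labels):
```{r}
# side-by-side comparison of the coefficient estimates
data.frame(
  original       = coef(lpsa_add_model),
  no_influential = coef(lpsa_add_removed_model)
)
```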
**(g)** Create a data frame that stores the observations that were "removed" because they were influential. Use the two models you have fit to make predictions with these observations. Comment on the difference between these two sets of predictions.
```{r}
prostate_data_inf = prostate[prostate_influential_idx,]
pred_full   = predict(lpsa_add_model, newdata = prostate_data_inf)
pred_no_inf = predict(lpsa_add_removed_model, newdata = prostate_data_inf)
```
Predictions for the removed (influential) observations using the original model: `r pred_full`

Predictions for the same observations using the model refit without them: `r pred_no_inf`
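To make the comparison concrete, a minimal sketch that lines up the observed responses with both sets of predictions (using the objects defined above):
```{r}
# observed lpsa values and both sets of predictions for the removed observations
data.frame(
  observed    = prostate_data_inf$lpsa,
  pred_full   = pred_full,
  pred_no_inf = pred_no_inf,
  difference  = pred_full - pred_no_inf
)
```
The size of the `difference` column indicates how strongly these points pulled on the original fit.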
***
## Exercise 3 (Why Bother?)
**Why** do we care about violations of assumptions? One key reason is that the distributions of the parameter estimators that we have used are all reliant on these assumptions. When the assumptions are violated, the distributional results are not correct, so our tests are garbage. **Garbage In, Garbage Out!**
Consider the following setup that we will use for the remainder of the exercise. We choose a sample size of 50.
```{r}
n = 50
set.seed(420)
x_1 = runif(n, 0, 5)
x_2 = runif(n, -2, 2)
```
Consider the model,
\[
Y = 4 + 1 x_1 + 0 x_2 + \epsilon.
\]
That is,
- $\beta_0$ = 4
- $\beta_1$ = 1
- $\beta_2$ = 0
We now simulate `y_1` in a manner that does **not** violate any assumptions, which we will verify. In this case $\epsilon \sim N(0, 1).$
```{r}
set.seed(1)
y_1 = 4 + 1 * x_1 + 0 * x_2 + rnorm(n = n, mean = 0, sd = 1)
fit_1 = lm(y_1 ~ x_1 + x_2)
bptest(fit_1)
```
Then, we simulate `y_2` in a manner that **does** violate assumptions, which we again verify. In this case $\epsilon \sim N(0, \sigma = |x_2|).$
```{r}
set.seed(1)
y_2 = 4 + 1 * x_1 + 0 * x_2 + rnorm(n = n, mean = 0, sd = abs(x_2))
fit_2 = lm(y_2 ~ x_1 + x_2)
bptest(fit_2)
```
**(a)** Use the following code after changing `birthday` to your birthday.
```{r}
num_sims = 2500
p_val_1 = rep(0, num_sims)
p_val_2 = rep(0, num_sims)
birthday = 19830502
set.seed(birthday)
```
Repeat the above process of generating `y_1` and `y_2` as defined above, and fit models with each as the response `2500` times. Each time, store the p-value for testing,
\[
\beta_2 = 0,
\]
using both models, in the appropriate variables defined above. (You do not need to use a data frame as we have in the past. Although, feel free to modify the code to instead use a data frame.)
**(b)** What proportion of the `p_val_1` values is less than 0.01? Less than 0.05? Less than 0.10? What proportion of the `p_val_2` values is less than 0.01? Less than 0.05? Less than 0.10? Arrange your results in a table. Briefly explain these results.
```{r}
library(flextable)
for (i in 1:num_sims) {
  # simulate under the model where all assumptions hold
  y_1 = 4 + 1 * x_1 + 0 * x_2 + rnorm(n = n, mean = 0, sd = 1)
  fit_1 = lm(y_1 ~ x_1 + x_2)
  # simulate under the model with non-constant variance
  y_2 = 4 + 1 * x_1 + 0 * x_2 + rnorm(n = n, mean = 0, sd = abs(x_2))
  fit_2 = lm(y_2 ~ x_1 + x_2)
  # store the p-values for testing beta_2 = 0 in each model
  p_val_1[i] = summary(fit_1)$coefficients["x_2", "Pr(>|t|)"]
  p_val_2[i] = summary(fit_2)$coefficients["x_2", "Pr(>|t|)"]
}
df_p_val = data.frame(
  "cutoff" = c("< 0.01", "< 0.05", "< 0.10"),
  "y_1"    = c(mean(p_val_1 < 0.01), mean(p_val_1 < 0.05), mean(p_val_1 < 0.10)),
  "y_2"    = c(mean(p_val_2 < 0.01), mean(p_val_2 < 0.05), mean(p_val_2 < 0.10))
)
flextable(df_p_val)
```
Based on the table above, for `y_1` (where all assumptions hold and $\beta_2 = 0$ is true) the proportion of p-values below each cutoff is close to the cutoff itself, which is what a valid test should produce. For `y_2` (where the constant variance assumption is violated) the proportions are well above the cutoffs, so the test for $\beta_2 = 0$ rejects far more often than it should; the distributional results the test relies on are no longer correct, and its conclusions are unreliable.
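As a visual check of the same result, the simulated p-values can also be plotted (a minimal sketch using base `hist()`); under a true null hypothesis with valid assumptions the p-values should look roughly uniform:
```{r fig.height=5, fig.width=10}
par(mfrow = c(1, 2))
hist(p_val_1, breaks = 20, col = "grey", xlab = "p-value",
     main = "p-values for beta_2 = 0, y_1 (assumptions hold)")
hist(p_val_2, breaks = 20, col = "grey", xlab = "p-value",
     main = "p-values for beta_2 = 0, y_2 (non-constant variance)")
```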
***
## Exercise 4 (Corrosion Data)
For this exercise, we will use the `corrosion` data, which can be found in the `faraway` package. After loading the `faraway` package, use `?corrosion` to learn about this dataset.
```{r, message = FALSE, warning = FALSE}
library(faraway)
```
**(a)** Fit a simple linear regression with `loss` as the response and `Fe` as the predictor. Plot a scatterplot and add the fitted line. Check the assumptions of this model.
```{r}
fit = lm(loss~Fe,data=corrosion)
plot(corrosion$Fe,corrosion$loss,xlab = "Iron%",ylab = "Weight Loss in mg",main="Weight Loss vs. Iron%",col = "orange")
abline(fit, col = "red", lwd = 2);
```
**Checking the assumptions graphically:**
```{r fig.height=5, fig.width=10}
par(mfrow = c(1,2))
plot(fitted(fit),resid(fit),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals plot")
abline(h=0,col="red",lwd=2,lty=2)
qqnorm(resid(fit),col = "orange",main = "Normal Q-Q Plot")
qqline(resid(fit),lty=2,lwd=2,col = "red")
```
Looking at the plots, the fitted versus residuals plot appears reasonable, whereas in the Q-Q plot several points deviate from the reference line. This suggests the equal variance assumption is acceptable, but the normality assumption is somewhat suspect.
**Performing the Breusch-Pagan and Shapiro-Wilk tests**
```{r}
bptest_pval = bptest(fit)$p.value
shptest_pval = shapiro.test(resid(fit))$p.value
```
The p-value of the Breusch-Pagan test is `r bptest_pval` and the p-value of the Shapiro-Wilk test is `r shptest_pval`; both are large, so we fail to reject the null hypothesis in both cases. The formal tests therefore do not indicate a violation of the equal variance or normality assumptions.
**(b)** Fit higher order polynomial models of degree 2, 3, and 4. For each, plot a fitted versus residuals plot and comment on the constant variance assumption. Based on those plots, which of these three models do you think are acceptable? Use a statistical test(s) to compare the models you just chose. Based on the test, which is preferred? Check the normality assumption of this model. Identify any influential observations of this model.
```{r fig.height=5, fig.width=10}
fit_poly_deg2 = lm(loss~Fe+I(Fe^2),data = corrosion)
fit_poly_deg3 = lm(loss~Fe+I(Fe^2)+I(Fe^3),data = corrosion)
fit_poly_deg4 = lm(loss~Fe+I(Fe^2)+I(Fe^3)+I(Fe^4),data = corrosion)
par(mfrow = c(1,3))
plot(fitted(fit_poly_deg2),resid(fit_poly_deg2),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals plot - Degree 2")
abline(h=0,col="red",lwd=2,lty=2)
plot(fitted(fit_poly_deg3),resid(fit_poly_deg3),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals plot - Degree 3")
abline(h=0,col="red",lwd=2,lty=2)
plot(fitted(fit_poly_deg4),resid(fit_poly_deg4),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals plot - Degree 4")
abline(h=0,col="red",lwd=2,lty=2)
```
```{r}
bptest_pval_deg2 = bptest(fit_poly_deg2)$p.value
bptest_pval_deg3 = bptest(fit_poly_deg3)$p.value
bptest_pval_deg4 = bptest(fit_poly_deg4)$p.value
shptest_pval_deg2 = shapiro.test(resid(fit_poly_deg2))$p.value
shptest_pval_deg3 = shapiro.test(resid(fit_poly_deg3))$p.value
shptest_pval_deg4 = shapiro.test(resid(fit_poly_deg4))$p.value
df_stat_test_results = data.frame(
  "Polynomial model"          = c("degree = 2", "degree = 3", "degree = 4"),
  "BP test p-value"           = c(bptest_pval_deg2, bptest_pval_deg3, bptest_pval_deg4),
  "Shapiro-Wilk test p-value" = c(shptest_pval_deg2, shptest_pval_deg3, shptest_pval_deg4),
  check.names = FALSE
)
flextable(df_stat_test_results)
```
From the table above, we fail to reject the null hypothesis of either test for all three polynomial models (degrees 2, 3, and 4). Based on the fitted versus residuals plots, however, the degree 3 model appears preferable to the degree 2 and degree 4 models; the nested $F$-tests sketched below can be used to confirm that choice.
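As a sketch of the requested comparison (assuming the degree 2 and degree 3 fits are the models being compared, with the degree 4 fit included for completeness), nested $F$-tests via `anova()` can be used:
```{r}
# nested F-tests comparing the candidate polynomial fits
anova(fit_poly_deg2, fit_poly_deg3)
anova(fit_poly_deg3, fit_poly_deg4)
```
A small p-value in the first comparison would favor the degree 3 model over the degree 2 model, while a large p-value in the second would suggest the degree 4 term is unnecessary.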
```{r}
which(cooks.distance(fit_poly_deg3) > 4 / length(cooks.distance(fit_poly_deg3)))
```
**There are no influential observations in the degree 3 polynomial model.**
***
## Exercise 5 (Diamonds)
The data set `diamonds` from the `ggplot2` package contains prices and characteristics of 54,000 diamonds. For this exercise, use `price` as the response variable $y$, and `carat` as the predictor $x$. Use `?diamonds` to learn more.
```{r, message = FALSE, warning = FALSE}
library(ggplot2)
```
**(a)** Fit a linear model with `price` as the response variable $y$, and `carat` as the predictor $x$. Return the summary information of this model.
```{r}
diamond_model = lm(price~carat,data =diamonds )
summary(diamond_model)
```
**(b)** Plot a scatterplot of price versus carat and add the line for the fitted model in part **(a)**.
```{r fig.height=5, fig.width=10}
plot(diamonds$carat,diamonds$price,main = "Price vs. Carat",col = "orange",xlab = "Carat",ylab = "Price")
abline(diamond_model, col = "red", lwd = 2)
```
Using a fitted versus residuals plot and/or a Q-Q plot, comment on the diagnostics.
```{r fig.height=5, fig.width=10}
par(mfrow = c(1,2))
plot(fitted(diamond_model),resid(diamond_model),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals Plot")
abline(h=0,col="red",lwd=2,lty=2)
qqnorm(resid(diamond_model),col = "orange",main = "Normal Q-Q Plot")
qqline(resid(diamond_model),lty=2,lwd=2,col = "red")
```
**From the plots above, both the equal variance and normality assumptions are suspect.**
**(c)** Seeing as the price stretches over several orders of magnitude, it seems reasonable to try a log transformation of the response. Fit a model with a logged response, plot a scatterplot of log-price versus carat and add the line for the fitted model, then use a fitted versus residuals plot and/or a Q-Q plot to comment on the diagnostics of the model.
```{r fig.height=5, fig.width=10}
diamond_log_model = lm(log(price) ~ carat, data = diamonds);
plot(diamonds$carat,log(diamonds$price),col = "orange",main = "log(Price) vs. Carat",xlab="Carat",ylab="log(Price)")
abline(diamond_log_model, col = "red", lwd = 2)
par(mfrow = c(1,2))
plot(fitted(diamond_log_model),resid(diamond_log_model),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals plot")
abline(h=0,col="red",lwd=2,lty=2)
qqnorm(resid(diamond_log_model),col = "orange",main = "Normal Q-Q Plot")
qqline(resid(diamond_log_model),lty=2,lwd=2,col = "red")
```
**From the plots above, both the equal variance and normality assumptions are still suspect.** The strongly right-skewed distribution of `price`, shown in the histogram below, is another indication that a log transformation of the response is reasonable.
```{r}
qplot(price, data = diamonds, bins = 30)
```
**(d)** Try adding log transformation of the predictor. Fit a model with a logged response and logged predictor, plot a scatterplot of log-price versus log-carat and add the line for the fitted model, then use a fitted versus residuals plot and/or a Q-Q plot to comment on the diagnostics of the model.
```{r fig.height=5, fig.width=10}
diamond_all_log_model = lm(log(price) ~ log(carat), data = diamonds);
plot(log(diamonds$carat),log(diamonds$price),col = "orange",main = "log(Price) vs. log(Carat)",xlab="log(Carat)",ylab="log(Price)")
abline(diamond_all_log_model, col = "red", lwd = 2)
par(mfrow = c(1,2))
plot(fitted(diamond_all_log_model),resid(diamond_all_log_model),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals plot")
abline(h=0,col="red",lwd=2,lty=2)
qqnorm(resid(diamond_all_log_model),col = "orange",main = "Normal Q-Q Plot")
qqline(resid(diamond_all_log_model),lty=2,lwd=2,col = "red")
```
**From the plots above, both the equal variance and normality assumptions look much better.**
**(e)** Use the model from part **(d)** to predict the price (in dollars) of a 3-carat diamond. Construct a 99% prediction interval for the price (in dollars).
```{r}
new_diamond = data.frame(carat = 3)
pred_interval = exp(predict(diamond_all_log_model, newdata = new_diamond,
                            interval = "prediction", level = 0.99))
price_est = pred_interval[, "fit"]
price_lwr = pred_interval[, "lwr"]
price_upr = pred_interval[, "upr"]
```
The predicted price of a 3-carat diamond is `r price_est` dollars. We are 99% confident that the price of a new 3-carat diamond will fall between `r price_lwr` and `r price_upr` dollars.