---
title: "Week 2 - Homework"
author: "STAT 420, Summer 2018, BALAJI SATHYAMURTHY (BALAJIS2)"
date: ''
output:
  html_document:
    toc: yes
  pdf_document: default
urlcolor: cyan
---
# Directions
- Be sure to remove this section if you use this `.Rmd` file as a template.
- You may leave the questions in your final document.
***
## Exercise 1 (Using `lm`)
For this exercise we will use the `cats` dataset from the `MASS` package. You should use `?cats` to learn about the background of this dataset.
**(a)** Suppose we would like to understand the size of a cat's heart based on the body weight of a cat. Fit a simple linear model in `R` that accomplishes this task. Store the results in a variable called `cat_model`. Output the result of calling `summary()` on `cat_model`.
```{r}
library(MASS)
cat_model = lm(Hwt ~ Bwt, data = cats)
summary(cat_model)
```
**(b)** Output only the estimated regression coefficients. Interpret $\hat{\beta_0}$ and $\beta_1$ in the *context of the problem*. Be aware that only one of those is an estimate.
```{r}
coef(cat_model)
```
**$\hat{\beta_0}$, the estimated intercept, is the estimated mean heart weight (in grams) of a cat with body weight 0 kg, which has no practical meaning in this context. $\beta_1$ is the true, unknown change in mean heart weight (in grams) for each additional kilogram of body weight; the slope printed above, $\hat{\beta_1}$, is its estimate.**
**(c)** Use your model to predict the heart weight of a cat that weighs **2.7** kg. Do you feel confident in this prediction? Briefly explain.
```{r}
predict(cat_model, newdata = data.frame(Bwt = 2.7))
range(cats$Bwt)
```
**The observed body weights range from 2.0 to 3.9 kg. Since 2.7 kg falls inside this range, the prediction is an interpolation, so we can be reasonably confident in it.**
**(d)** Use your model to predict the heart weight of a cat that weighs **4.4** kg. Do you feel confident in this prediction? Briefly explain.
```{r}
predict(cat_model, newdata = data.frame(Bwt = 4.4))
range(cats$Bwt)
```
**The observed body weights range from 2.0 to 3.9 kg. Since 4.4 kg falls outside this range, the prediction is an extrapolation, so we should not be confident in it.**
**(e)** Create a scatterplot of the data and add the fitted regression line. Make sure your plot is well labeled and is somewhat visually appealing.
```{r}
plot(Hwt ~ Bwt, data = cats, xlab = "Body weight (kg)", ylab = "Heart weight (g)",
     main = "Cat Heart Weight vs. Body Weight", col = "grey")
abline(cat_model, lwd = 3, col = "darkorange")
```
**(f)** Report the value of $R^2$ for the model. Do so directly. Do not simply copy and paste the value from the full output in the console after running `summary()` in part **(a)**.
```{r}
summary(cat_model)$r.squared
```
***
## Exercise 2 (Writing Functions)
This exercise is a continuation of Exercise 1.
**(a)** Write a function called `get_sd_est` that calculates an estimate of $\sigma$ in one of two ways depending on input to the function. The function should take three arguments as input:
- `fitted_vals` - A vector of fitted values from a model
- `actual_vals` - A vector of the true values of the response
- `mle` - A logical (`TRUE` / `FALSE`) variable which defaults to `FALSE`
The function should return a single value:
- $s_e$ if `mle` is set to `FALSE`.
- $\hat{\sigma}$ if `mle` is set to `TRUE`.
```{r}
get_sd_est = function(fitted_vals, actual_vals, mle = FALSE) {
  # sum of squared residuals
  sse = sum((actual_vals - fitted_vals) ^ 2)
  # divide by n - 2 for the unbiased estimate s_e, by n for the MLE sigma-hat
  n = if (mle) length(actual_vals) else length(actual_vals) - 2
  sqrt(sse / n)
}
```
**(b)** Run the function `get_sd_est` on the residuals from the model in Exercise 1, with `mle` set to `FALSE`. Explain the resulting estimate in the context of the model.
```{r}
get_sd_est(cat_model$fitted.values, cats$Hwt)
```
**This is $s_e \approx 1.45$: observed heart weights typically deviate from the model's fitted values by about 1.45 grams.**
**(c)** Run the function `get_sd_est` on the residuals from the model in Exercise 1, with `mle` set to `TRUE`. Explain the resulting estimate in the context of the model. Note that we are trying to estimate the same parameter as in part **(b)**.
```{r}
get_sd_est(cat_model$fitted.values, cats$Hwt, mle = TRUE)
```
**This is $\hat{\sigma} \approx 1.44$, the maximum likelihood estimate of $\sigma$. It estimates the same parameter as in part (b) but divides by $n$ rather than $n - 2$, so it is slightly smaller.**
**(d)** To check your work, output `summary(cat_model)$sigma`. It should match at least one of **(b)** or **(c)**.
```{r}
summary(cat_model)$sigma
```
***
## Exercise 3 (Simulating SLR)
Consider the model
\[
Y_i = 5 - 3 x_i + \epsilon_i
\]
with
\[
\epsilon_i \sim N(\mu = 0, \sigma^2 = 10.24)
\]
where $\beta_0 = 5$ and $\beta_1 = -3$.
This exercise relies heavily on generating random observations. To make this reproducible we will set a seed for the randomization. Alter the following code to make `birthday` store your birthday in the format: `yyyymmdd`. For example, [William Gosset](https://en.wikipedia.org/wiki/William_Sealy_Gosset), better known as *Student*, was born on June 13, 1876, so he would use:
```{r}
birthday = 18760613
set.seed(birthday)
```
**(a)** Use `R` to simulate `n = 25` observations from the above model. For the remainder of this exercise, use the following "known" values of $x$.
You may use [the `sim_slr` function provided in the text](http://daviddalpiaz.github.io/appliedstats/simple-linear-regression.html#simulating-slr). Store the data frame this function returns in a variable of your choice. Note that this function calls $y$ `response` and $x$ `predictor`.
```{r}
x = runif(n = 25, 0, 10)
num_obs = 25
beta_0 = 5
beta_1 = -3
sigma = sqrt(10.24)
epsilon = rnorm(n = num_obs, mean = 0, sd = sigma)
y = beta_0 + beta_1 * x + epsilon
sim_data = data.frame(predictor = x, response = y)
sim_data
```
**(b)** Fit a model to your simulated data. Report the estimated coefficients. Are they close to what you would expect? Briefly explain.
```{r}
sim_fit = lm(y ~ x)
coef(sim_fit)
```
**The estimated intercept and slope, roughly 4.95 and -2.91, are close to the true values $\beta_0 = 5$ and $\beta_1 = -3$, so the fitted model closely matches the true model.**
**(c)** Plot the data you simulated in part **(a)**. Add the regression line from part **(b)** as well as the line for the true model. Hint: Keep all plotting commands in the same chunk.
```{r}
plot(y ~ x, xlab = "sim_x", ylab = "sim_y", main = "sim_x vs. sim_y", col = "blue")
abline(sim_fit, col = "red")
abline(beta_0, beta_1, col = "darkorange", lty = 2)
legend("topright", legend = c("fitted", "truth"),
       col = c("red", "darkorange"), lty = c(1, 2))
```
**(d)** Use `R` to repeat the process of simulating `n = 25` observations from the above model $1500$ times. Each time fit a SLR model to the data and store the value of $\hat{\beta_1}$ in a variable called `beta_hat_1`. Some hints:
- Consider a `for` loop.
- Create `beta_hat_1` before writing the `for` loop. Make it a vector of length $1500$ where each element is `0`.
- Inside the body of the `for` loop, simulate new $y$ data each time. Use a variable to temporarily store this data together with the known $x$ data as a data frame.
- After simulating the data, use `lm()` to fit a regression. Use a variable to temporarily store this output.
- Use the `coef()` function and `[]` to extract the correct estimated coefficient.
- Use `beta_hat_1[i]` to store in elements of `beta_hat_1`.
- See the notes on [Distribution of a Sample Mean](http://daviddalpiaz.github.io/appliedstats/introduction-to-r.html#distribution-of-a-sample-mean) for some inspiration.
You can do this differently if you like. Use of these hints is not required.
```{r}
beta_hat_1 = rep(0, 1500)
num_obs = 25
beta_0 = 5
beta_1 = -3
sigma = sqrt(10.24)
for (i in seq_along(beta_hat_1)) {
  # simulate new y values each time, using the known x values from part (a)
  epsilon = rnorm(n = num_obs, mean = 0, sd = sigma)
  y = beta_0 + beta_1 * x + epsilon
  sim_fit = lm(y ~ x)
  beta_hat_1[i] = coef(sim_fit)[2]
}
```
**(e)** Report the mean and standard deviation of `beta_hat_1`. Do either of these look familiar?
```{r}
mean(beta_hat_1)
sd(beta_hat_1)
```
**The mean of `beta_hat_1` is nearly identical to the true $\beta_1 = -3$, and its standard deviation approximates the true standard deviation of $\hat{\beta_1}$.**
**(f)** Plot a histogram of `beta_hat_1`. Comment on the shape of this histogram.
```{r}
hist(beta_hat_1, col = "blue")
```
**The shape is roughly symmetric (approximately normal), centered at $-3$, the true value of $\beta_1$.**
***
## Exercise 4 (Be a Skeptic)
Consider the model
\[
Y_i = 3 + 0 \cdot x_i + \epsilon_i
\]
with
\[
\epsilon_i \sim N(\mu = 0, \sigma^2 = 4)
\]
where $\beta_0 = 3$ and $\beta_1 = 0$.
Before answering the following parts, set a seed value equal to **your** birthday, as was done in the previous exercise.
```{r}
birthday = 18760613
set.seed(birthday)
```
**(a)** Use `R` to repeat the process of simulating `n = 75` observations from the above model $2500$ times. For the remainder of this exercise, use the following "known" values of $x$.
```{r}
x = runif(n = 75, 0, 10)
beta_hat_1 = rep(0, 2500)
num_obs = 75
beta_0 = 3
beta_1 = 0
sigma = sqrt(4)
for (i in seq_along(beta_hat_1)) {
  # simulate new y values each time, using the known x values above
  epsilon = rnorm(n = num_obs, mean = 0, sd = sigma)
  y = beta_0 + beta_1 * x + epsilon
  sim_fit = lm(y ~ x)
  beta_hat_1[i] = coef(sim_fit)[2]
}
```
Each time, fit an SLR model to the data and store the value of $\hat{\beta_1}$ in a variable called `beta_hat_1`. You may use [the `sim_slr` function provided in the text](http://daviddalpiaz.github.io/appliedstats/simple-linear-regression.html#simulating-slr). Hint: Yes, $\beta_1 = 0$.
**(b)** Plot a histogram of `beta_hat_1`. Comment on the shape of this histogram.
```{r}
hist(beta_hat_1, col = "blue")
```
**The shape is roughly symmetric (approximately normal), centered at $0$, the true value of $\beta_1$.**
**(c)** Import the data in [`skeptic.csv`](skeptic.csv) and fit a SLR model. The variable names in `skeptic.csv` follow the same convention as those returned by `sim_slr()`. Extract the fitted coefficient for $\beta_1$.
```{r include=FALSE}
library(readr)
skeptic <- read_csv("skeptic.csv")
```
```{r}
skeptic_model = lm(response ~ predictor, data = skeptic)
c = coef(skeptic_model)[2]
c
```
**(d)** Re-plot the histogram from **(b)**. Now add a vertical red line at the value of $\hat{\beta_1}$ in part **(c)**. To do so, you'll need to use `abline(v = c, col = "red")` where `c` is your value.
```{r}
hist(beta_hat_1, col = "blue")
abline(v = c, col = "red")
```
**(e)** Your value of $\hat{\beta_1}$ in **(c)** should be negative. What proportion of the `beta_hat_1` values is smaller than your $\hat{\beta_1}$? Return this proportion, as well as this proportion multiplied by `2`.
```{r}
prop = mean(beta_hat_1 < c)
prop
2 * prop
```
**(f)** Based on your histogram and part **(e)**, do you think the [`skeptic.csv`](skeptic.csv) data could have been generated by the model given above? Briefly explain.
***
## Exercise 5 (Comparing Models)
For this exercise we will use the `Ozone` dataset from the `mlbench` package. You should use `?Ozone` to learn about the background of this dataset. You may need to install the `mlbench` package. If you do so, do not include code to install the package in your `R` Markdown document.
For simplicity, we will perform some data cleaning before proceeding.
```{r}
library(mlbench)
data(Ozone, package = "mlbench")
Ozone = Ozone[, c(4, 6, 7, 8)]
colnames(Ozone) = c("ozone", "wind", "humidity", "temp")
Ozone = Ozone[complete.cases(Ozone), ]
```
We have:
- Loaded the data from the package
- Subset the data to relevant variables
- This is not really necessary (or perhaps a good idea) but it makes the next step easier
- Given variables useful names
- Removed any observation with missing values
- This should be given much more thought in practice
For this exercise we will define the "Root Mean Square Error" of a model as
\[
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n}(y_i - \hat{y}_i)^2}.
\]
**(a)** Fit three SLR models, each with "ozone" as the response. For the predictor, use "wind speed," "humidity percentage," and "temperature" respectively. For each, calculate $\text{RMSE}$ and $R^2$. Arrange the results in a markdown table, with a row for each model. Suggestion: Create a data frame that stores the results, then investigate the `kable()` function from the `knitr` package.
```{r}
library(knitr)

# helper: RMSE of a fitted model against the observed response
rmse = function(model, actual) {
  sqrt(mean((actual - fitted(model)) ^ 2))
}

ozone_model1 = lm(ozone ~ wind, data = Ozone)
ozone_model2 = lm(ozone ~ humidity, data = Ozone)
ozone_model3 = lm(ozone ~ temp, data = Ozone)

output = data.frame(
  Predictor = c("wind speed", "humidity", "temperature"),
  RMSE = c(rmse(ozone_model1, Ozone$ozone),
           rmse(ozone_model2, Ozone$ozone),
           rmse(ozone_model3, Ozone$ozone)),
  Rsquared = c(summary(ozone_model1)$r.squared,
               summary(ozone_model2)$r.squared,
               summary(ozone_model3)$r.squared)
)
kable(output)
```
**(b)** Based on the results, which of the three predictors used is most helpful for predicting ozone readings? Briefly explain.
**Temperature is the most helpful predictor of ozone readings: its model has the lowest RMSE and the highest $R^2$ of the three.**
***
## Exercise 00 (SLR without Intercept)
**This exercise will _not_ be graded and is simply provided for your information. No credit will be given for the completion of this exercise. Give it a try now, and be sure to read the solutions later.**
Sometimes it can be reasonable to assume that $\beta_0$ should be 0. That is, the line should pass through the point $(0, 0)$. For example, if a car is traveling 0 miles per hour, its stopping distance should be 0! (Unlike what we saw in the book.)
We can simply define a model without an intercept,
\[
Y_i = \beta x_i + \epsilon_i.
\]
**(a)** [In the **Least Squares Approach** section of the text](http://daviddalpiaz.github.io/appliedstats/simple-linear-regression.html#least-squares-approach) you saw the calculus behind the derivation of the regression estimates, and then we performed the calculation for the `cars` dataset using `R`. Here you need to do, but not show, the derivation for the slope only model. You should then use that derivation of $\hat{\beta}$ to write a function that performs the calculation for the estimate you derived.
In summary, use the method of least squares to derive an estimate for $\beta$ using data points $(x_i, y_i)$ for $i = 1, 2, \ldots n$. Simply put, find the value of $\beta$ to minimize the function
\[
f(\beta)=\sum_{i=1}^{n}(y_{i}-\beta x_{i})^{2}.
\]
Then, write a function `get_beta_no_int` that takes input:
- `x` - A predictor variable
- `y` - A response variable
The function should then output the $\hat{\beta}$ you derived for a given set of data.
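A minimal sketch of such a function, assuming the least-squares solution $\hat{\beta} = \sum x_i y_i / \sum x_i^2$ that part **(b)** asks you to derive:

```r
# least squares estimate for the no-intercept model Y_i = beta * x_i + epsilon_i
# minimizing f(beta) = sum((y - beta * x)^2) gives
# beta_hat = sum(x * y) / sum(x ^ 2)
get_beta_no_int = function(x, y) {
  sum(x * y) / sum(x ^ 2)
}
```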
**(b)** Write your derivation in your `.Rmd` file using TeX. Or write your derivation by hand, scan or photograph your work, and insert it into the `.Rmd` as an image. See the [RMarkdown documentation](http://rmarkdown.rstudio.com/) for working with images.
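One possible sketch of the derivation: differentiate $f(\beta)$ with respect to $\beta$ and set the result to zero,

\[
f'(\beta) = -2 \sum_{i=1}^{n} x_i (y_i - \beta x_i) = 0
\implies \sum_{i=1}^{n} x_i y_i = \beta \sum_{i=1}^{n} x_i^2
\implies \hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}.
\]

Since $f''(\beta) = 2 \sum_{i=1}^{n} x_i^2 > 0$, this critical point is indeed a minimum.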
**(c)** Test your function on the `cats` data using body weight as `x` and heart weight as `y`. What is the estimate for $\beta$ for this data?
**(d)** Check your work in `R`. The following syntax can be used to fit a model without an intercept:
```{r, eval = FALSE}
lm(response ~ 0 + predictor, data = dataset)
```
Use this to fit a model to the `cats` data without an intercept. Output the coefficient of the fitted model. It should match your answer to **(c)**.
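A minimal sketch of this check on the `cats` data:

```r
library(MASS)  # provides the cats dataset

# fit the no-intercept model and report its single coefficient
no_int_model = lm(Hwt ~ 0 + Bwt, data = cats)
coef(no_int_model)
```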