-
Notifications
You must be signed in to change notification settings - Fork 1
/
w04-hw-assign-balajis2.Rmd
507 lines (364 loc) · 21.1 KB
/
w04-hw-assign-balajis2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
---
title: "Week 4 - Homework"
author: "STAT 420, Summer 2019, BALAJI SATHYAMURTHY (BALAJIS2)"
date: ''
output:
html_document:
toc: yes
pdf_document: default
urlcolor: cyan
---
# Directions
- Be sure to remove this section if you use this `.Rmd` file as a template.
- You may leave the questions in your final document.
***
## Exercise 1 (Using `lm`)
For this exercise we will use the data stored in [`nutrition-2018.csv`](nutrition-2018.csv). It contains the nutritional values per serving size for a large variety of foods as calculated by the USDA in 2018. It is a cleaned version totaling 5956 observations and is current as of April 2018.
The variables in the dataset are:
- `ID`
- `Desc` - short description of food
- `Water` - in grams
- `Calories`
- `Protein` - in grams
- `Fat` - in grams
- `Carbs` - carbohydrates, in grams
- `Fiber` - in grams
- `Sugar` - in grams
- `Calcium` - in milligrams
- `Potassium` - in milligrams
- `Sodium` - in milligrams
- `VitaminC` - vitamin C, in milligrams
- `Chol` - cholesterol, in milligrams
- `Portion` - description of standard serving size used in analysis
**(a)** Fit the following multiple linear regression model in `R`. Use `Calories` as the response and `Fat`, `Sugar`, and `Sodium` as predictors.
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i.
\]
Here,
- $Y_i$ is `Calories`.
- $x_{i1}$ is `Fat`.
- $x_{i2}$ is `Sugar`.
- $x_{i3}$ is `Sodium`.
```{r include=FALSE}
library(readr)
nutrition_2018 = read_csv("nutrition-2018.csv")
```
```{r}
nutrition_2018_model = lm(Calories~Fat+Sugar+Sodium,data = nutrition_2018)
fstatistic_value = summary(nutrition_2018_model)$fstatistic[1]
p_value = anova(nutrition_2018_model)$'Pr(>F)'[1]
```
Use an $F$-test to test the significance of the regression. Report the following:
- The null and alternative hypotheses:
$H_0$: $\beta_0 = \beta_1 = \beta_2 = \beta_3 = 0$
$H_1$: Atleast one of $\beta_j$ is not zero where j = 1,2,3
- The value of the test statistic is **`r fstatistic_value`**
- The p-value of the test is **`r p_value`**
- A statistical decision at $\alpha = 0.01$: **Reject $H_0$ at $\alpha = 0.01$**
- A conclusion in the context of the problem: **There is a linear relationship between calories and atleast one of the predicators: Fat,sugar and sodium**
When reporting these, you should explicitly state them in your document, not assume that a reader will find and interpret them from a large block of `R` output.
**(b)** Output only the estimated regression coefficients. Interpret all $\hat{\beta}_j$ coefficients in the context of the problem.
```{r}
beta_hat_0 = summary(nutrition_2018_model)$coef["(Intercept)","Estimate"]
beta_hat_1 = summary(nutrition_2018_model)$coef["Fat","Estimate"]
beta_hat_2 = summary(nutrition_2018_model)$coef["Sugar","Estimate"]
beta_hat_3 = summary(nutrition_2018_model)$coef["Sodium","Estimate"]
```
The estimated regression coefficients for $\hat{\beta}_0$ is `r beta_hat_0`, $\hat{\beta}_1$ is `r beta_hat_1`, $\hat{\beta}_2$ is `r beta_hat_2` and $\hat{\beta}_3$ is `r beta_hat_3`
For 0 gram of fat,sugar and sodium, the estimated change in calories is `r beta_hat_0`
For 1 gram of fat with respect to sugar and sodium, the estimated change in calories is `r beta_hat_1`
For 1 gram of sugar with respect to fat and sodium, the estimated change in calories is `r beta_hat_2`
For 1 gram of sodium with respect to fat and sugar, the estimated change in calories is `r beta_hat_3`
**(c)** Use your model to predict the number of `Calories` in a Big Mac. According to [McDonald's publicized nutrition facts](https://www.mcdonalds.com/us/en-us/about-our-food/nutrition-calculator.html), the Big Mac contains 28g of fat, 9g of sugar, and 950mg of sodium.
```{r}
bigmac_data = data.frame(Fat=28,Sugar=9,Sodium=950)
output = predict(nutrition_2018_model,newdata = bigmac_data,interval = "confidence")
```
The predicted number of calories in a big mac is `r output[1,1]`
**(d)** Calculate the standard deviation, $s_y$, for the observed values in the Calories variable. Report the value of $s_e$ from your multiple regression model. Interpret both estimates in the context of this problem.
```{r}
Sy = sd(nutrition_2018$Calories)
Se = summary(nutrition_2018_model)$sigma
```
The standard deviation is `r Sy` and it tells us how the calorie observed values varies about its mean.
The standard error is `r Se` and it tells us how the calorie observed values varies about the fitted regression line
**(e)** Report the value of $R^2$ for the model. Interpret its meaning in the context of the problem.
```{r}
rsquare = summary(nutrition_2018_model)$r.squared
```
The $R^2$ is `r rsquare` and it tells the proportion of the variance of calories with respect to fat, sugar and sodium
**(f)** Calculate a 95% confidence interval for $\beta_2$. Give an interpretation of the interval in the context of the problem.
```{r}
confint(nutrition_2018_model, level = 0.95)[3,]
```
We are 95% confident that the calories for increase in sugar by 1 gram with respect to fat and sodium is in the above data interval.
**(g)** Calculate a 99% confidence interval for $\beta_0$. Give an interpretation of the interval in the context of the problem.
```{r}
confint(nutrition_2018_model, level = 0.99)[1,]
```
We are 99% confident that a calories is in the above interval for 0 grams fat, sugar and sodium.
**(h)** Use a 90% confidence interval to estimate the mean Calorie content of a food with 24g of fat, 0g of sugar, and 350mg of sodium, which is true of a large order of McDonald's french fries. Interpret the interval in context.
```{r}
large_order = data.frame(Fat = 24, Sugar = 0, Sodium = 350)
predict(nutrition_2018_model, newdata = large_order, interval = "confidence", level = 0.90)
```
We are 90% confident for 24 grams fat, 0 grams sugar and 350 grams sodium the calories is in the above interval.
**(i)** Use a 90% prediction interval to predict the Calorie content of a Taco Bell Crunchwrap Supreme that has 21g of fat, 6g of sugar, and 1200mg of sodium. Interpret the interval in context.
```{r}
tacobelldata = data.frame(Fat = 21, Sugar = 6, Sodium = 1200)
predict(nutrition_2018_model, newdata = tacobelldata, interval = "prediction", level = 0.90)
```
We are 90% confident for 21 grams fat, 6 grams sugar and 1200 grams sodium the calories is in the above interval.
***
## Exercise 2 (More `lm` for Multiple Regression)
For this exercise we will use the data stored in [`goalies.csv`](goalies.csv). It contains career data for 462 players in the National Hockey League who played goaltender at some point up to and including the 2014-2015 season. The variables in the dataset are:
- `W` - Wins
- `GA` - Goals Against
- `SA` - Shots Against
- `SV` - Saves
- `SV_PCT` - Save Percentage
- `GAA` - Goals Against Average
- `SO` - Shutouts
- `MIN` - Minutes
- `PIM` - Penalties in Minutes
For this exercise we will consider three models, each with Wins as the response. The predictors for these models are:
- Model 1: Goals Against, Saves
- Model 2: Goals Against, Saves, Shots Against, Minutes, Shutouts
- Model 3: All Available
```{r include=FALSE}
library(readr)
goalies = read_csv("goalies.csv")
```
```{r}
goalies_model1 = lm(W~GA+SV,data = goalies)
goalies_model2 = lm(W~GA+SV+SA+MIN+SO,data = goalies)
goalies_model3 = lm(W~GA+SV+SA+MIN+SO+SV_PCT+GAA+PIM,data = goalies)
```
**(a)** Use an $F$-test to compares Models 1 and 2. Report the following:
```{r}
x = anova(goalies_model1,goalies_model2)
x
```
- The null hypothesis: $H_0$: $\beta_3 = \beta_4 = \beta_5 = 0$ where 3 is for SA,4 is for MIN and 5 is for SO
- The value of the test statistic is `r x$F[2]`
- The p-value of the test is `r format(x$'Pr(>F)'[2],scientific = TRUE)`
- A statistical decision at $\alpha = 0.05$: Reject $H_0$ at $\alpha$ = 0.05
- The model you prefer: goalies_model2
**(b)** Use an $F$-test to compare Model 3 to your preferred model from part **(a)**. Report the following:
```{r}
y = anova(goalies_model2,goalies_model3)
y
```
- The null hypothesis: - The null hypothesis: $H_0$: $\beta_6 = \beta_7 = \beta_8 = 0$ where 6 is for SV_PCT,7 is for GAA and 8 is for PIM
- The value of the test statistic is `r y$F[2]`
- The p-value of the test is `r format(y$'Pr(>F)'[2],scientific = TRUE)`
- A statistical decision at $\alpha = 0.05$: Reject $H_0$ at $\alpha$ = 0.05
- The model you prefer: goalies_model3
**(c)** Use a $t$-test to test $H_0: \beta_{\texttt{SV}} = 0 \ \text{vs} \ H_1: \beta_{\texttt{SV}} \neq 0$ for the model you preferred in part **(b)**. Report the following:
```{r}
z = summary(goalies_model3)
tvalue = z$coef["SV","t value"]
pvalue = z$coef["SV","Pr(>|t|)"]
```
- The value of the test statistic is `r tvalue `
- The p-value of the test is `r format(pvalue,scientific = TRUE)`
- A statistical decision at $\alpha = 0.05$: Reject $H_0$ at $\alpha = 0.05$
***
## Exercise 3 (Regression without `lm`)
For this exercise we will once again use the `Ozone` data from the `mlbench` package. The goal of this exercise is to fit a model with `ozone` as the response and the remaining variables as predictors.
```{r}
data(Ozone, package = "mlbench")
Ozone = Ozone[, c(4, 6, 7, 8)]
colnames(Ozone) = c("ozone", "wind", "humidity", "temp")
Ozone = Ozone[complete.cases(Ozone), ]
```
**(a)** Obtain the estimated regression coefficients **without** the use of `lm()` or any other built-in functions for regression. That is, you should use only matrix operations. Store the results in a vector `beta_hat_no_lm`. To ensure this is a vector, you may need to use `as.vector()`. Return this vector as well as the results of `sum(beta_hat_no_lm ^ 2)`.
By using the below relation:
$\hat\beta$ = ($X^T$ $X)^{-1}$ $X^T$ y
```{r}
y = Ozone$ozone
X = cbind(rep(1,length(y)),Ozone$wind,Ozone$humidity,Ozone$temp)
(C = solve(t(X) %*% X) %*% t(X) %*% y)
beta_hat_no_lm = as.vector(C)
beta_hat_no_lm
sum(beta_hat_no_lm^2)
```
**(b)** Obtain the estimated regression coefficients **with** the use of `lm()`. Store the results in a vector `beta_hat_lm`. To ensure this is a vector, you may need to use `as.vector()`. Return this vector as well as the results of `sum(beta_hat_lm ^ 2)`.
```{r}
ozone_model = lm(ozone~.,data = Ozone)
ozone_beta0 = summary(ozone_model)$coefficients["(Intercept)","Estimate"]
ozone_beta1 = summary(ozone_model)$coefficients["wind","Estimate"]
ozone_beta2 = summary(ozone_model)$coefficients["humidity","Estimate"]
ozone_beta3 = summary(ozone_model)$coefficients["temp","Estimate"]
x = c(ozone_beta0,ozone_beta1,ozone_beta2,ozone_beta3)
beta_hat_lm = as.vector(x)
beta_hat_lm
sum(beta_hat_lm^2)
```
**(c)** Use the `all.equal()` function to verify that the results are the same. You may need to remove the names of one of the vectors. The `as.vector()` function will do this as a side effect, or you can directly use `unname()`.
```{r}
all.equal(beta_hat_no_lm,beta_hat_lm)
```
**(d)** Calculate $s_e$ without the use of `lm()`. That is, continue with your results from **(a)** and perform additional matrix operations to obtain the result. Output this result. Also, verify that this result is the same as the result obtained from `lm()`.
```{r}
n = length(y)
p = length(beta_hat_no_lm)
y_hat = X %*% beta_hat_no_lm
Se = sqrt( sum((y -y_hat)^2 ) / (n-p) )
Se
summary(ozone_model)$sigma
all.equal(Se,summary(ozone_model)$sigma)
```
**(e)** Calculate $R^2$ without the use of `lm()`. That is, continue with your results from **(a)** and **(d)**, and perform additional operations to obtain the result. Output this result. Also, verify that this result is the same as the result obtained from `lm()`.
```{r}
summary(ozone_model)$r.squared
```
***
## Exercise 4 (Regression for Prediction)
For this exercise use the `Auto` dataset from the `ISLR` package. Use `?Auto` to learn about the dataset. The goal of this exercise is to find a model that is useful for **predicting** the response `mpg`. We remove the `name` variable as it is not useful for this analysis. (Also, this is an easier to load version of data from the textbook.)
```{r}
# load required package, remove "name" variable
library(ISLR)
Auto = subset(Auto, select = -c(name))
set.seed(1)
auto_trn_idx = sample(1:nrow(Auto), 292)
auto_trn = Auto[auto_trn_idx, ]
auto_tst = Auto[-auto_trn_idx, ]
```
When evaluating a model for prediction, we often look at RMSE. However, if we both fit the model with all the data as well as evaluate RMSE using all the data, we're essentially cheating. We'd like to use RMSE as a measure of how well the model will predict on *unseen* data. If you haven't already noticed, the way we had been using RMSE resulted in RMSE decreasing as models became larger.
To correct for this, we will only use a portion of the data to fit the model, and then we will use leftover data to evaluate the model. We will call these datasets **train** (for fitting) and **test** (for evaluating). The definition of RMSE will stay the same
\[
\text{RMSE}(\text{model, data}) = \sqrt{\frac{1}{n} \sum_{i = 1}^{n}(y_i - \hat{y}_i)^2}
\]
where
- $y_i$ are the actual values of the response for the given data.
- $\hat{y}_i$ are the predicted values using the fitted model and the predictors from the data.
However, we will now evaluate it on both the **train** set and the **test** set separately. So each model you fit will have a **train** RMSE and a **test** RMSE. When calculating **test** RMSE, the predicted values will be found by predicting the response using the **test** data with the model fit using the **train** data. *__Test__ data should never be used to fit a model.*
- Train RMSE: Model fit with *train* data. Evaluate on **train** data.
- Test RMSE: Model fit with *train* data. Evaluate on **test** data.
Set a seed of `1`, and then split the `Auto` data into two datasets, one called `auto_trn` and one called `auto_tst`. The `auto_trn` data frame should contain 292 randomly chosen observations. The `auto_tst` data will contain the remaining observations. Hint: consider the following code:
Fit a total of five models using the training data.
- One must use all possible predictors.
- One must use only `displacement` as a predictor.
- The remaining three you can pick to be anything you like. One of these should be the *best* of the five for predicting the response.
For each model report the **train** and **test** RMSE. Arrange your results in a well-formatted markdown table. Argue that one of your models is the best for predicting the response.
```{r}
Auto_model1 = lm(mpg~displacement,data = auto_trn) # Only displacement as predictor
Auto_model2 = lm(mpg~.,data = auto_trn) # All predictors
Auto_model3 = lm(mpg~displacement+acceleration+weight,data=auto_trn) #displacement,acceleration and weight as predictors
Auto_model4 = lm(mpg~displacement+acceleration+cylinders+horsepower+year,data = auto_trn) #displacement,acceleration,cylinders,horsepower and year as predictors
Auto_model5 = lm(mpg~displacement+acceleration+cylinders+horsepower+weight, data = auto_trn)#displacement,acceleration,cylinders,horsepower and weight as predictors
rmse_train_model1 = sqrt(mean( (auto_trn$mpg - predict(Auto_model1, auto_trn)) ^ 2) )
rmse_train_model2 = sqrt(mean( (auto_trn$mpg - predict(Auto_model2, auto_trn)) ^ 2) )
rmse_train_model3 = sqrt(mean( (auto_trn$mpg - predict(Auto_model3, auto_trn)) ^ 2) )
rmse_train_model4 = sqrt(mean( (auto_trn$mpg - predict(Auto_model4, auto_trn)) ^ 2) )
rmse_train_model5 = sqrt(mean( (auto_trn$mpg - predict(Auto_model5, auto_trn)) ^ 2) )
rmse_test_model1 = sqrt(mean( (auto_tst$mpg - predict(Auto_model1, auto_tst)) ^ 2) )
rmse_test_model2 = sqrt(mean( (auto_tst$mpg - predict(Auto_model2, auto_tst)) ^ 2) )
rmse_test_model3 = sqrt(mean( (auto_tst$mpg - predict(Auto_model3, auto_tst)) ^ 2) )
rmse_test_model4 = sqrt(mean( (auto_tst$mpg - predict(Auto_model4, auto_tst)) ^ 2) )
rmse_test_model5 = sqrt(mean( (auto_tst$mpg - predict(Auto_model5, auto_tst)) ^ 2) )
Model_No = c("Model1","Model2","Model3","Model4","Model5")
Train_RMSE = c(rmse_train_model1,rmse_train_model2,rmse_train_model3,rmse_train_model4,rmse_train_model5)
Test_RMSE = c(rmse_test_model1,rmse_test_model2,rmse_test_model3,rmse_test_model4,rmse_test_model5)
auto_output = data.frame(Model_No,Train_RMSE,Test_RMSE)
colnames(auto_output) = c("Model No","Train RMSE","Test RMSE")
library(flextable)
library(magrittr)
flextable(auto_output)
```
**Model 2 which uses all the predictors is the best model since the RMSE is less when compared to other 4 models**
***
## Exercise 5 (Simulating Multiple Regression)
For this exercise we will simulate data from the following model:
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \epsilon_i
\]
Where $\epsilon_i \sim N(0, \sigma^2).$ Also, the parameters are known to be:
- $\beta_0 = 2$
- $\beta_1 = -0.75$
- $\beta_2 = 1.5$
- $\beta_3 = 0$
- $\beta_4 = 0$
- $\beta_5 = 2$
- $\sigma^2 = 25$
We will use samples of size `n = 42`.
We will verify the distribution of $\hat{\beta}_2$ as well as investigate some hypothesis tests.
**(a)** We will first generate the $X$ matrix and data frame that will be used throughout the exercise. Create the following nine variables:
- `x0`: a vector of length `n` that contains all `1`
- `x1`: a vector of length `n` that is randomly drawn from a normal distribution with a mean of `0` and a standard deviation of `2`
- `x2`: a vector of length `n` that is randomly drawn from a uniform distribution between `0` and `4`
- `x3`: a vector of length `n` that is randomly drawn from a normal distribution with a mean of `0` and a standard deviation of `1`
- `x4`: a vector of length `n` that is randomly drawn from a uniform distribution between `-2` and `2`
- `x5`: a vector of length `n` that is randomly drawn from a normal distribution with a mean of `0` and a standard deviation of `2`
- `X`: a matrix that contains `x0`, `x1`, `x2`, `x3`, `x4`, and `x5` as its columns
- `C`: the $C$ matrix that is defined as $(X^\top X)^{-1}$
- `y`: a vector of length `n` that contains all `0`
- `sim_data`: a data frame that stores `y` and the **five** *predictor* variables. `y` is currently a placeholder that we will update during the simulation.
Report the sum of the diagonal of `C` as well as the 5th row of `sim_data`. For this exercise we will use the seed `420`. Generate the above variables in the order listed after running the code below to set a seed.
```{r}
set.seed(420)
sample_size = 42
x0 = rep(1,sample_size)
x1 = rnorm(n = sample_size,mean = 0,sd = 2)
x2 = runif(n = sample_size,min = 0,max = 4)
x3 = rnorm(n = sample_size,mean = 0,sd = 1)
x4 = runif(n = sample_size,min = -2,max = 2)
x5 = rnorm(n = sample_size,mean = 0,sd = 2)
X = cbind(x0,x1,x2,x3,x4,x5)
C = solve(t(X) %*% X)
y = rep(0,sample_size)
sim_data = data.frame(y,x1,x2,x3,x4,x5)
diag_sum_C = sum(diag(C))
row5_simdata = sim_data[5,]
```
The sum of leading diagonal of C matrix is `r diag_sum_C`
The 5th row of the sim data is displayed below
`r row5_simdata`
**(b)** Create three vectors of length `2500` that will store results from the simulation in part **(c)**. Call them `beta_hat_1`, `beta_3_pval`, and `beta_5_pval`.
```{r}
beta_hat_1 = rep(0,2500)
beta_3_pval = rep(0,2500)
beta_5_pval = rep(0,2500)
```
**(c)** Simulate 2500 samples of size `n = 42` from the model above. Each time update the `y` value of `sim_data`. Then use `lm()` to fit a multiple regression model. Each time store:
- The value of $\hat{\beta}_1$ in `beta_hat_1`
- The p-value for the two-sided test of $\beta_3 = 0$ in `beta_3_pval`
- The p-value for the two-sided test of $\beta_5 = 0$ in `beta_5_pval`
```{r}
beta0 = 2
beta1 = -0.75
beta2 = 1.5
beta3 = 0
beta4 = 0
beta5 = 2
sigma = sqrt(25)
for(sim_cnt in 1:2500)
{
epsilon = rnorm(n = sample_size,mean = 0, sd = sigma)
y_value = beta0 * x0 + beta1 * x1 + beta2 * x2 + beta3 * x3 + beta4 * x4 + beta5 * x5 + epsilon
sim_data$y = y_value
sim_model = lm(y~.,data = sim_data)
beta_hat_1[sim_cnt] = summary(sim_model)$coefficients["x1","Estimate"]
beta_3_pval[sim_cnt] = summary(sim_model)$coefficients["x3","Pr(>|t|)"]
beta_5_pval[sim_cnt] = summary(sim_model)$coefficients["x5","Pr(>|t|)"]
}
```
**(d)** Based on the known values of $X$, what is the true distribution of $\hat{\beta}_1$?
$\hat\beta_1$ ~ N( $\beta_1$ , $\sigma^2$ $C_{11}$)
$\hat\beta_1$ ~ N( `r beta1`, `r sigma^2 * C[1+1,1+1]`)
**(e)** Calculate the mean and variance of `beta_hat_1`. Are they close to what we would expect? Plot a histogram of `beta_hat_1`. Add a curve for the true distribution of $\hat{\beta}_1$. Does the curve seem to match the histogram?
```{r}
mean(beta_hat_1)
var(beta_hat_1)
```
The mean of beta_hat_1 is `r mean(beta_hat_1)`
The variance of beta_hat_1 is `r var(beta_hat_1)`
```{r}
hist(beta_hat_1, prob = TRUE,xlab = "beta_hat1", main = "Distribution of beta_hat1", col = "blue")
curve(dnorm(x,beta1,sqrt(sigma^2 * C[1+1,1+1])), col = "orange", add = TRUE, lwd = 3)
```
**(f)** What proportion of the p-values stored in `beta_3_pval` is less than 0.10? Is this what you would expect?
The proportion of p-values stored in beta_3_pval which is less than 0.10 is `r mean(beta_3_pval < 0.10)`. No. since beta3 is zero the p-values should be very very small value.
**(g)** What proportion of the p-values stored in `beta_5_pval` is less than 0.01? Is this what you would expect?
The proportion of p-values stored in beta_3_pval which is less than 0.10 is `r mean(beta_5_pval < 0.01)`. Yes. since beta5 is non zero.