---
title: "Week 8 - Homework"
author: "STAT 420, Summer 2018, BALAJI SATHYAMURTHY (BALAJIS2)"
date: ''
output:
  html_document:
    toc: yes
  pdf_document: default
urlcolor: cyan
---
***
```{r setup, echo = FALSE, message = FALSE, warning = FALSE}
options(scipen = 1, digits = 4, width = 80)
knitr::opts_chunk$set(fig.align = "center")
```
## Exercise 1 (Writing Functions)
**(a)** Write a function named `diagnostics` that takes as input the arguments:
- `model`, an object of class `lm`, that is, a model fit via `lm()`
- `pcol`, for controlling point colors in plots, with a default value of `grey`
- `lcol`, for controlling line colors in plots, with a default value of `dodgerblue`
- `alpha`, the significance level of any test that will be performed inside the function, with a default value of `0.05`
- `plotit`, a logical value for controlling display of plots with default value `TRUE`
- `testit`, a logical value for controlling outputting the results of tests with default value `TRUE`
The function should output:
- A list with two elements when `testit` is `TRUE`:
- `p_val`, the p-value for the Shapiro-Wilk test for assessing normality
- `decision`, the decision made when performing the Shapiro-Wilk test using the `alpha` value input to the function. "Reject" if the null hypothesis is rejected, otherwise "Fail to Reject."
- Two plots, side-by-side, when `plotit` is `TRUE`:
- A fitted versus residuals plot that adds a horizontal line at $y = 0$, and labels the $x$-axis "Fitted" and the $y$-axis "Residuals." The points and line should be colored according to the input arguments. Give the plot a title.
- A Normal Q-Q plot of the residuals that adds the appropriate line using `qqline()`. The points and line should be colored according to the input arguments. Be sure the plot has a title.
Consider using this function to help with the remainder of the assignment as well.
```{r}
diagnostics = function(model, pcol = "grey", lcol = "dodgerblue", alpha = 0.05,
                       plotit = TRUE, testit = TRUE) {
  if (plotit) {
    par(mfrow = c(1, 2))
    # fitted versus residuals plot
    plot(fitted(model), resid(model), col = pcol,
         xlab = "Fitted", ylab = "Residuals", main = "Fitted vs. Residuals Plot")
    abline(h = 0, col = lcol, lwd = 2)
    # normal Q-Q plot of the residuals
    qqnorm(resid(model), col = pcol, main = "Normal Q-Q Plot")
    qqline(resid(model), lty = 2, lwd = 2, col = lcol)
  }
  if (testit) {
    # Shapiro-Wilk test for normality of the residuals
    p_val = shapiro.test(resid(model))$p.value
    decision = ifelse(p_val < alpha, "Reject", "Fail to Reject")
    list(p_val = p_val, decision = decision)
  }
}
```
**(b)** Run the following code.
```{r}
set.seed(420)
data_1 = data.frame(x = runif(n = 30, min = 0, max = 10),
y = rep(x = 0, times = 30))
data_1$y = with(data_1, 2 + 1 * x + rexp(n = 30))
fit_1 = lm(y ~ x, data = data_1)
data_2 = data.frame(x = runif(n = 20, min = 0, max = 10),
y = rep(x = 0, times = 20))
data_2$y = with(data_2, 5 + 2 * x + rnorm(n = 20))
fit_2 = lm(y ~ x, data = data_2)
data_3 = data.frame(x = runif(n = 40, min = 0, max = 10),
y = rep(x = 0, times = 40))
data_3$y = with(data_3, 2 + 1 * x + rnorm(n = 40, sd = x))
fit_3 = lm(y ~ x, data = data_3)
```
```{r fig.height=5, fig.width=10}
diagnostics(fit_1, plotit = FALSE)$p_val
diagnostics(fit_2, plotit = FALSE)$decision
diagnostics(fit_1, testit = FALSE, pcol = "black", lcol = "black")
diagnostics(fit_2, testit = FALSE, pcol = "grey", lcol = "green")
diagnostics(fit_3)
```
***
## Exercise 2 (Prostate Cancer Data)
For this exercise, we will use the `prostate` data, which can be found in the `faraway` package. After loading the `faraway` package, use `?prostate` to learn about this dataset.
```{r, message = FALSE, warning = FALSE}
library(faraway)
```
**(a)** Fit an additive multiple regression model with `lpsa` as the response and the remaining variables in the `prostate` dataset as predictors. Report the $R^2$ value for this model.
```{r}
lpsa_add_model = lm(lpsa~lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45,data = prostate)
lpsa_rsquared = summary(lpsa_add_model)$r.squared
```
The $R^2$ of the additive model is `r lpsa_rsquared`.
**(b)** Check the constant variance assumption for this model. Do you feel it has been violated? Justify your answer.
```{r}
library(lmtest)
bptest_pval = bptest(lpsa_add_model)$p.value
```
The p-value of the Breusch-Pagan test is `r bptest_pval`, which is large. At $\alpha = 0.05$ we fail to reject the null hypothesis of homoscedasticity, so the constant variance assumption does not appear to be violated.
**(c)** Check the normality assumption for this model. Do you feel it has been violated? Justify your answer.
```{r}
shptst_pval = shapiro.test(resid(lpsa_add_model))$p.value
```
The p-value of the Shapiro-Wilk test is `r shptst_pval`, which is large. At $\alpha = 0.05$ we fail to reject the null hypothesis of normality, so the normality assumption does not appear to be violated.
**(d)** Check for any high leverage observations. Report any observations you determine to have high leverage.
```{r}
prostate_data_idx = as.vector( which(hatvalues(lpsa_add_model) > 2 * mean(hatvalues(lpsa_add_model)) ))
prostate[prostate_data_idx,]
```
There are `r length(prostate_data_idx)` observations with high leverage (hat value greater than twice the mean hat value) in the dataset.
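For reference, a quick sketch comparing the leverages of the flagged observations with the $2\bar{h}$ threshold used above:
```{r}
# hat values of the flagged observations versus the 2 * mean(hat value) cutoff
hatvalues(lpsa_add_model)[prostate_data_idx]
2 * mean(hatvalues(lpsa_add_model))
```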
**(e)** Check for any influential observations. Report any observations you determine to be influential.
```{r}
prostate_influential_idx = as.vector( which(cooks.distance(lpsa_add_model) > 4 / length(cooks.distance(lpsa_add_model)) ) )
prostate_influential_idx
```
Observations `r prostate_influential_idx` are influential (Cook's distance greater than $4/n$).
**(f)** Refit the additive multiple regression model without any points you identified as influential. Compare the coefficients of this fitted model to the previously fitted model.
```{r}
prostate_data_no_inf = prostate[-prostate_influential_idx,]
lpsa_add_removed_model = lm(lpsa~lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45,data = prostate_data_no_inf)
coef(lpsa_add_model)
coef(lpsa_add_removed_model)
```
After removing the influential observations, the coefficient estimates change noticeably: the intercept becomes negative, and the remaining coefficients shift by substantial amounts.
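For a more direct comparison, the two sets of estimates can be placed side by side (a minimal sketch; the column names `original` and `no_influential` are just illustrative labels):
```{r}
# side-by-side comparison of the coefficient estimates
data.frame(
  original       = coef(lpsa_add_model),
  no_influential = coef(lpsa_add_removed_model)
)
```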
**(g)** Create a data frame that stores the observations that were "removed" because they were influential. Use the two models you have fit to make predictions with these observations. Comment on the difference between these two sets of predictions.
```{r}
prostate_data_inf = prostate[prostate_influential_idx,]
pred_full   = predict(lpsa_add_model, newdata = prostate_data_inf)
pred_no_inf = predict(lpsa_add_removed_model, newdata = prostate_data_inf)
```
Predictions for the removed (influential) observations using the original model: `r pred_full`

Predictions for the same observations using the model refit without them: `r pred_no_inf`
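To make the comparison concrete, a minimal sketch that lines up the observed responses with both sets of predictions (using the objects defined above):
```{r}
# observed lpsa values and both sets of predictions for the removed observations
data.frame(
  observed    = prostate_data_inf$lpsa,
  pred_full   = pred_full,
  pred_no_inf = pred_no_inf,
  difference  = pred_full - pred_no_inf
)
```
The size of the `difference` column indicates how strongly these points pulled on the original fit.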
***
## Exercise 3 (Why Bother?)
**Why** do we care about violations of assumptions? One key reason is that the distributions of the parameter estimators that we have used are all reliant on these assumptions. When the assumptions are violated, the distributional results are not correct, so our tests are garbage. **Garbage In, Garbage Out!**
Consider the following setup that we will use for the remainder of the exercise. We choose a sample size of 50.
```{r}
n = 50
set.seed(420)
x_1 = runif(n, 0, 5)
x_2 = runif(n, -2, 2)
```
Consider the model,
\[
Y = 4 + 1 x_1 + 0 x_2 + \epsilon.
\]
That is,
- $\beta_0$ = 4
- $\beta_1$ = 1
- $\beta_2$ = 0
We now simulate `y_1` in a manner that does **not** violate any assumptions, which we will verify. In this case $\epsilon \sim N(0, 1).$
```{r}
set.seed(1)
y_1 = 4 + 1 * x_1 + 0 * x_2 + rnorm(n = n, mean = 0, sd = 1)
fit_1 = lm(y_1 ~ x_1 + x_2)
bptest(fit_1)
```
Then, we simulate `y_2` in a manner that **does** violate assumptions, which we again verify. In this case $\epsilon \sim N(0, \sigma = |x_2|).$
```{r}
set.seed(1)
y_2 = 4 + 1 * x_1 + 0 * x_2 + rnorm(n = n, mean = 0, sd = abs(x_2))
fit_2 = lm(y_2 ~ x_1 + x_2)
bptest(fit_2)
```
**(a)** Use the following code after changing `birthday` to your birthday.
```{r}
num_sims = 2500
p_val_1 = rep(0, num_sims)
p_val_2 = rep(0, num_sims)
birthday = 19830502
set.seed(birthday)
```
Repeat the above process of generating `y_1` and `y_2` as defined above, and fit models with each as the response `2500` times. Each time, store the p-value for testing,
\[
\beta_2 = 0,
\]
using both models, in the appropriate variables defined above. (You do not need to use a data frame as we have in the past. Although, feel free to modify the code to instead use a data frame.)
**(b)** What proportion of the `p_val_1` values is less than 0.01? Less than 0.05? Less than 0.10? What proportion of the `p_val_2` values is less than 0.01? Less than 0.05? Less than 0.10? Arrange your results in a table. Briefly explain these results.
```{r}
library(flextable)
for (i in 1:num_sims) {
  # simulate under the model where all assumptions hold
  y_1 = 4 + 1 * x_1 + 0 * x_2 + rnorm(n = n, mean = 0, sd = 1)
  fit_1 = lm(y_1 ~ x_1 + x_2)
  # simulate under the model with non-constant variance
  y_2 = 4 + 1 * x_1 + 0 * x_2 + rnorm(n = n, mean = 0, sd = abs(x_2))
  fit_2 = lm(y_2 ~ x_1 + x_2)
  # store the p-values for testing beta_2 = 0 in each model
  p_val_1[i] = summary(fit_1)$coefficients["x_2", "Pr(>|t|)"]
  p_val_2[i] = summary(fit_2)$coefficients["x_2", "Pr(>|t|)"]
}
df_p_val = data.frame(
  "cutoff" = c("< 0.01", "< 0.05", "< 0.10"),
  "y_1"    = c(mean(p_val_1 < 0.01), mean(p_val_1 < 0.05), mean(p_val_1 < 0.10)),
  "y_2"    = c(mean(p_val_2 < 0.01), mean(p_val_2 < 0.05), mean(p_val_2 < 0.10))
)
flextable(df_p_val)
```
Based on the table above, for `y_1` (where all assumptions hold and $\beta_2 = 0$ is true) the proportion of p-values below each cutoff is close to the cutoff itself, which is what a valid test should produce. For `y_2` (where the constant variance assumption is violated) the proportions are well above the cutoffs, so the test for $\beta_2 = 0$ rejects far more often than it should; the distributional results the test relies on are no longer correct, and its conclusions are unreliable.
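As a visual check of the same result, the simulated p-values can also be plotted (a minimal sketch using base `hist()`); under a true null hypothesis with valid assumptions the p-values should look roughly uniform:
```{r fig.height=5, fig.width=10}
par(mfrow = c(1, 2))
hist(p_val_1, breaks = 20, col = "grey", xlab = "p-value",
     main = "p-values for beta_2 = 0, y_1 (assumptions hold)")
hist(p_val_2, breaks = 20, col = "grey", xlab = "p-value",
     main = "p-values for beta_2 = 0, y_2 (non-constant variance)")
```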
***
## Exercise 4 (Corrosion Data)
For this exercise, we will use the `corrosion` data, which can be found in the `faraway` package. After loading the `faraway` package, use `?corrosion` to learn about this dataset.
```{r, message = FALSE, warning = FALSE}
library(faraway)
```
**(a)** Fit a simple linear regression with `loss` as the response and `Fe` as the predictor. Plot a scatterplot and add the fitted line. Check the assumptions of this model.
```{r}
fit = lm(loss~Fe,data=corrosion)
plot(corrosion$Fe,corrosion$loss,xlab = "Iron%",ylab = "Weight Loss in mg",main="Weight Loss vs. Iron%",col = "orange")
abline(fit, col = "red", lwd = 2);
```
**Checking the assumptions graphically:**
```{r fig.height=5, fig.width=10}
par(mfrow = c(1,2))
plot(fitted(fit),resid(fit),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals plot")
abline(h=0,col="red",lwd=2,lty=2)
qqnorm(resid(fit),col = "orange",main = "Normal Q-Q Plot")
qqline(resid(fit),lty=2,lwd=2,col = "red")
```
Looking at the plots, the fitted versus residuals plot appears reasonable, whereas in the Q-Q plot several points deviate from the reference line. This suggests the equal variance assumption is acceptable, but the normality assumption is somewhat suspect.
**Performing the Breusch-Pagan and Shapiro-Wilk tests**
```{r}
bptest_pval = bptest(fit)$p.value
shptest_pval = shapiro.test(resid(fit))$p.value
```
The p-value of the Breusch-Pagan test is `r bptest_pval` and the p-value of the Shapiro-Wilk test is `r shptest_pval`; both are large, so we fail to reject the null hypothesis in both cases. The formal tests therefore do not indicate a violation of the equal variance or normality assumptions.
**(b)** Fit higher order polynomial models of degree 2, 3, and 4. For each, plot a fitted versus residuals plot and comment on the constant variance assumption. Based on those plots, which of these three models do you think are acceptable? Use a statistical test(s) to compare the models you just chose. Based on the test, which is preferred? Check the normality assumption of this model. Identify any influential observations of this model.
```{r fig.height=5, fig.width=10}
fit_poly_deg2 = lm(loss~Fe+I(Fe^2),data = corrosion)
fit_poly_deg3 = lm(loss~Fe+I(Fe^2)+I(Fe^3),data = corrosion)
fit_poly_deg4 = lm(loss~Fe+I(Fe^2)+I(Fe^3)+I(Fe^4),data = corrosion)
par(mfrow = c(1,3))
plot(fitted(fit_poly_deg2),resid(fit_poly_deg2),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals plot - Degree 2")
abline(h=0,col="red",lwd=2,lty=2)
plot(fitted(fit_poly_deg3),resid(fit_poly_deg3),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals plot - Degree 3")
abline(h=0,col="red",lwd=2,lty=2)
plot(fitted(fit_poly_deg4),resid(fit_poly_deg4),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals plot - Degree 4")
abline(h=0,col="red",lwd=2,lty=2)
```
```{r}
bptest_pval_deg2 = bptest(fit_poly_deg2)$p.value
bptest_pval_deg3 = bptest(fit_poly_deg3)$p.value
bptest_pval_deg4 = bptest(fit_poly_deg4)$p.value
shptest_pval_deg2 = shapiro.test(resid(fit_poly_deg2))$p.value
shptest_pval_deg3 = shapiro.test(resid(fit_poly_deg3))$p.value
shptest_pval_deg4 = shapiro.test(resid(fit_poly_deg4))$p.value
df_stat_test_results = data.frame(
  "Polynomial model"          = c("degree = 2", "degree = 3", "degree = 4"),
  "BP test p-value"           = c(bptest_pval_deg2, bptest_pval_deg3, bptest_pval_deg4),
  "Shapiro-Wilk test p-value" = c(shptest_pval_deg2, shptest_pval_deg3, shptest_pval_deg4),
  check.names = FALSE
)
flextable(df_stat_test_results)
```
From the table above, we fail to reject the null hypothesis of either test for all three polynomial models (degrees 2, 3, and 4). Based on the fitted versus residuals plots, however, the degree 3 model appears preferable to the degree 2 and degree 4 models; the nested $F$-tests sketched below can be used to confirm that choice.
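As a sketch of the requested comparison (assuming the degree 2 and degree 3 fits are the models being compared, with the degree 4 fit included for completeness), nested $F$-tests via `anova()` can be used:
```{r}
# nested F-tests comparing the candidate polynomial fits
anova(fit_poly_deg2, fit_poly_deg3)
anova(fit_poly_deg3, fit_poly_deg4)
```
A small p-value in the first comparison would favor the degree 3 model over the degree 2 model, while a large p-value in the second would suggest the degree 4 term is unnecessary.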
```{r}
which(cooks.distance(fit_poly_deg3) > 4 / length(cooks.distance(fit_poly_deg3)))
```
**There are no influential observations in the degree 3 polynomial model.**
***
## Exercise 5 (Diamonds)
The data set `diamonds` from the `ggplot2` package contains prices and characteristics of 54,000 diamonds. For this exercise, use `price` as the response variable $y$, and `carat` as the predictor $x$. Use `?diamonds` to learn more.
```{r, message = FALSE, warning = FALSE}
library(ggplot2)
```
**(a)** Fit a linear model with `price` as the response variable $y$, and `carat` as the predictor $x$. Return the summary information of this model.
```{r}
diamond_model = lm(price~carat,data =diamonds )
summary(diamond_model)
```
**(b)** Plot a scatterplot of price versus carat and add the line for the fitted model in part **(a)**.
```{r fig.height=5, fig.width=10}
plot(diamonds$carat,diamonds$price,main = "Price vs. Carat",col = "orange",xlab = "Carat",ylab = "Price")
abline(diamond_model, col = "red", lwd = 2)
```
Using a fitted versus residuals plot and/or a Q-Q plot, comment on the diagnostics.
```{r fig.height=5, fig.width=10}
par(mfrow = c(1,2))
plot(fitted(diamond_model),resid(diamond_model),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals Plot")
abline(h=0,col="red",lwd=2,lty=2)
qqnorm(resid(diamond_model),col = "orange",main = "Normal Q-Q Plot")
qqline(resid(diamond_model),lty=2,lwd=2,col = "red")
```
**From the plots above, both the equal variance and normality assumptions are suspect.**
**(c)** Seeing as the price stretches over several orders of magnitude, it seems reasonable to try a log transformation of the response. Fit a model with a logged response, plot a scatterplot of log-price versus carat and add the line for the fitted model, then use a fitted versus residuals plot and/or a Q-Q plot to comment on the diagnostics of the model.
```{r fig.height=5, fig.width=10}
diamond_log_model = lm(log(price) ~ carat, data = diamonds);
plot(diamonds$carat,log(diamonds$price),col = "orange",main = "log(Price) vs. Carat",xlab="Carat",ylab="log(Price)")
abline(diamond_log_model, col = "red", lwd = 2)
par(mfrow = c(1,2))
plot(fitted(diamond_log_model),resid(diamond_log_model),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals plot")
abline(h=0,col="red",lwd=2,lty=2)
qqnorm(resid(diamond_log_model),col = "orange",main = "Normal Q-Q Plot")
qqline(resid(diamond_log_model),lty=2,lwd=2,col = "red")
```
**From the plots above, both the equal variance and normality assumptions are still suspect.** The strongly right-skewed distribution of `price`, shown in the histogram below, is another indication that a log transformation of the response is reasonable.
```{r}
qplot(price, data = diamonds, bins = 30)
```
**(d)** Try adding log transformation of the predictor. Fit a model with a logged response and logged predictor, plot a scatterplot of log-price versus log-carat and add the line for the fitted model, then use a fitted versus residuals plot and/or a Q-Q plot to comment on the diagnostics of the model.
```{r fig.height=5, fig.width=10}
diamond_all_log_model = lm(log(price) ~ log(carat), data = diamonds);
plot(log(diamonds$carat),log(diamonds$price),col = "orange",main = "log(Price) vs. log(Carat)",xlab="log(Carat)",ylab="log(Price)")
abline(diamond_all_log_model, col = "red", lwd = 2)
par(mfrow = c(1,2))
plot(fitted(diamond_all_log_model),resid(diamond_all_log_model),col="orange",xlab = "Fitted",ylab="Residuals",main = "Fitted vs. Residuals plot")
abline(h=0,col="red",lwd=2,lty=2)
qqnorm(resid(diamond_all_log_model),col = "orange",main = "Normal Q-Q Plot")
qqline(resid(diamond_all_log_model),lty=2,lwd=2,col = "red")
```
**From the plots above, both the equal variance and normality assumptions look much better.**
**(e)** Use the model from part **(d)** to predict the price (in dollars) of a 3-carat diamond. Construct a 99% prediction interval for the price (in dollars).
```{r}
new_diamond = data.frame(carat = 3)
pred_interval = exp(predict(diamond_all_log_model, newdata = new_diamond,
                            interval = "prediction", level = 0.99))
price_est = pred_interval[, "fit"]
price_lwr = pred_interval[, "lwr"]
price_upr = pred_interval[, "upr"]
```
The predicted price of a 3-carat diamond is `r price_est` dollars. We are 99% confident that the price of a new 3-carat diamond will fall between `r price_lwr` and `r price_upr` dollars.