w09-hw-balajis2.Rmd

---
title: "Week 9 - Homework"
author: "STAT 420, Summer 2019, BALAJI SATHYAMURTHY (BALAJIS2)"
date: '07/14/2019'
output:
  html_document: 
    toc: yes
  pdf_document: default
urlcolor: cyan
---

***

```{r setup, echo = FALSE, message = FALSE, warning = FALSE}
options(scipen = 1, digits = 4, width = 80, fig.align = "center")
```

## Exercise 1 (`longley` Macroeconomic Data)

The built-in dataset `longley` contains macroeconomic data for predicting employment. We will attempt to model the `Employed` variable.

```{r, eval = FALSE}
View(longley)
?longley
```

**(a)** What is the largest correlation between any pair of predictors in the dataset?

```{r}
cor(longley)[which.max(cor(longley))]
```

The largest correlation between any pair of predictors in the dataset is `r cor(longley)[which.max(cor(longley))]`

**(b)** Fit a model with `Employed` as the response and the remaining variables as predictors. Calculate and report the variance inflation factor (VIF) for each of the predictors. Which variable has the largest VIF? Do any of the VIFs suggest multicollinearity?

```{r}
eco_full_model = lm(Employed~.,data = longley )
car::vif(eco_full_model)
(max_vif = car::vif(eco_full_model)[which.max(car::vif(eco_full_model))])
```

The variable with largest VIF `r max_vif` is GNP

**(c)** What proportion of the observed variation in `Population` is explained by a linear relationship with the other predictors?

```{r}
eco_population_model = lm(Population~.-Employed,data = longley)
(summary(eco_population_model)$r.squared)
```

The proportion of the observed variation in population explained is `r summary(eco_population_model)$r.squared`

**(d)** Calculate the partial correlation coefficient for `Population` and `Employed` **with the effects of the other predictors removed**.

```{r}
eco_employed_model = lm(Employed~.-Population,data = longley )
x = cor(resid(eco_population_model),resid(eco_employed_model))
```

The partial correlation coefficient for `Population` and `Employed` with the effects of the other predictors removed is `r x`

**(e)** Fit a new model with `Employed` as the response and the predictors from the model in **(b)** that were significant. (Use $\alpha = 0.05$.) Calculate and report the variance inflation factor for each of the predictors. Which variable has the largest VIF? Do any of the VIFs suggest multicollinearity?

```{r}
summary(eco_full_model)$coefficients[,"Pr(>|t|)"] < 0.05
eco_new_model = lm(Employed~Unemployed+Armed.Forces+Year,data = longley)
max_vif = car::vif(eco_new_model)[which.max(car::vif(eco_new_model))]
```

The variable with the largest VIF `r max_vif` is Year

**(f)** Use an $F$-test to compare the models in parts **(b)** and **(e)**. Report the following:

- The null hypothesis
  $\beta_{GNP.deflator}$ = $\beta_{GNP}$ = $\beta_{Population}$ = 0

- The test statistic
```{r}
x = anova(eco_full_model,eco_new_model)[2,"F"]
```

The test statistic is `r x`

- The distribution of the test statistic under the null hypothesis
```{r}
anova(eco_full_model,eco_new_model)
```

- The p-value
```{r}
x = anova(eco_full_model,eco_new_model)[2,"Pr(>F)"]
```

The p-value is `r x`

- A decision
`Failed to reject the null hypothesis`

- Which model you prefer, **(b)** or **(e)**
`The model with only Unemployed, Armed.Forces and Year as predictors`

**(g)** Check the assumptions of the model chosen in part **(f)**. Do any assumptions appear to be violated?

```{r, echo = FALSE}
plot_fitted_resid = function(model, pointcol = "dodgerblue", linecol = "darkorange") {
  plot(fitted(model), resid(model), 
       col = pointcol, pch = 20, cex = 1.5,
       xlab = "Fitted", ylab = "Residuals")
  abline(h = 0, col = linecol, lwd = 2)
}

plot_qq = function(model, pointcol = "dodgerblue", linecol = "darkorange") {
  qqnorm(resid(model), col = pointcol, pch = 20, cex = 1.5)
  qqline(resid(model), col = linecol, lwd = 2)
}

par(mfrow = c(1,2))
plot_fitted_resid(eco_new_model)
plot_qq(eco_new_model)
```

None of the assumptions seemed to be violated as per the plots.

***

## Exercise 2 (`Credit` Data)

For this exercise, use the `Credit` data from the `ISLR` package. Use the following code to remove the `ID` variable which is not useful for modeling.

```{r}
library(ISLR)
data(Credit)
Credit = subset(Credit, select = -c(ID))
```

Use `?Credit` to learn about this dataset.

**(a)** Find a "good" model for `balance` using the available predictors. Use any methods seen in class except transformations of the response. The model should:

- Reach a LOOCV-RMSE below `135`
- Obtain an adjusted $R^2$ above `0.90`
- Fail to reject the Breusch-Pagan test with an $\alpha$ of $0.01$
- Use fewer than 10 $\beta$ parameters

```{r}
balance_all_model = lm(Balance ~ ., data = Credit);
balance_model_back_aic = step(balance_all_model,direction = "backward")
balance_model_backward_aic = lm(Balance ~ Income + Limit + Rating + Cards + Age + Student,data = Credit)
mod_a = balance_model_backward_aic
```

Store your model in a variable called `mod_a`. Run the two given chunks to verify your model meets the requested criteria. If you cannot find a model that meets all criteria, partial credit will be given for meeting at least some of the criteria.

```{r message=FALSE, warning=FALSE}
library(lmtest)

get_bp_decision = function(model, alpha) {
  decide = unname(bptest(model)$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_sw_decision = function(model, alpha) {
  decide = unname(shapiro.test(resid(model))$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_num_params = function(model) {
  length(coef(model))
}

get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

get_adj_r2 = function(model) {
  summary(model)$adj.r.squared
}
```


```{r}
get_loocv_rmse(mod_a)
get_adj_r2(mod_a)
get_bp_decision(mod_a, alpha = 0.01)
get_num_params(mod_a)
```

`The model above satisfies only 3 out of the 4 criterias mentioned.`

**(b)** Find another "good" model for `balance` using the available predictors. Use any methods seen in class except transformations of the response. The model should:

- Reach a LOOCV-RMSE below `125`
- Obtain an adjusted $R^2$ above `0.91`
- Fail to reject the Shapiro-Wilk test with an $\alpha$ of $0.01$
- Use fewer than 25 $\beta$ parameters

Store your model in a variable called `mod_b`. Run the two given chunks to verify your model meets the requested criteria. If you cannot find a model that meets all criteria, partial credit will be given for meeting at least some of the criteria.

`Added log to the limit variable`

```{r}
mod_b = lm(Balance ~ Income + log(Limit) + Rating + Cards + Age + Student,data = Credit)
```

```{r, message = FALSE, warning = FALSE}
library(lmtest)

get_bp_decision = function(model, alpha) {
  decide = unname(bptest(model)$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_sw_decision = function(model, alpha) {
  decide = unname(shapiro.test(resid(model))$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

get_num_params = function(model) {
  length(coef(model))
}

get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

get_adj_r2 = function(model) {
  summary(model)$adj.r.squared
}
```

```{r}
get_loocv_rmse(mod_b)
get_adj_r2(mod_b)
get_sw_decision(mod_b, alpha = 0.01)
get_num_params(mod_b)
```

`The above model satisfies the all criteria for question b`

***

## Exercise 3 (`Sacramento` Housing Data)

For this exercise, use the `Sacramento` data from the `caret` package. Use the following code to perform some preprocessing of the data.

```{r}
library(caret)
library(ggplot2)
data(Sacramento)
sac_data = Sacramento
sac_data$limits = factor(ifelse(sac_data$city == "SACRAMENTO", "in", "out"))
sac_data = subset(sac_data, select = -c(city, zip))
```

Instead of using the `city` or `zip` variables that exist in the dataset, we will simply create a variable (`limits`) indicating whether or not a house is technically within the city limits of Sacramento. (We do this because they would both be factor variables with a **large** number of levels. This is a choice that is made due to laziness, not necessarily because it is justified. Think about what issues these variables might cause.)

Use `?Sacramento` to learn more about this dataset.

A plot of longitude versus latitude gives us a sense of where the city limits are.

```{r}
qplot(y = longitude, x = latitude, data = sac_data,
      col = limits, main = "Sacramento City Limits ")
```

After these modifications, we test-train split the data.

```{r}
set.seed(420)
sac_trn_idx  = sample(nrow(sac_data), size = trunc(0.80 * nrow(sac_data)))
sac_trn_data = sac_data[sac_trn_idx, ]
sac_tst_data = sac_data[-sac_trn_idx, ]
```

The training data should be used for all model fitting. Our goal is to find a model that is useful for predicting home prices.

**(a)** Find a "good" model for `price`. Use any methods seen in class. The model should reach a LOOCV-RMSE below 77,500 in the training data. Do not use any transformations of the response variable.

```{r}
sac_all_model = lm(price ~ ., data = sac_trn_data)
n = length(resid(sac_all_model))

#BACKWARD AIC MODEL
sac_model_back_aic = step(sac_all_model,direction = "backward")
sac_model_backward_aic = lm(price ~ beds + sqft + type + latitude + longitude,data = sac_trn_data)
LOOCV_RMSE_back_aic = sqrt(mean((resid(sac_model_backward_aic)/(1-hatvalues(sac_model_backward_aic)))^2))

#BACKWARD BIC MODEL
n = length(resid(sac_all_model))
sac_model_back_bic = step(sac_all_model,direction = "backward",k=log(n))
sac_model_backward_bic = lm(price ~ beds + sqft + longitude,data = sac_trn_data)
LOOCV_RMSE_back_bic = sqrt(mean((resid(sac_model_backward_bic)/(1-hatvalues(sac_model_backward_bic)))^2))

#FORWARD AIC MODEL
sac_mod_start = lm(price ~ 1, data = sac_trn_data)
sac_model_forward_aic = step(sac_mod_start,scope = Price~beds+baths+sqft+type+latitude+longitude+limits,direction= "forward")
sac_model_forward_aic = lm(price ~ sqft + longitude + beds + latitude + type,data = sac_trn_data)
LOOCV_RMSE_forward_aic = sqrt(mean((resid(sac_model_forward_aic)/(1-hatvalues(sac_model_forward_aic)))^2))

#FORWARD BIC MODEL
sac_model_forward_bic = step(sac_mod_start,scope = Price~beds+baths+sqft+type+latitude+longitude+limits,direction= "forward",k=log(n))
sac_model_forward_bic = lm(price ~ sqft + longitude + beds,data = sac_trn_data)
LOOCV_RMSE_forward_bic = sqrt(mean((resid(sac_model_forward_bic)/(1-hatvalues(sac_model_forward_bic)))^2))

min(LOOCV_RMSE_back_aic,LOOCV_RMSE_back_bic,LOOCV_RMSE_forward_aic,LOOCV_RMSE_forward_bic)
```

The model obtained via backward AIC achieved a low LOOCV RMSE of `r LOOCV_RMSE_back_aic`

**(b)** Is a model that achieves a LOOCV-RMSE below 77,500 useful in this case? That is, is an average error of 77,500 low enough when predicting home prices? To further investigate, use the held-out test data and your model from part **(a)** to do two things:

- Calculate the average percent error:
\[
\frac{1}{n}\sum_i\frac{|\text{predicted}_i - \text{actual}_i|}{\text{predicted}_i} \times 100
\]
- Plot the predicted versus the actual values and add the line $y = x$.

Based on all of this information, argue whether or not this model is useful.

```{r}
sac_pred = predict(sac_model_backward_aic, newdata = sac_tst_data);
avg_pct_err = mean(sum(abs(sac_pred - sac_tst_data$price)/sac_pred )) * 100 
x = sac_pred
y = sac_tst_data$price
plot(x, y, col = "dodgerblue", pch = 20,main = "Prediction vs Actual",xlab = "Prediction",ylab = "Actual")
sac_pred_mod = lm(y~x,data = sac_tst_data)
abline(sac_pred_mod,lwd=2,col = "green")

```

The model has an error percentage of `r avg_pct_err` which is too high and based on this the model is not useful.

***

## Exercise 4 (Does It Work?)

In this exercise, we will investigate how well backwards AIC and BIC actually perform. For either to be "working" correctly, they should result in a low number of both **false positives** and **false negatives**. In model selection,

- **False Positive**, FP: Incorrectly including a variable in the model. Including a *non-significant* variable
- **False Negative**, FN: Incorrectly excluding a variable in the model. Excluding a *significant* variable

Consider the **true** model

\[
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7 + \beta_8 x_8 + \beta_9 x_9 + \beta_{10} x_{10} + \epsilon
\]

where $\epsilon \sim N(0, \sigma^2 = 4)$. The true values of the $\beta$ parameters are given in the `R` code below.

```{r}
beta_0  = 1
beta_1  = -1
beta_2  = 2
beta_3  = -2
beta_4  = 1
beta_5  = 1
beta_6  = 0
beta_7  = 0
beta_8  = 0
beta_9  = 0
beta_10 = 0
sigma = 2
```

Then, as we have specified them, some variables are significant, and some are not. We store their names in `R` variables for use later.

```{r}
not_sig  = c("x_6", "x_7", "x_8", "x_9", "x_10")
signif = c("x_1", "x_2", "x_3", "x_4", "x_5")
```

We now simulate values for these `x` variables, which we will use throughout part **(a)**.

```{r}
set.seed(19830502)
n = 100
x_1  = runif(n, 0, 10)
x_2  = runif(n, 0, 10)
x_3  = runif(n, 0, 10)
x_4  = runif(n, 0, 10)
x_5  = runif(n, 0, 10)
x_6  = runif(n, 0, 10)
x_7  = runif(n, 0, 10)
x_8  = runif(n, 0, 10)
x_9  = runif(n, 0, 10)
x_10 = runif(n, 0, 10)
```

We then combine these into a data frame and simulate `y` according to the true model.

```{r}
sim_data_1 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
  y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + 
      beta_5 * x_5 + rnorm(n, 0 , sigma)
)
```

We do a quick check to make sure everything looks correct.

```{r}
head(sim_data_1)
```

Now, we fit an incorrect model.

```{r}
fit = lm(y ~ x_1 + x_2 + x_6 + x_7, data = sim_data_1)
coef(fit)
```

Notice, we have coefficients for `x_1`, `x_2`, `x_6`, and `x_7`. This means that `x_6` and `x_7` are false positives, while `x_3`, `x_4`, and `x_5` are false negatives.

To detect the false negatives, use:

```{r}
# which are false negatives?
!(signif %in% names(coef(fit)))
```

To detect the false positives, use:

```{r}
# which are false positives?
names(coef(fit)) %in% not_sig
```

Note that in both cases, you could `sum()` the result to obtain the number of false negatives or positives.

**(a)** Set a seed equal to your birthday; then, using the given data for each `x` variable above in `sim_data_1`, simulate the response variable `y` 300 times. Each time,

- Fit an additive model using each of the `x` variables.
- Perform variable selection using backwards AIC.
- Perform variable selection using backwards BIC.
- Calculate and store the number of false negatives for the models chosen by AIC and BIC.
- Calculate and store the number of false positives for the models chosen by AIC and BIC.

Calculate the rate of false positives and negatives for both AIC and BIC. Compare the rates between the two methods. Arrange your results in a well formatted table.

```{r include=FALSE}
set.seed(19830502)
sim_num = 300;
n = 100

total_false_neg_aic = rep(0, sim_num)
total_false_pos_aic = rep(0, sim_num)
total_false_neg_bic = rep(0, sim_num)
total_false_pos_bic = rep(0, sim_num)

for(i in 1:sim_num){
  x_1  = runif(n, 0, 10)
  x_2  = runif(n, 0, 10)
  x_3  = runif(n, 0, 10)
  x_4  = runif(n, 0, 10)
  x_5  = runif(n, 0, 10)
  x_6  = runif(n, 0, 10)
  x_7  = runif(n, 0, 10)
  x_8  = runif(n, 0, 10)
  x_9  = runif(n, 0, 10)
  x_10 = runif(n, 0, 10)  
  
  sim_data_1 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
  y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + 
      beta_5 * x_5 + rnorm(n, 0 , sigma)
  )
  
  add_model = lm(y ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_10, data = sim_data_1);
  
  best_back_aic = step(add_model, direction = "backward")
  best_back_bic = step(add_model, direction = "backward", k = log(n))
  
  total_false_neg_aic[i] = sum(!(signif %in% names(coef(best_back_aic))))
  total_false_pos_aic[i] = sum(names(coef(best_back_aic)) %in% not_sig)
  
  total_false_neg_bic[i] = sum(!(signif %in% names(coef(best_back_bic))))
  total_false_pos_bic[i] = sum(names(coef(best_back_bic)) %in% not_sig)
}
```

```{r}
errors_output = data.frame(
  "AIC Errors" = c(
    "False Negative" = mean(total_false_neg_aic),
    "False Positive" = mean(total_false_pos_aic)
  ), 
  "BIC Errors" = c(
    "False Negative" = mean(total_false_neg_bic),
    "False Positive" = mean(total_false_pos_bic)
  )
)

library(flextable)
library(magrittr)
flextable(errors_output)
```

**(b)** Set a seed equal to your birthday; then, using the given data for each `x` variable below in `sim_data_2`, simulate the response variable `y` 300 times. Each time,

- Fit an additive model using each of the `x` variables.
- Perform variable selection using backwards AIC.
- Perform variable selection using backwards BIC.
- Calculate and store the number of false negatives for the models chosen by AIC and BIC.
- Calculate and store the number of false positives for the models chosen by AIC and BIC.

Calculate the rate of false positives and negatives for both AIC and BIC. Compare the rates between the two methods. Arrange your results in a well formatted table. Also compare to your answers in part **(a)** and suggest a reason for any differences.

```{r}
set.seed(420)
x_1  = runif(n, 0, 10)
x_2  = runif(n, 0, 10)
x_3  = runif(n, 0, 10)
x_4  = runif(n, 0, 10)
x_5  = runif(n, 0, 10)
x_6  = runif(n, 0, 10)
x_7  = runif(n, 0, 10)
x_8  = x_1 + rnorm(n, 0, 0.1)
x_9  = x_1 + rnorm(n, 0, 0.1)
x_10 = x_2 + rnorm(n, 0, 0.1)

sim_data_2 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
  y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + 
      beta_5 * x_5 + rnorm(n, 0 , sigma)
)
```

```{r include=FALSE}
set.seed(19830502)
sim_num = 300;
n = 100

total_false_neg_aic = rep(0, sim_num)
total_false_pos_aic = rep(0, sim_num)
total_false_neg_bic = rep(0, sim_num)
total_false_pos_bic = rep(0, sim_num)

for(i in 1:sim_num){
  x_1  = runif(n, 0, 10)
  x_2  = runif(n, 0, 10)
  x_3  = runif(n, 0, 10)
  x_4  = runif(n, 0, 10)
  x_5  = runif(n, 0, 10)
  x_6  = runif(n, 0, 10)
  x_7  = runif(n, 0, 10)
  x_8  = x_1 + rnorm(n, 0, 0.1)
  x_9  = x_1 + rnorm(n, 0, 0.1)
  x_10 = x_2 + rnorm(n, 0, 0.1)
  
  sim_data_2 = data.frame(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_10,
  y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3 + beta_4 * x_4 + 
      beta_5 * x_5 + rnorm(n, 0 , sigma)
  )
  
  add_model = lm(y ~ x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9 + x_10, data = sim_data_2);
  
  best_back_aic = step(add_model, direction = "backward")
  best_back_bic = step(add_model, direction = "backward", k = log(n))
  
  total_false_neg_aic[i] = sum(!(signif %in% names(coef(best_back_aic))))
  total_false_pos_aic[i] = sum(names(coef(best_back_aic)) %in% not_sig)
  
  total_false_neg_bic[i] = sum(!(signif %in% names(coef(best_back_bic))))
  total_false_pos_bic[i] = sum(names(coef(best_back_bic)) %in% not_sig)
}
```

```{r}
errors_output = data.frame(
  "AIC Errors" = c(
    "False Negative" = mean(total_false_neg_aic),
    "False Positive" = mean(total_false_pos_aic)
  ), 
  "BIC Errors" = c(
    "False Negative" = mean(total_false_neg_bic),
    "False Positive" = mean(total_false_pos_bic)
  )
)

library(flextable)
library(magrittr)
flextable(errors_output)
```

There is colinearity between x_8,x_9 and x_10 with the significant predictors x_1 and x_2 which will result in more errors.