---
title: "Assignment 7 - Machine Learning"
author: "Ken Benoit and Slava Mikhaylov"
output: html_document
---
Assignments for the course focus on practical aspects of the concepts covered in the lectures. Assignments are largely based on the material covered in James et al. (2013). You will start working on the assignments in the lab sessions after the lectures, but may need to finish them after class.
You will be asked to submit your assignments via Moodle by 7pm on the day of the class. We will subsequently open up solutions to the problem sets.
### Exercise summary
This exercise revisits regression as a "machine learning" tool, for both linear and logistic regression, as well as simple machine learning using the Naive Bayes and kNN algorithms.
For this exercise we will use the Stata dataset `dail2002.dta` from the article Kenneth Benoit and Michael Marsh. 2008. "[The Campaign Value of Incumbency: A New Solution to the Puzzle of Less Effective Incumbent Spending.](http://www.kenbenoit.net/pdfs/ajps_348.pdf)" *American Journal of Political Science* 52(4, October): 874-890. This is available directly [here](http://www.kenbenoit.net/files/dail2002.dta). To load this into R, you will need the `read.dta` command from the `foreign` package. (Note that you can load straight from the URL using this command.) Call this data object `dail2002`.
```{r}
require(foreign, quietly = TRUE)
dail2002 <- read.dta("http://www.kenbenoit.net/files/dail2002.dta")
```
1. **Partitioning the dataset into "folds".** For this exercise, we will fit a model to a subset of the data, use the fitted model to predict the outcomes for the "left-out" set, and use those predictions to evaluate root mean squared error (RMSE) or accuracy.
Use the `sample()` command to draw a random sample of one-fifth of the observations in the `dail2002.dta` dataset as a test set, and four-fifths to be a training set.
```{r}
# assign each row independently to "test" or "train" with probabilities .2/.8,
# giving approximately (not exactly) a one-fifth test set
dail2002$tclass <- sample(c("test", "train"), nrow(dail2002), replace = TRUE, prob = c(.2, .8))
```
We will use this indicator variable `tclass` to partition the dataset in the questions below. Be sure you understand how we used the `sample()` command this way!
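As a quick check (a minimal sketch, not part of the original solution), we can tabulate `tclass` to confirm the split is roughly 20/80:
```{r}
# counts and proportions of the test/train partition
table(dail2002$tclass)
prop.table(table(dail2002$tclass))
```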
2. **OLS regression for prediction.**
1. Fit a regression from the dataset to predict `votes1st`. You may use any combination of
regressors that you wish. Save the model object to `reg2_1`.
```{r}
reg2_1 <- lm(votes1st ~ incumb:spend_total + senator + gender + electorate, data=dail2002)
```
2. Predict the `votes1st` from the same sample to which you fitted the regression. What is the
Root Mean Squared Error (RMSE) and how would you interpret this?
```{r}
length(predict(reg2_1))  # shortened to 462 because of the two missing cases
# refit the model with na.action=na.exclude (and with main effects via
# incumb*spend_total) so that predictions are padded to full length with NA
reg2_1 <- lm(votes1st ~ incumb*spend_total + senator + gender + electorate,
             data=dail2002, na.action=na.exclude)
reg2_1pred <- predict(reg2_1)
length(reg2_1pred)  # now the full length, with NA for the missing cases
sum(is.na(reg2_1pred))
# function to compute the RMSE of a fitted model; dividing by the residual
# degrees of freedom (rather than n) makes this the "residual standard error"
# reported by summary()
computeRMSE <- function(model) {
    sqrt(sum(residuals(model)^2, na.rm=TRUE) / df.residual(model))
}
computeRMSE(reg2_1)
# compare to output from summary()
summary(reg2_1)
```
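To interpret this: the quantity computed above is

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i} \hat{u}_i^2}{n - k}},$$

the square root of the sum of squared residuals over the residual degrees of freedom, which matches the "residual standard error" from `summary()`. It is measured in the units of the response, so it is roughly the typical size of a prediction error in first-preference votes.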
3. Drop the incumbency variable -- that you hopefully included in your answer to 2.1! -- and
repeat steps 2.1--2.2. Compute a new RMSE and compare this to the previous one. Which
is a better predictor?
```{r}
reg2_3 <- lm(votes1st ~ spend_total + senator + gender + electorate,
data=dail2002, na.action=na.exclude)
computeRMSE(reg2_3)
summary(reg2_3)$r.squared
```
**The model with incumbency is a better predictor, with a lower RMSE and a higher $R^2$.**
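Both RMSEs above are in-sample. As a minimal sketch (an addition, not part of the original solution), the `tclass` partition from question 1 can be used to compute an out-of-sample RMSE instead:
```{r}
# fit the 2.2 specification on the training portion only
oosModel <- lm(votes1st ~ incumb*spend_total + senator + gender + electorate,
               data = subset(dail2002, tclass == "train"), na.action = na.exclude)
# predict the held-out test portion
oosPred <- predict(oosModel, newdata = subset(dail2002, tclass == "test"))
# out-of-sample RMSE, averaging over the predicted (non-missing) cases
sqrt(mean((subset(dail2002, tclass == "test")$votes1st - oosPred)^2, na.rm = TRUE))
```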
3. **Logistic regression for prediction**.
1. Fit a logistic regression (hint: use `glm()`) to predict the outcome variable `wonseat`. Use
any specification that you think provides a good prediction.
```{r}
reg3_1 <- glm(wonseat ~ incumb:spend_total + senator + gender + electorate,
              data=dail2002, family=binomial, na.action=na.exclude)
```
2. For the full sample, compute:
* a table of actual `wonseat` by predicted `wonseat`
```{r}
# predict() returns log-odds by default, so > 0 means Pr(wonseat) > 0.5
wonseat_predicted <- ifelse(predict(reg3_1) > 0, 1, 0)
(predtable <- table(wonseat_predicted, dail2002$wonseat))
```
* percent correctly predicted
```{r}
sum(diag(prop.table(predtable))) * 100
```
* precision
* recall
```{r}
# define a general function to compute precision & recall; assumes rows are
# predicted classes, columns are actual classes, and the positive class is
# in cell [1,1]
precrecall <- function(mytable, verbose=TRUE) {
    truePositives <- mytable[1,1]
    falsePositives <- sum(mytable[1,]) - truePositives
    falseNegatives <- sum(mytable[,1]) - truePositives
    precision <- truePositives / (truePositives + falsePositives)
    recall <- truePositives / (truePositives + falseNegatives)
    if (verbose) {
        print(mytable)
        cat("\n precision =", round(precision, 2),
            "\n recall =", round(recall, 2), "\n")
    }
    invisible(c(precision, recall))
}
# reorder rows and columns so the "won seat" cell is in position [1,1], then
# compute precision and recall for predicting seats won (not seats lost!)
precrecall(predtable[2:1, 2:1])
```
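With the positive class (a seat won) in cell [1,1] of the reordered table, the function computes

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}.$$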
3. Comparing two models.
* Compute an 8-fold validation, where for 8 different training sets consisting of 7/8 of the observations, you predict the other held-out 1/8 and compare the actual to predicted for the 1/8 test set. Compute an average F1 score for the 8 models.
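The F1 score is the harmonic mean of precision and recall,

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$

which is what `2 * prod(prrec) / sum(prrec)` computes inside the loop below.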
```{r}
# here it helps to know that n=464 and 464 %% 8 == 0, so the 8 folds are equal-sized
# (note this assigns folds in row order; wrapping rep() in sample() would randomize)
foldIndex <- rep(1:8, each = nrow(dail2002)/8)
# initialize vector to store F1 scores
f1result <- c()
# loop over folds
cat("fold number: ")
for (i in 1:max(foldIndex)) {
    cat(i)
    # fit on the 7/8 of observations outside fold i
    glmresult <- glm(wonseat ~ incumb:spend_total + senator + gender + electorate,
                     na.action = na.exclude, family = "binomial",
                     data = subset(dail2002, foldIndex != i))
    # predict the held-out fold i, classifying at Pr = 0.5 (log-odds > 0)
    wonseatPred <- ifelse(predict(glmresult, newdata = subset(dail2002, foldIndex == i)) > 0, 1, 0)
    predtable <- table(wonseatPred, subset(dail2002, foldIndex == i)$wonseat)
    prrec <- precrecall(predtable[2:1, 2:1], verbose = FALSE)
    # F1 = harmonic mean of precision and recall
    f1result[i] <- 2 * prod(prrec) / sum(prrec)
}
cat("\n")
f1result
mean(f1result)
```
* Now drop a variable or two, and repeat the previous step to compare the average F1 score for this model.
```{r}
f1result <- c()
# loop over folds
for (i in 1:8) {
    glmresult <- glm(wonseat ~ spend_total + senator + gender + electorate,
                     na.action = na.exclude, family = "binomial",
                     data = subset(dail2002, foldIndex != i))
    wonseatPred <- ifelse(predict(glmresult, newdata = subset(dail2002, foldIndex == i)) > 0, 1, 0)
    predtable <- table(wonseatPred, subset(dail2002, foldIndex == i)$wonseat)
    prrec <- precrecall(predtable[2:1, 2:1], verbose = FALSE)
    f1result[i] <- 2 * prod(prrec) / sum(prrec)
}
f1result
mean(f1result)
```
* Why is it valuable to use the different folds here, rather than simply comparing the F1 score for the predicted outcome of the entire sample, fit to the entire sample?
**Because a model fit and evaluated on the same data tends to overfit, overstating its predictive performance. To calibrate a predictive model we need to test it out of sample, and cross-validation lets every observation serve for both fitting and testing while keeping the two roles separate, as the check below illustrates.**
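As a concrete check (an addition, not part of the original solution), we can compute the in-sample F1 for the full model fit to the whole dataset and compare it with the cross-validated average above; the in-sample figure will typically be at least as high:
```{r}
# in-sample F1: fit and evaluate on the same (full) data
glmFull <- glm(wonseat ~ incumb:spend_total + senator + gender + electorate,
               na.action = na.exclude, family = "binomial", data = dail2002)
predtableFull <- table(ifelse(predict(glmFull) > 0, 1, 0), dail2002$wonseat)
prrecFull <- precrecall(predtableFull[2:1, 2:1], verbose = FALSE)
2 * prod(prrecFull) / sum(prrecFull)
```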
4. **kNN prediction.**
1. Fit a $k=1$ kNN model using the same specification as in your logistic regression. Compare the
percent correctly predicted. Which model worked better? (Note: you can use the whole sample here, so that the results are comparable to those from 3.1.)
```{r}
# get the variables we want for kNN (note: `senator`, which appeared in the
# logistic specification, is omitted here)
dail2002sub <- subset(dail2002, select = c(spend_total, incumb, gender, electorate))
# convert the factor to numeric 0 or 1
dail2002sub$gender <- unclass(dail2002sub$gender) - 1
# figure out which are complete
completeCasesIndex <- complete.cases(dail2002sub)
# set the training and test datasets to be identical
train <- test <- dail2002sub[completeCasesIndex, ]
require(class)
# make the true class a factor
wonseatf <- factor(dail2002$wonseat[completeCasesIndex], labels = c("No", "Yes"))
knnPred <- knn(train, test, wonseatf, k=1)
precrecall(table(knnPred, wonseatf)[2:1, 2:1])
```
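Two caveats worth noting (additions, not part of the original solution): first, with `train` identical to `test` and $k=1$, each point is its own nearest neighbour, so in-sample accuracy is trivially near-perfect; second, kNN is distance-based, so `electorate`, which is on a much larger scale than the 0/1 dummies, dominates the Euclidean distance. A minimal sketch of standardizing the features first:
```{r}
# standardize each feature to mean 0, sd 1 before computing distances
trainScaled <- testScaled <- scale(train)
precrecall(table(knn(trainScaled, testScaled, wonseatf, k = 1), wonseatf)[2:1, 2:1])
```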
2. Experiment with two more settings of $k$ to see how this affects prediction, reporting the percent
correctly predicted.
```{r}
# k = 2
precrecall(table(knnPred <- knn(train, test, wonseatf, k=2), wonseatf)[2:1, 2:1])
# k = 3
precrecall(table(knnPred <- knn(train, test, wonseatf, k=3), wonseatf)[2:1, 2:1])
# k = 4
precrecall(table(knnPred <- knn(train, test, wonseatf, k=4), wonseatf)[2:1, 2:1])
# k = 5
precrecall(table(knnPred <- knn(train, test, wonseatf, k=5), wonseatf)[2:1, 2:1])
```
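The calls above report precision and recall; since the question asks for the percent correctly predicted, a short sketch (an addition to the original solution) computes it for each $k$:
```{r}
# percent correctly predicted = 100 * sum of the diagonal of the proportion table
pctCorrect <- function(k) {
    predtab <- table(knn(train, test, wonseatf, k = k), wonseatf)
    sum(diag(prop.table(predtab))) * 100
}
sapply(1:5, pctCorrect)
```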
We can also plot precision against recall to evaluate performance at each level of `k` (a precision-recall plot, similar in spirit to an ROC curve). Note that this will not produce "curves" because we are not fitting to different folds or datasets, just once to recover the predictions from the whole training set.
```{r}
maxk <- 10
prresults <- data.frame(precision = NA, recall = NA)
for (k in 1:maxk) {
    prresults[k, ] <- precrecall(table(knn(train, test, wonseatf, k = k), wonseatf)[2:1, 2:1],
                                 verbose = FALSE)
}
with(prresults, plot(recall, precision, type="n", xlim=c(.5, 1), ylim=c(.5, 1)))
with(prresults, text(recall, precision, paste0("k=", 1:maxk)))
```