05-02-ML-Random-Forest.Rmd

## Random Forest {-}

##### Background {-}

The following dataset are career stats for over 1000 MLB baseball players. The data consists of the position of each player along with 19 numeric variables measure offense. The training set consists of 677 observations and the testing set has 339 observations. The objective is to build a model that will predict whether a player is in the Hall of Fame based on his career statistics. Since so few players make it to the Hall of Fame, the methodology for scoring the accuracy of models is based on the following calculation: $(sensitivity + 3*specificity) / 4$. The objective is to get as few incorrect predictions as possible, but having fewer false positives will affect the accuracy measure more than false negatives.

```{r b1, comment=NA, warning=FALSE, message=FALSE, fig.height=5, fig.width=9}
## required packages
library(randomForest)

## Training and Testing Data
hof.train = read.csv("data/HOF_tr.csv"); hof.test = read.csv("data/HOF_te.csv")

## remove unwanted columns
hof.train = hof.train[, -c(2:4)]; hof.test = hof.test[, -c(2:4)]

head(hof.train)
summary(hof.train)

## Simple Random Forest
(mdl = randomForest(HOF ~ ., ntree = 1000, data = hof.train))

## Function for testing accuracy
metric = function(confusion) {
  sensitivity = confusion[4] / (confusion[2] + confusion[4])
  specificity = confusion[1] / (confusion[1] + confusion[3])
  score = (sensitivity + (3 * specificity)) / 4
  return(score)
}

## Plot of the model performance
par(mfrow = c(1, 2))
plot(x = 1:1000, y = mdl$err.rate[,1], xlab = "Trees", ylab = "Error", type = "l",
     main = "Out of Sample Error Rate")
varImpPlot(mdl, main = "Variable Importance Plot")

```

##### Testing Model Accuracy on New Data {-}

Now that we have a trained model, we will apply the model to data that was not used in the training set. We will calculate the same accuracy score and compare the two.  If they are wildly different we may have a problem with overfitting.

```{r b2, comment=NA, warning=FALSE, message=FALSE, fig.height=5, fig.width=9}

## predict the probability of HOF
estimate = data.frame(predict(mdl, hof.test, type = "prob"))
estimate$predict = predict(mdl, hof.test)
estimate$actual = hof.test$HOF

## Generate a confusion matrix
(confusion = table(estimate[, 3:4]))

## Final Accuracy Measure
(test.metric = metric(confusion))

```

The random forest method is fairly robust to overfitting because it reserves some of the training data to use as test data which is called Out of Bag (OOB error). Because of this internal mechanism we could probably ues a larger portion of the overall data to train. The next sections tests this to see if accuracy is improved.

```{r b3, comment=NA, warning=FALSE, message=FALSE, fig.height=5, fig.width=9}

hof = rbind(hof.train, hof.test)

## create a training and testing set by randomly sampling from all of the data
set.seed(1002)
x = sample(nrow(hof), replace = FALSE)

## lets train the model on about 90% of the data
train = hof[x[1:900], ]
test = hof[-x[1:900], ]

## build the model
(mdl = randomForest(HOF ~ ., ntree = 1000, data = train))

## predict the probability of HOF
estimate = data.frame(predict(mdl, test, type = "prob"))
estimate$predict = predict(mdl, test)
estimate$actual = test$HOF

## confusion matrix
(confusion = table(estimate[, 3:4]))
(test.metric = metric(confusion))

## model plots
par(mfrow = c(1, 2))
plot(x = 1:1000, y = mdl$err.rate[,1], xlab = "Trees", ylab = "Error", type = "l",
     main = "Out of Sample Error Rate")
varImpPlot(mdl, main = "Variable Importance Plot")

```


We get slightly better results from increasing the training set. The Random Forest model is predicting Yes to Hall of Fame if it measures the probability > .5.  Since it is is so rare that a player gets voted to the Hall of Fame how accurate is the model if we lower the threshold? Based on a review of some of the false negatives I will make the minimum threshold for predicting yes .33

```{r b4, comment=NA, warning=FALSE, message=FALSE, fig.height=5, fig.width=9}

## predict the probability of HOF
estimate = data.frame(predict(mdl, test, type = "prob"))
estimate$predict = "N"
estimate$predict[which(estimate$Y > .33)] = "Y"
estimate$actual = test$HOF

## confusion matrix
(confusion = table(estimate[, 3:4]))
(test.metric = metric(confusion))
```