---
title: "Random Forest Model for Heart Disease Classification"
author: "Methun George"
date: "2024-09-24"
output:
  html_document: default
  pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Random Forest Machine Learning Algorithm
This markdown outlines the implementation of a Random Forest algorithm inspired by Josh Starmer's machine learning example. The dataset used is the Heart Disease dataset from the UCI Machine Learning Repository.
```{r}
#Loading the libraries
library(ggplot2)
library(cowplot)
library(randomForest)
```
We will be using the processed Cleveland data from the UCI repository.
```{r}
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
#read the data from the url
data <- read.csv(url, header=FALSE)
head(data)
```
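As a quick sanity check, the processed Cleveland file is expected to have 303 rows and 14 columns (a figure taken from the UCI documentation, so treat it as an assumption about this download):
```{r}
# Sanity check: expect 303 observations of 14 variables
dim(data)
```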
The columns of the data are not named, so we checked the UCI website and assigned names to them in the following code.
```{r}
colnames(data) <- c(
  "age",
  "sex",      # 0 = female, 1 = male
  "cp",       # chest pain
              # 1 = typical angina,
              # 2 = atypical angina,
              # 3 = non-anginal pain,
              # 4 = asymptomatic
  "trestbps", # resting blood pressure (in mm Hg)
  "chol",     # serum cholesterol in mg/dl
  "fbs",      # fasting blood sugar > 120 mg/dl, 1 = TRUE, 0 = FALSE
  "restecg",  # resting electrocardiographic results
              # 0 = normal
              # 1 = having ST-T wave abnormality
              # 2 = showing probable or definite left ventricular hypertrophy
  "thalach",  # maximum heart rate achieved
  "exang",    # exercise induced angina, 1 = yes, 0 = no
  "oldpeak",  # ST depression induced by exercise relative to rest
  "slope",    # the slope of the peak exercise ST segment
              # 1 = upsloping
              # 2 = flat
              # 3 = downsloping
  "ca",       # number of major vessels (0-3) colored by fluoroscopy
  "thal",     # short for thallium heart scan
              # 3 = normal (no cold spots)
              # 6 = fixed defect (cold spots during rest and exercise)
              # 7 = reversible defect (cold spots appear only during exercise)
  "hd"        # (the predicted attribute) diagnosis of heart disease
              # 0 = less than 50% diameter narrowing
              # 1-4 = greater than 50% diameter narrowing
)
head(data)
```
Now that the column names are set, we check the structure of the data.
```{r}
str(data)
```
The following code puts the data into the right format (replacing missing-value markers and converting numeric codes to factors).
```{r}
# Replace the "?" markers used for missing values with NA
data[data == "?"] <- NA

# Recode the sex column: 0 = F, 1 = M
data[data$sex == 0,]$sex <- "F"
data[data$sex == 1,]$sex <- "M"

# Convert the categorical columns into factors
data$sex <- as.factor(data$sex)
data$cp <- as.factor(data$cp)
data$fbs <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang <- as.factor(data$exang)
data$slope <- as.factor(data$slope)

# The "?"s made R read ca and thal as text, so convert them to
# integers first and then to factors
data$ca <- as.integer(data$ca)
data$ca <- as.factor(data$ca)
data$thal <- as.integer(data$thal)
data$thal <- as.factor(data$thal)

## Replace hd = 0 with "Healthy" and hd = 1-4 with "Unhealthy"
data$hd <- ifelse(test=data$hd == 0, yes="Healthy", no="Unhealthy")
data$hd <- as.factor(data$hd)
str(data)
```
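Before imputing, it is worth seeing how much is actually missing. A minimal check; given the replacement above, only ca and thal are expected to contain NAs:
```{r}
# Count the NAs per column introduced by replacing "?" above
colSums(is.na(data))
```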
Now the seed is set to 42 for reproducibility, and the NA values are imputed with the rfImpute() function. The model will predict hd (heart disease).
```{r}
set.seed(42)
data.imputed <- rfImpute(hd ~ ., data = data, iter=6 )
```
Here OOB stands for out-of-bag: rfImpute() prints the OOB error rate after each iteration, and the values should get smaller as the imputed estimates improve.
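To confirm the imputation left no gaps, a quick check (a minimal sketch):
```{r}
# After rfImpute(), the imputed data frame should contain no NAs
sum(is.na(data.imputed))
```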
Next we create the model, with data.imputed as the data.
```{r}
model <- randomForest(hd ~ ., data = data.imputed, proximity=TRUE)
model
```
Above is the summary of the randomForest model we created. In the confusion matrix, 142 healthy patients were correctly predicted to be Healthy, but 29 unhealthy patients were also predicted to be Healthy; likewise, 110 unhealthy patients were correctly predicted to be Unhealthy, while 22 healthy patients were wrongly predicted to be Unhealthy.
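From these counts we can work out the overall out-of-bag accuracy directly; the short calculation below simply reuses the numbers quoted above:
```{r}
# Overall OOB accuracy from the confusion matrix counts above
(142 + 110) / (142 + 22 + 29 + 110)   # about 0.83
```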
To see whether 500 trees are enough for optimal classification, we can plot the error rates using ggplot.
```{r}
oob.error.data <- data.frame(
  Trees = rep(1:nrow(model$err.rate), times = 3),
  Type = rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(model$err.rate)),
  Error = c(model$err.rate[, "OOB"],
            model$err.rate[, "Healthy"],
            model$err.rate[, "Unhealthy"]))

ggplot(data = oob.error.data, aes(x = Trees, y = Error)) +
  geom_line(aes(color = Type))
```
The graph above shows the error rates as a function of the number of trees. To check how the number of trees influences the error rate, the number of trees is doubled in the following code.
```{r}
model_2 <- randomForest(hd ~ ., data = data.imputed, ntree=1000, proximity=TRUE)
model_2
```
Comparing this model to the earlier one with 500 trees shows that the 1000-tree model is marginally better. A direct numeric comparison is sketched below, and the error rates are then plotted again with ggplot to see the difference.
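As a minimal sketch of that comparison, we can read the final row of each model's err.rate matrix, which holds the OOB error after all trees have been grown:
```{r}
# Final OOB error: 500-tree model vs 1000-tree model
model$err.rate[nrow(model$err.rate), "OOB"]
model_2$err.rate[nrow(model_2$err.rate), "OOB"]
```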
```{r}
oob.error.data_2 <- data.frame(
  Trees = rep(1:nrow(model_2$err.rate), times = 3),
  Type = rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(model_2$err.rate)),
  Error = c(model_2$err.rate[, "OOB"],
            model_2$err.rate[, "Healthy"],
            model_2$err.rate[, "Unhealthy"]))

ggplot(data = oob.error.data_2, aes(x = Trees, y = Error)) +
  geom_line(aes(color = Type))
```
Here the error rate stabilizes after about 500 trees.
Now we consider the optimal value of mtry, the number of variables tried at each split (it was 3 in the summary of the first model).
```{r}
# Create an empty vector to hold the 10 OOB error rates
oob.values <- vector(length = 10)

# To compare random forests with different values of mtry (which controls
# how many variables are considered at each split), this loop fits a model
# for each mtry from 1 to 10 and stores its final OOB error rate
for (i in 1:10) {
  temp.model <- randomForest(hd ~ ., data = data.imputed, mtry = i, ntree = 1000)
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), 1]
}
oob.values
```
Here we see that mtry = 5 gives the lowest OOB error rate, 0.1683168, so mtry = 5 is the better choice for the final model.
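As a sketch of how that refit might look, the best mtry can also be selected programmatically; best.mtry and final.model are illustrative names rather than part of the original analysis:
```{r}
# Pick the mtry with the lowest OOB error, then refit the model with it
best.mtry <- which.min(oob.values)
best.mtry
final.model <- randomForest(hd ~ ., data = data.imputed,
                            ntree = 1000, mtry = best.mtry,
                            proximity = TRUE)
final.model
```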
Now we create an MDS (multidimensional scaling) plot from the random forest proximity matrix; this helps us see how the samples relate to one another.
```{r}
# Convert the proximity matrix into a distance matrix
distance.matrix <- as.dist(1 - model$proximity)

# Run classical MDS on the distance matrix
mds.stuff <- cmdscale(distance.matrix, eig = TRUE, x.ret = TRUE)

# Calculate the percentage of variation each MDS axis accounts for
mds.var.per <- round(mds.stuff$eig / sum(mds.stuff$eig) * 100, 1)

# Format the data for ggplot
mds.values <- mds.stuff$points
mds.data <- data.frame(Sample = rownames(mds.values),
                       X = mds.values[, 1],
                       Y = mds.values[, 2],
                       Status = data.imputed$hd)

ggplot(data = mds.data, aes(x = X, y = Y, label = Sample)) +
  geom_text(aes(color = Status)) +
  theme_bw() +
  xlab(paste("MDS1 - ", mds.var.per[1], "%", sep = "")) +
  ylab(paste("MDS2 - ", mds.var.per[2], "%", sep = "")) +
  ggtitle("MDS plot using (1 - Random Forest Proximities)")
```
The healthy samples sit on the left side and the unhealthy samples on the right. Patient 251, who appears on the right alongside the unhealthy patients, was likely misdiagnosed. The X axis accounts for 15.1% of the variation in the distance matrix, while the Y axis accounts for only 5.5%, which indicates that the big difference is along the X axis. A quick cross-check of sample 251 follows.
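As a final sketch, sample 251 can be cross-checked directly by comparing its recorded diagnosis with the model's out-of-bag prediction; indexing by row name here assumes rfImpute() preserved the original row names:
```{r}
# Recorded diagnosis vs the forest's OOB prediction for sample 251
data.imputed["251", "hd"]
model$predicted["251"]
```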