---
title: "Random Forest Model for Heart Disease Classification"
author: "Methun George"
date: "2024-09-24"
output:
  html_document: default
  pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Random Forest Machine Learning Algorithm
This markdown outlines the implementation of a Random Forest algorithm inspired by Josh Starmer's machine learning example. The dataset used is the Heart Disease dataset from the UCI Machine Learning Repository.
```{r}
#Loading the libraries
library(ggplot2)
library(cowplot)
library(randomForest)
```
We will be using the processed Cleveland data from the UCI repository.
```{r}
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
#read the data from the url
data <- read.csv(url, header=FALSE)
head(data)
```
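As a quick sanity check, the processed Cleveland file is expected to have 303 rows and 14 columns (a figure taken from the UCI documentation, so treat it as an assumption about this download):
```{r}
# Sanity check: expect 303 observations of 14 variables
dim(data)
```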
The columns of the data are not named, so we checked the UCI website and assigned names to them in the following code.
```{r}
colnames(data) <- c(
  "age",
  "sex",      # 0 = female, 1 = male
  "cp",       # chest pain
              # 1 = typical angina,
              # 2 = atypical angina,
              # 3 = non-anginal pain,
              # 4 = asymptomatic
  "trestbps", # resting blood pressure (in mm Hg)
  "chol",     # serum cholesterol in mg/dl
  "fbs",      # fasting blood sugar > 120 mg/dl, 1 = TRUE, 0 = FALSE
  "restecg",  # resting electrocardiographic results
              # 0 = normal
              # 1 = having ST-T wave abnormality
              # 2 = showing probable or definite left ventricular hypertrophy
  "thalach",  # maximum heart rate achieved
  "exang",    # exercise induced angina, 1 = yes, 0 = no
  "oldpeak",  # ST depression induced by exercise relative to rest
  "slope",    # the slope of the peak exercise ST segment
              # 1 = upsloping
              # 2 = flat
              # 3 = downsloping
  "ca",       # number of major vessels (0-3) colored by fluoroscopy
  "thal",     # short for thallium heart scan
              # 3 = normal (no cold spots)
              # 6 = fixed defect (cold spots during rest and exercise)
              # 7 = reversible defect (cold spots appear only during exercise)
  "hd"        # (the predicted attribute) diagnosis of heart disease
              # 0 = less than 50% diameter narrowing
              # 1-4 = greater than 50% diameter narrowing
)
head(data)
```
Now that the column names are set, we check the structure of the data.
```{r}
str(data)
```
The following code puts the data into the right format (replacing missing-value markers and converting numeric codes to factors).
```{r}
# Replace the "?" markers used for missing values with NA
data[data == "?"] <- NA

# Recode the sex column: 0 = F, 1 = M
data[data$sex == 0,]$sex <- "F"
data[data$sex == 1,]$sex <- "M"

# Convert the categorical columns into factors
data$sex <- as.factor(data$sex)
data$cp <- as.factor(data$cp)
data$fbs <- as.factor(data$fbs)
data$restecg <- as.factor(data$restecg)
data$exang <- as.factor(data$exang)
data$slope <- as.factor(data$slope)

# The "?"s made R read ca and thal as text, so convert them to
# integers first and then to factors
data$ca <- as.integer(data$ca)
data$ca <- as.factor(data$ca)
data$thal <- as.integer(data$thal)
data$thal <- as.factor(data$thal)

## Replace hd = 0 with "Healthy" and hd = 1-4 with "Unhealthy"
data$hd <- ifelse(test=data$hd == 0, yes="Healthy", no="Unhealthy")
data$hd <- as.factor(data$hd)
str(data)
```
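Before imputing, it is worth seeing how much is actually missing. A minimal check; given the replacement above, only ca and thal are expected to contain NAs:
```{r}
# Count the NAs per column introduced by replacing "?" above
colSums(is.na(data))
```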
Now the seed is set to 42 for reproducibility, and the NA values are imputed with the rfImpute() function. The model will predict hd (heart disease).
```{r}
set.seed(42)
data.imputed <- rfImpute(hd ~ ., data = data, iter=6 )
```
Here OOB stands for out-of-bag: rfImpute() prints the OOB error rate after each iteration, and the values should get smaller as the imputed estimates improve.
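To confirm the imputation left no gaps, a quick check (a minimal sketch):
```{r}
# After rfImpute(), the imputed data frame should contain no NAs
sum(is.na(data.imputed))
```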
Next we create the model, with data.imputed as the data.
```{r}
model <- randomForest(hd ~ ., data = data.imputed, proximity=TRUE)
model
```
Above is the summary of the randomForest model we created. In the confusion matrix, 142 healthy patients were correctly predicted to be Healthy, but 29 unhealthy patients were also predicted to be Healthy; likewise, 110 unhealthy patients were correctly predicted to be Unhealthy, while 22 healthy patients were wrongly predicted to be Unhealthy.
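From these counts we can work out the overall out-of-bag accuracy directly; the short calculation below simply reuses the numbers quoted above:
```{r}
# Overall OOB accuracy from the confusion matrix counts above
(142 + 110) / (142 + 22 + 29 + 110)   # about 0.83
```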
To see whether 500 trees are enough for optimal classification, we can plot the error rates using ggplot.
```{r}
oob.error.data <- data.frame(
  Trees = rep(1:nrow(model$err.rate), times = 3),
  Type = rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(model$err.rate)),
  Error = c(model$err.rate[, "OOB"],
            model$err.rate[, "Healthy"],
            model$err.rate[, "Unhealthy"]))

ggplot(data = oob.error.data, aes(x = Trees, y = Error)) +
  geom_line(aes(color = Type))
```
The graph above shows the error rates as a function of the number of trees. To check how the number of trees influences the error rate, the number of trees is doubled in the following code.
```{r}
model_2 <- randomForest(hd ~ ., data = data.imputed, ntree=1000, proximity=TRUE)
model_2
```
Comparing this model to the earlier one with 500 trees shows that the 1000-tree model is marginally better. A direct numeric comparison is sketched below, and the error rates are then plotted again with ggplot to see the difference.
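As a minimal sketch of that comparison, we can read the final row of each model's err.rate matrix, which holds the OOB error after all trees have been grown:
```{r}
# Final OOB error: 500-tree model vs 1000-tree model
model$err.rate[nrow(model$err.rate), "OOB"]
model_2$err.rate[nrow(model_2$err.rate), "OOB"]
```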
```{r}
oob.error.data_2 <- data.frame(
  Trees = rep(1:nrow(model_2$err.rate), times = 3),
  Type = rep(c("OOB", "Healthy", "Unhealthy"), each = nrow(model_2$err.rate)),
  Error = c(model_2$err.rate[, "OOB"],
            model_2$err.rate[, "Healthy"],
            model_2$err.rate[, "Unhealthy"]))

ggplot(data = oob.error.data_2, aes(x = Trees, y = Error)) +
  geom_line(aes(color = Type))
```
Here the error rate stabilizes after about 500 trees.
Now we consider the optimal value of mtry, the number of variables tried at each split (it was 3 in the summary of the first model).
```{r}
# Create an empty vector to hold the 10 OOB error rates
oob.values <- vector(length = 10)

# To compare random forests with different values of mtry (which controls
# how many variables are considered at each split), this loop fits a model
# for each mtry from 1 to 10 and stores its final OOB error rate
for (i in 1:10) {
  temp.model <- randomForest(hd ~ ., data = data.imputed, mtry = i, ntree = 1000)
  oob.values[i] <- temp.model$err.rate[nrow(temp.model$err.rate), 1]
}
oob.values
```
Here we see that mtry = 5 gives the lowest OOB error rate, 0.1683168, so mtry = 5 is the better choice for the final model.
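As a sketch of how that refit might look, the best mtry can also be selected programmatically; best.mtry and final.model are illustrative names rather than part of the original analysis:
```{r}
# Pick the mtry with the lowest OOB error, then refit the model with it
best.mtry <- which.min(oob.values)
best.mtry
final.model <- randomForest(hd ~ ., data = data.imputed,
                            ntree = 1000, mtry = best.mtry,
                            proximity = TRUE)
final.model
```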
Now we create an MDS (multidimensional scaling) plot from the random forest proximity matrix; this helps us see how the samples relate to one another.
```{r}
# Convert the proximity matrix into a distance matrix
distance.matrix <- as.dist(1 - model$proximity)

# Run classical MDS on the distance matrix
mds.stuff <- cmdscale(distance.matrix, eig = TRUE, x.ret = TRUE)

# Calculate the percentage of variation each MDS axis accounts for
mds.var.per <- round(mds.stuff$eig / sum(mds.stuff$eig) * 100, 1)

# Format the data for ggplot
mds.values <- mds.stuff$points
mds.data <- data.frame(Sample = rownames(mds.values),
                       X = mds.values[, 1],
                       Y = mds.values[, 2],
                       Status = data.imputed$hd)

ggplot(data = mds.data, aes(x = X, y = Y, label = Sample)) +
  geom_text(aes(color = Status)) +
  theme_bw() +
  xlab(paste("MDS1 - ", mds.var.per[1], "%", sep = "")) +
  ylab(paste("MDS2 - ", mds.var.per[2], "%", sep = "")) +
  ggtitle("MDS plot using (1 - Random Forest Proximities)")
```
The healthy samples sit on the left side and the unhealthy samples on the right. Patient 251, who appears on the right alongside the unhealthy patients, was likely misdiagnosed. The X axis accounts for 15.1% of the variation in the distance matrix, while the Y axis accounts for only 5.5%, which indicates that the big difference is along the X axis. A quick cross-check of sample 251 follows.
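As a final sketch, sample 251 can be cross-checked directly by comparing its recorded diagnosis with the model's out-of-bag prediction; indexing by row name here assumes rfImpute() preserved the original row names:
```{r}
# Recorded diagnosis vs the forest's OOB prediction for sample 251
data.imputed["251", "hd"]
model$predicted["251"]
```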