dataanalysis3fs/08_RegressionModels.Rmd at main · hdsu-bioquant/dataanalysis3fs · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
---
title: "Basic statistics - Regression analysis"
author: "Carl Herrmann, Maïwen Caudron-Herger"
date: "`r Sys.Date()`"
output:
  rmdformats::readthedown:
    highlight: kate
---

```{r knitr_init, echo=FALSE, cache=FALSE}
library(knitr)
library(rmdformats)

## Global options
options(max.print="120")
opts_chunk$set(echo=TRUE,
	             cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=120)
opts_knit$set(root.dir = "~/")
```

# 1. Objectives

In this last part, we want to learn how to build a **regression model** in order to make predictions on certain quantitative variables using other quantitative variables. The important steps here are:

-   **learning** the model
-   **testing** the model to evaluate its performances, and also check that the assumptions of the linearity are given
-   **predict** values based on new data points

We will use again the diabetes data set, and build
(1) simple linear regression models with a single explanatory variable, and
(2) multiple regression models in which we use several variables.
We will focus on predicting the **cholesterol level**.

# 2. Load the data and first analysis

```{r}
tmp = read.table('https://www.dropbox.com/s/zviurze7c85quyw/diabetes_full.csv?dl=1',header=TRUE,sep="\t")
```

We will limit the dataset to the numerical variables

```{r}
dat = tmp[,c(2,3,4,6,8,10,11,13,14,17,18)]
head(dat)
```
First, we have to perform some data cleaning and remove all patients with at least 1 "NA"
```{r}
# Determine which patients have at least 1 NA
i.na <- which( apply(dat, 1, function(x) {sum(is.na(x)) >0 }) )
# alternative: i.na <- which( apply(dat, 1, function(x) {sum(is.na(x))}) >0 )
#
# Select the patients without NAs
dat = dat[-i.na,]
```


Let's look at the correlation between variables with scatter plots!
```{r}
# Plot variable correlation scatterplots for all variables.
pairs(dat,col='red', pch=20, cex = 0.5)
```

That's a good overview, now we make a heatmap and visualize the correlation values.
```{r}
cor = cor(dat)
#
library(pheatmap)
pheatmap(cor,
         cluster_cols = FALSE,
         cluster_rows = FALSE,
         display_numbers = TRUE)
```

> What are the strongest correlations? Do they make sense?

Apart from displaying the raw correlation values, we can test if any of those are **statistically significant**, i.e. if they are significantly positive (one-sided test), negative (one-sided test), or non zero (two-sided test).

For example, there seems to be a positive correlation between *stabilized glucose* (stab.glu) and *hip circumference* (hip).

```{r}
## compute correlation
cor(dat$stab.glu,dat$hip)
##
## test for significance
cor.test(dat$stab.glu,dat$hip)
```

> read carefully the report, and make sure you understand this output. Check other pairs!

# 3. Univariate linear regression

> What is the most promising variable to predict cholesterol level?

We will use glycosilated hemoglobin (`glyhb`) as a predictor of the cholesterol level and the function **lm()**.

```{r}
l.g = lm(chol ~ glyhb, data=dat)
summary(l.g)
```

For a simple lineare regression (with only one explanatory variable), the p-value of the t-test for the slope is identical with the p-value of the F-test for the global model. *This will no longer be the case when including several explanatory variables!*

By the way, have we checked that a linear regression makes sense in this case? Remember that we have to check that:

-   the residuals are **normally distributed**
-   there is **no correlation** between the residuals and the explanatory variable

```{r}
# normal distribution of residuals?
hist(l.g$residuals,breaks=20)
##
qqnorm(l.g$residuals);qqline(l.g$residuals)
##
## correlation residuals x-values?
cor(dat$glyhb,l.g$residuals)
plot(dat$glyhb,l.g$residuals,pch=20)
```

> What is your overall opinion about the validity of the regression model here?

We can now use the model to predict the cholesterol values, and compare them to the real cholesterol values, since we have the information in this dataset.

```{r}
plot(dat$chol,l.g$fitted.values,pch=20,col='blue', xlab='Real values',ylab='Predicted values');abline(0,1,col='red')
```

Not really super convincing, right? Let's put more information into the model!

#4. Multiple regression model

Let us include all the information to try to predict cholesterol level:

```{r}
l.all = lm(chol ~ .,data=dat)
summary(l.all)
```

> Do you note something unexpected in this report?

We can see that the inclusion of several explanatory variables improves the regression. Check the $R^2$ values for example!

We can see that the variables `weight`, `waist` and `hip` do not reach the significance level. Two explanations are possible

1.  either these variables are indeed non-informative regarding the prediction of the cholesterol level
2.  or the mutual correlation between these 3 variables interferes with the model.

We can remove waist and hip for example, and redo the regression

```{r}
l.less = lm(chol ~ stab.glu + hdl + glyhb + age + height + weight  + bp.1s + bp.1d,data=dat)
summary(l.less)
```

We can see how the `weight` variable seems indeed to contribute to the prediction of the cholesterol. The removal of the strongly correlated variables has increased its significance. It now almost reaches significance at the 5% level!

> Check the result of the F-test to compare both models!

We can now check if the prediction are better that with the univariate model

```{r}
plot(dat$chol,l.less$fitted.values,pch=20,col='blue', xlab='Real values',ylab='Predicted values');abline(0,1,col='red')
```

Better? I would say so... To determine the accuracy, we can compute the so called **root mean squared error (RMSE)**

$$
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i-\hat{x_i})^2}
$$

```{r}
n = nrow(dat)
rmse = sqrt(1/n*sum(l.less$residuals^2))
rmse
```

Of course, this is cheating, right? We are predicting the values on **exactly** the same data we used to learn the model. In real machine learning, we need to perform **cross-validation**, i.e. learn the model on one part of the data (*training set*) and validate it on another set (*test set*).

Let us split the dataset in a test and training set randomly

```{r}
##
## take 200 random patients to form the training set
i.train = sample (1:nrow(dat),200)
##
dat.train = dat[i.train,]
dat.test = dat[-i.train,]
```

We now learn a new model on the train dataset:

```{r}
l.train = lm(chol ~ stab.glu + hdl + glyhb + age + height + weight  + bp.1s + bp.1d,data=dat.train)
summary(l.train)
```

We can compute the RMSE for the training dataset

```{r}
n.train = nrow(dat.train)
rmse.train = sqrt(1/n.train*sum(l.train$residuals^2))
rmse.train
```

Let use use that model to predict cholesterol values for the left out test set

```{r}
pred = predict(l.train,newdata = dat.test)
```

and compute the rmse:

```{r}
n.test = nrow(dat.test)
residuals = dat.test$chol - pred
rmse.test = sqrt(1/n.test*sum(residuals^2))
rmse.test
```

Of course, the RMSE is higher on the test dataset, since this does not include the data used for the establishment of the regression model; however, this is a more realistic estimation of the validity of the model, as it indicates how well the model could be extended to novel, independent data!

An important topic here is **feature selection**, i.e. finding the optimal and minimal set of explanatory variables that allow to predict well the output variable.


We now repeat the train/test split 10 times with each time a different random split; plot the 10 rmse.train and rmse.test values!
```{r}
set.seed(345)
RMSE <- sapply(1:10, function(x) {
  i.train = sample (1:nrow(dat),200)
  ##
  dat.train = dat[i.train,]
  dat.test = dat[-i.train,]
  ##
  l.train = lm(chol ~ stab.glu + hdl + glyhb + age + height + weight  + bp.1s + bp.1d,data=dat.train)
  ##
  n.train = nrow(dat.train)
  rmse.train = sqrt(1/n.train*sum(l.train$residuals^2))
  ##
  pred = predict(l.train,newdata = dat.test)
  ##
  n.test = nrow(dat.test)
  residuals = dat.test$chol - pred
  rmse.test = sqrt(1/n.test*sum(residuals^2))
  RMSE <- c(rmse.train,rmse.test)
  RMSE
})
#
#
plot(RMSE[1,], pch = 19, col = "orange", xlab = "Iteration", ylab = "RMSE values", ylim = c(min(RMSE),max(RMSE)+2));points(RMSE[2,], pch = 19, col = "purple");legend("topright", legend = c("rmse.train", "rmse.test"), col = c("orange","purple"), pch = c(19,19));abline(h = mean(RMSE[1,]), lty = 2, col = "orange");abline(h = mean(RMSE[2,]), lty = 2, col = "purple")
```


## EXERCISE

Exercise 1

* Take the multiple regression model, and try to remove step by step unimportant variables
* Can you identify an optimal set of variables? Which criterion would you choose for that?

------------------------------------------------------------------------

## Going further

### Using PCA to determine independent variables

To solve the problem of correlated variables, we can compute principal components on all explanatory variables

```{r}
# remove the first column as this is the output variable
pca = prcomp(dat[,-1]) # col 1 = "chol"
summary(pca)
```

We can display the definition of the principal components in terms of the original variables:

```{r}
par(las=2);barplot(pca$rotation[,1],horiz=TRUE,main='PC1',col='red')
```

Now perform the multiple linear regression model using the PCs as explanatory variables rather than the original variables.

> Check that the PCs are not correlated to each other any more by reproducing the correlation analysis we did at the beginning and display the correlation heatmap.

```{r}
cor.pca <- cor(pca$x)
pheatmap(cor.pca,
         cluster_cols = FALSE,
         cluster_rows = FALSE,
         display_numbers = TRUE)
```


Let us now run the linear regression with all PCs:

```{r}
l.pca = lm(dat$chol ~ pca$x)
summary(l.pca)
```

> Produce the barplots to display the definition of the significant principal components in this analysis

```{r, fig.width=12, fig.height=10}
par(mfrow=c(3,2),mar=c(2,4,2,2))
barplot(pca$rotation[,1],horiz=TRUE,main='PC1',col='red');barplot(pca$rotation[,3],horiz=TRUE,main='PC3',col='red');barplot(pca$rotation[,4],horiz=TRUE,main='PC4',col='red');barplot(pca$rotation[,5],horiz=TRUE,main='PC5',col='red');barplot(pca$rotation[,6],horiz=TRUE,main='PC6',col='red');barplot(pca$rotation[,10],horiz=TRUE,main='PC10',col='red')
```


> Reproduce this analysis for the variable `glyhb`

```{r, fig.width=12, fig.height=10}
l.glyhb = lm(glyhb ~ .,data=dat)
summary(l.glyhb)
#
pca.glyhb = prcomp(dat[,-4]) # col 4 = "glyhb"
summary(pca.glyhb)
#
cor.pca.glyhb <- cor(pca.glyhb$x)
pheatmap(cor.pca.glyhb,
         cluster_cols = FALSE,
         cluster_rows = FALSE,
         display_numbers = TRUE)
#
l.pca.glyhb = lm(dat$glyhb ~ pca.glyhb$x)
summary(l.pca.glyhb)
#
layout(matrix(c(1:3)))
par(las=2, cex=1);barplot(pca.glyhb$rotation[,1],horiz=TRUE,main='PC1',col='red');barplot(pca.glyhb$rotation[,3],horiz=TRUE,main='PC3',col='red');barplot(pca.glyhb$rotation[,5],horiz=TRUE,main='PC5',col='red')
#
```