---
title: "Chapter 1: Linear regression"
subtitle: "Introduction"
author: "Joris Vankerschaver"
header-includes:
- \useinnertheme[shadow=true]{rounded}
- \usecolortheme{rose}
- \setbeamertemplate{footline}[frame number]
- \usepackage{color}
- \usepackage{graphicx}
output:
  beamer_presentation:
    theme: "default"
    keep_tex: true
    includes:
      in_header: columns.tex
---
```{r, include=FALSE}
# Load the class measurements and regress height on palm width.
heights <- read.csv("./datasets/01-linear-regression/heights-2022.csv", stringsAsFactors = TRUE)
m <- lm(Height ~ Palm.width, data = heights)
```
## Problem setting
- 26 observations from class of 2021-22 (19 female and 7 male) + 1 professor (**27 total**)
- Measurement of right \alert{palm width} and \alert{height} (both in cm).
- Random sample? From which population?
- Sources of bias, error?
\begin{block}{Research questions}
\begin{enumerate}
\item Is there an association between height and palm width?
\item Can we predict a person's height from their palm width?
\item If yes, how confident are we in these results?
\end{enumerate}
\end{block}
## Simple and multiple linear regression
- In this lecture, we build a **simple linear regression** model.
- Simple regression: effect on height of a single predictor (palm width)
- Multiple regression: multiple predictors (palm width, gender, year, ...)
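
## Sketch: simple vs. multiple regression in R

A minimal sketch of the distinction above. The data frame `toy` is synthetic (every value is made up) and only borrows the column names `Height`, `Palm.width`, and `Gender` from the class data; `m_simple` and `m_multiple` are illustrative names.

```r
# Synthetic stand-in for the class data; all values are made up.
set.seed(1)
n <- 27
toy <- data.frame(
  Palm.width = runif(n, min = 7, max = 10),
  Gender = factor(rep(c("Female", "Male"), length.out = n))
)
toy$Height <- 90 + 9.5 * toy$Palm.width +
  4 * (toy$Gender == "Male") + rnorm(n, sd = 5)

# Simple linear regression: a single predictor.
m_simple <- lm(Height ~ Palm.width, data = toy)

# Multiple linear regression: several predictors in one formula.
m_multiple <- lm(Height ~ Palm.width + Gender, data = toy)

coef(m_simple)    # intercept and one slope
coef(m_multiple)  # intercept, palm-width slope, and an offset for Gender
```

The only change between the two fits is the right-hand side of the formula.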
## The raw data
```{r, echo=FALSE, fig.height=5, fig.width=6}
par(mfrow = c(1, 1))
plot(heights$Palm.width, heights$Height,
     xlab = "Palm Width", ylab = "Height",
     pch = 21, bg = c("Pink", "Lightblue")[heights$Gender])
grid()
legend(x = "topleft", legend = levels(heights$Gender),
       fill = c("Pink", "Lightblue"))
```
## Associating height with palm width
```{r, echo=FALSE, fig.height=5, fig.width=6}
par(mfrow = c(1, 1))
plot(heights$Palm.width, heights$Height,
     xlab = "Palm Width", ylab = "Height",
     pch = 21, bg = c("Pink", "Lightblue")[heights$Gender])
grid()
legend(x = "topleft", legend = levels(heights$Gender),
       fill = c("Pink", "Lightblue"))
# Evaluate the fitted line and its 95% confidence band on a grid of palm widths.
pred <- data.frame(
  Palm.width = seq(min(heights$Palm.width), max(heights$Palm.width), by = 0.1)
)
pc <- predict(m, interval = "confidence", newdata = pred)
matlines(pred$Palm.width, pc, lty = c(1, 2, 2), col = "black")
# Annotate the plot with the fitted equation (coefficients rounded to 2 decimals).
a <- round(coef(m)[1], 2)
b <- round(coef(m)[2], 2)
legend(8.5, 163, paste("H =", a, "+", b, "W"), bg = "white", box.col = "white", adj = 0.2)
```
## Via R
\footnotesize
```{r, echo=FALSE}
summary(m)
```
\normalsize
## Model diagnostics
```{r, echo=FALSE}
par(mfrow=c(2, 2))
plot(m)
```
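
## Sketch: the quantities behind the diagnostic plots

`plot()` on an `lm` fit draws several diagnostic panels, by default including residuals vs. fitted values and a normal Q-Q plot. The sketch below extracts the underlying quantities and draws two of these panels by hand. It uses synthetic data (all values made up); `toy` and `m_toy` are illustrative names, not the class data.

```r
# Synthetic stand-in for the class data; all values are made up.
set.seed(4)
toy <- data.frame(Palm.width = runif(27, min = 7, max = 10))
toy$Height <- 90 + 9.5 * toy$Palm.width + rnorm(27, sd = 5)
m_toy <- lm(Height ~ Palm.width, data = toy)

res <- resid(m_toy)          # raw residuals: observed minus fitted
fit <- fitted(m_toy)         # fitted values
std_res <- rstandard(m_toy)  # standardized residuals
lev <- hatvalues(m_toy)      # leverage of each observation

# Residuals vs. fitted: look for curvature or a funnel shape.
plot(fit, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest roughly normal errors.
qqnorm(std_res)
qqline(std_res)
```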
## Predicting height from palm width
- Model: $E(H|W = w) = 87.45 + 9.91 \times w$.
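- Predicted expected height of a person with palm width 8.75 cm:
$$
  E(H|W = 8.75) = 87.45 + 9.91 \times 8.75 \approx 174.17 \,\text{cm}
$$
- The rounded coefficients shown give 174.16; the reported 174.17 presumably comes from the unrounded estimates.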

Regression coefficients:

- **Intercept** (87.45 cm): predicted height of a hypothetical student whose palms are 0 cm wide. This lies far outside the data; after mean-centering palm width, the intercept becomes the predicted height at the average palm width.
- **Slope** (9.91): each extra cm in palm width is associated with an average increase of 9.91 cm in height.
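
## Sketch: prediction and mean-centering in R

A small sketch of the two ideas above: predicting at a given palm width and mean-centering the predictor. It uses synthetic data (all values made up), so the numbers will not match the slides; `toy`, `m_toy`, and `m_centered` are illustrative names.

```r
# Synthetic stand-in for the class data; all values are made up.
set.seed(2)
toy <- data.frame(Palm.width = runif(27, min = 7, max = 10))
toy$Height <- 90 + 9.5 * toy$Palm.width + rnorm(27, sd = 5)
m_toy <- lm(Height ~ Palm.width, data = toy)

# Predicted expected height at palm width 8.75 cm, two equivalent ways.
by_hand <- unname(coef(m_toy)[1] + coef(m_toy)[2] * 8.75)
by_predict <- unname(predict(m_toy, newdata = data.frame(Palm.width = 8.75)))
all.equal(by_hand, by_predict)

# Mean-centering the predictor leaves the slope unchanged; the intercept
# becomes the predicted height at the average palm width.
toy$Palm.centered <- toy$Palm.width - mean(toy$Palm.width)
m_centered <- lm(Height ~ Palm.centered, data = toy)
coef(m_centered)
```

In a simple regression, the centered intercept equals the sample mean of the outcome, which is usually easier to interpret than a height at 0 cm palm width.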
## Be careful with extrapolating
Predicting outside the range of the data can yield misleading results.
![](./images/01-linear-regression/xkcd-605.png)
[Source: XKCD](https://xkcd.com/605/)
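
## Sketch: flagging extrapolation

One simple guard, sketched under assumptions: before predicting, check whether the new palm widths fall inside the range seen in the data. The helper `in_observed_range()` is hypothetical (not part of R or of the lecture code) and the widths below are made up.

```r
# Hypothetical helper: TRUE where a new value lies within the observed range.
in_observed_range <- function(x_new, x_obs) {
  x_new >= min(x_obs) & x_new <= max(x_obs)
}

observed_widths <- c(7.2, 8.0, 8.75, 9.4, 10.1)  # made-up palm widths (cm)

# 8.75 is inside the observed range; 0 and 12 would be extrapolation.
in_observed_range(c(8.75, 0, 12), observed_widths)
```

Being inside the range does not guarantee a good prediction, but being outside it is a clear warning sign.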
## What is the uncertainty in our prediction?
Assuming that our model is good, how accurate are the predictions from it?
For prediction $E(H|W = 8.75) = 174.17 \,\text{cm}$:
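- 95% confidence interval: $[171.27, 177.08]$. Uncertainty in the **average prediction**, i.e. the mean height of all people with this palm width.
- 95% prediction interval: $[162.56, 185.79]$. Uncertainty in **individual predictions**, i.e. the height of one person with this palm width. It is wider because it also includes person-to-person variation.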
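
## Sketch: computing both intervals

A sketch of how the two intervals can be obtained with `predict()` and reproduced by hand using the standard formulas for simple linear regression. The data are synthetic (all values made up), so the intervals differ from those on the slide; `toy`, `m_toy`, and the other variable names are illustrative.

```r
# Synthetic stand-in for the class data; all values are made up.
set.seed(3)
toy <- data.frame(Palm.width = runif(27, min = 7, max = 10))
toy$Height <- 90 + 9.5 * toy$Palm.width + rnorm(27, sd = 5)
m_toy <- lm(Height ~ Palm.width, data = toy)

new_obs <- data.frame(Palm.width = 8.75)
conf_int <- predict(m_toy, newdata = new_obs, interval = "confidence", level = 0.95)
pred_int <- predict(m_toy, newdata = new_obs, interval = "prediction", level = 0.95)

# The same intervals by hand.
n <- nrow(toy)
w_bar <- mean(toy$Palm.width)
s_ww <- sum((toy$Palm.width - w_bar)^2)
sigma_hat <- summary(m_toy)$sigma        # residual standard error
t_crit <- qt(0.975, df = n - 2)          # two-sided 95% critical value
fit <- unname(conf_int[1, "fit"])

se_mean <- sigma_hat * sqrt(1 / n + (8.75 - w_bar)^2 / s_ww)      # mean response
se_pred <- sigma_hat * sqrt(1 + 1 / n + (8.75 - w_bar)^2 / s_ww)  # new individual

ci_hand <- fit + c(-1, 1) * t_crit * se_mean
pi_hand <- fit + c(-1, 1) * t_crit * se_pred
```

The extra `1 +` under the square root is the variance of a single new observation, which is why the prediction interval is always the wider of the two.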
## What is the uncertainty in our prediction?
```{r, echo=FALSE, fig.height=5, fig.width=6}
par(mfrow = c(1, 1))
plot(heights$Palm.width, heights$Height,
     xlab = "Palm Width", ylab = "Height",
     pch = 21, bg = c("Pink", "Lightblue")[heights$Gender])
grid()
pred <- data.frame(
  Palm.width = seq(min(heights$Palm.width), max(heights$Palm.width), by = 0.1)
)
# Avoid the name `pi`, which would mask R's built-in constant.
pred_int <- predict(m, interval = "prediction", newdata = pred)
matlines(pred$Palm.width, pred_int, lty = c(1, 3, 3), col = "black")
conf_int <- predict(m, interval = "confidence", newdata = pred)
matlines(pred$Palm.width, conf_int, lty = c(1, 2, 2), col = "black")
legend(7, 190,
       legend = c("Prediction", "95% P.I.", "95% C.I."),
       lty = c(1, 3, 2))
```
## Association between predictor and outcome
The estimated regression slope $\hat\beta = 9.91$ quantifies the association between palm width and height: the expected change in height per extra cm of palm width.

- If the true slope $\beta$ is 0: no linear association
- If $\beta$ differs from 0: some degree of linear association
How do we test whether $\beta$ is 0?
```{r, echo=TRUE}
summary(m)$coefficients
```
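
## Sketch: the slope test by hand

The coefficient table reports a $t$-test of $H_0: \beta = 0$ for each coefficient. A sketch of how the slope row can be reproduced by hand, using synthetic data (all values made up); `toy`, `m_toy`, and the other variable names are illustrative.

```r
# Synthetic stand-in for the class data; all values are made up.
set.seed(5)
toy <- data.frame(Palm.width = runif(27, min = 7, max = 10))
toy$Height <- 90 + 9.5 * toy$Palm.width + rnorm(27, sd = 5)
m_toy <- lm(Height ~ Palm.width, data = toy)

coefs <- summary(m_toy)$coefficients
est <- coefs["Palm.width", "Estimate"]
se <- coefs["Palm.width", "Std. Error"]

# Test statistic: estimate divided by its standard error.
t_stat <- est / se

# Two-sided p-value from a t distribution with n - 2 degrees of freedom.
p_value <- 2 * pt(abs(t_stat), df = df.residual(m_toy), lower.tail = FALSE)

c(t = t_stat, p = p_value)
```

A small p-value is evidence against $\beta = 0$, that is, evidence of a linear association between palm width and height.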