01_Regression-0.Rmd

---
title: "Chapter 1: Linear regression"
subtitle: "Introduction"
author: "Joris Vankerschaver"
header-includes:
  - \useinnertheme[shadow=true]{rounded}
  - \usecolortheme{rose}
  - \setbeamertemplate{footline}[frame number]
  - \usepackage{color}
  - \usepackage{graphicx}
output: 
  beamer_presentation:
    theme: "default"
    keep_tex: true
    includes:
      in_header: columns.tex
---


```{r, include=F}
heights <- read.csv("./datasets/01-linear-regression/heights-2022.csv", stringsAsFactors = T)
m <- lm(Height ~ Palm.width, data=heights)
```

## Problem setting

- 26 observations from class of 2021-22 (19 female and 7 male) + 1 professor (**27 total**)
- Measurement of right \alert{palm width} and \alert{height} (both in cm).
- Random sample? From which population?
- Sources of bias, error?

\begin{block}{Research questions}
\begin{enumerate}
\item Is there an association between height and palm width?
\item Can we predict a person's height from their palm width?
\item If yes, how confident are we in these results?
\end{enumerate}
\end{block}

## Simple and multiple linear regression

- In this lecture, we build a **simple linear regression** model.
- Simple regression: effect on height of a single predictor (palm width)
- Multiple regression: multiple predictors (palm width, gender, year, ...)

## The raw data

```{r, echo=FALSE, fig.height=5, fig.width=6}
par(mfrow=c(1, 1))
plot(heights$Palm.width, heights$Height, 
     xlab = "Palm Width", ylab = "Height",
     pch = 21, bg = c("Pink", "Lightblue")[heights$Gender])
grid()
legend(x = "topleft", legend = levels(heights$Gender), 
       fill =c("Pink", "Lightblue"))

```

## Associating height with palm width

```{r, echo=FALSE, fig.height=5, fig.width=6}
par(mfrow=c(1, 1))
plot(heights$Palm.width, heights$Height, 
     xlab = "Palm Width", ylab = "Height",
     pch = 21, bg = c("Pink", "Lightblue")[heights$Gender])
grid()
legend(x = "topleft", legend = levels(heights$Gender), 
       fill =c("Pink", "Lightblue"))

pred <- data.frame(
  Palm.width=seq(min(heights$Palm.width), max(heights$Palm.width), by=0.1)
)
pc <- predict(m, interval="c", newdata = pred)
matlines(pred$Palm.width, pc, lty=c(1, 2, 2), col = "black")

a <- round(coef(m)[1], 2)
b <- round(coef(m)[2], 2)
legend(8.5, 163, paste("H =", a, "+", b, "W"), bg="white", box.col="white", adj=0.2)
```

## Via R

\footnotesize
```{r, echo=FALSE}
summary(m)
```
\normalsize

## Model diagnostics

```{r, echo=FALSE}
par(mfrow=c(2, 2))
plot(m)
```

## Predicting height from palm width

- Model: $E(H|W = w) = 87.45 + 9.91 \times w$. 

- Predicted expected height of a person with palm width 8.75cm:
$$
  E(H|W = 8.75) = 87.45 + 9.91 \times 8.75 =  174.17 \,\text{cm}
$$

Regression coefficients:

- **Intercept** (87.45cm): height of a hypothetical student with palms that are 0 cm wide. Often makes more sense after mean-centering.
- **Slope** (9.91): each extra cm in palm width is associated with an increase of 9.91 cm in height.


## Be careful with extrapolating

Predicting outside the range of the data can yield misleading results.

![](./images/01-linear-regression/xkcd-605.png)

[Source: XKCD](https://xkcd.com/605/)

## What is the uncertainty in our prediction?

Assuming that our model is good, how accurate are the predictions from it?

For prediction $E(H|W = 8.75) =  174.17 \,\text{cm}$:

- 95% confidence interval: $[171.27, 177.08]$. Uncertainty in **average prediction**.
- 95% prediction interval: $[162.56, 185.79]$. Uncertainty in **individual predictions**.

## What is the uncertainty in our prediction?

```{r, echo=FALSE, fig.height=5, fig.width=6}
par(mfrow=c(1, 1))
plot(heights$Palm.width, heights$Height, 
     xlab = "Palm Width", ylab = "Height",
     pch = 21, bg = c("Pink", "Lightblue")[heights$Gender])
grid()

pred <- data.frame(
  Palm.width=seq(min(heights$Palm.width), max(heights$Palm.width), by=0.1)
)
pi <- predict(m, interval="p", newdata = pred)
matlines(pred$Palm.width, pi, lty=c(1, 3, 3), col = "black")
ci <- predict(m, interval="c", newdata = pred)
matlines(pred$Palm.width, ci, lty=c(1, 2, 2), col = "black")

legend(7, 190, 
       legend = c("Prediction", "95% P.I.", "95% C.I."),
       lty = c(1, 3, 2))
```

## Association between predictor and outcome

The regression slope $\beta = 9.91$ measures the strength of the  association between palm width and height.

- If close to 0: no association
- If different from 0: some degree of association

How do we test whether $\beta$ is 0? 

```{r, echo=TRUE}
summary(m)$coefficients
```