This repository has been archived by the owner on Sep 30, 2018. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 41
/
2018-03-14 Multivariate Linear Models Practice.Rmd
238 lines (190 loc) · 8.29 KB
/
2018-03-14 Multivariate Linear Models Practice.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
---
title: "Multivariate Linear Models Practice"
author: "neoglez"
date: "March 7, 2018"
output:
html_document:
code_folding: hide
fig_height: 7
fig_retina: null
pdf_document: default
---
## `knitr` Options
```{r}
knitr::opts_chunk$set(fig.align = "center")
knitr::opts_chunk$set(results = "hold")
```
## `R` Packages
```{r, message=FALSE}
```
## Pratice
### Easy.
**5E1.** Which of the linear models below are multiple linear regressions?
(1) $\mu_i = \alpha + \beta x_i$
(2) $\mu_i = \beta_x x_i + \beta_z z_i$
(3) $\mu_i = \alpha + \beta(x_i - z_i)$
(4) $\mu_i = \alpha + \beta_x x_i + \beta_z z_i$
$(2)$ which has two predictors ($x_i, z_i$), we can consider the intercept in this case to be $\alpha = 0$
$(3)$ is also mutivariate ($x_i, z_i$). In this case the coefficient is the same for both predictors.
$(4)$ has again two predictors and the coeficients are different.
### Medium.
**5M1.** Invent your own example of a spurious correlation. An outcome variable should be correlated with both predictor variables. But when both predictors are entered in the same model, the correlation between the outcome and one of the predictors should mostly vanish (or at least be greatly reduced).
[https://en.wikipedia.org/wiki/Spurious_relationship](example taken and adapted from wikipedia) Ice cream sales. These sales are the highest when the rate of drownings in city swimming pools is highest. The sales are also very high when there is a heat wave. To allege that ice cream sales cause drowning, or vice versa, would be to imply a spurious relationship beetween the two. In reality, a heat wave may have caused both. That is, by considering the temperature as a predictor, we would observe that the correlation between drowinings and ice sales would almost vanish.Let us simulate that. Following the overthinking on page 134, we asume $x_{real}$ to be the truly causal predictor, the temperature/heat wave that influence both the outcome $y$ (drowning) and a spurious predictor $x_{spur}$, ice cream sales.
```{r}
N <- 100 # number of cases
x_real <- rnorm( N, 2 ) # x_real, temperature as Gaussian with mean 0 and stddev 2
x_spur <- rnorm( N , x_real ) # x_spur, ice creame sales as Gaussian with mean=x_real
y <- rnorm( N , x_real ) # y, drowinings as Gaussian with mean=x_real
d <- data.frame(y,x_real,x_spur) # bind all together in data frame
plot(y ~ x_spur)
plot(y ~ x_real)
```
### Hard.
All three exercises below use the same data, data(foxes) (part of rethinking). The urban fox (Vulpes vulpes) is a successful exploiter of human habitat. Since urban foxes move in packs and defend territories, data on habitat quality and population density is also included. The data frame has five columns:
(1) group: Number of the social group the individual fox belongs to
(2) avgfood: The average amount of food available in the territory
(3) groupsize: The number of foxes in the social group
(4) area: Size of the territory
(5) weight: Body weight of the individual fox
**5H1.** Fit two bivariate Gaussian regressions, using map: (1) body weight as a linear function of territory size (area), and (2) body weight as a linear function of groupsize. Plot the results of these regressions, displaying the MAP regression line and the 95% interval of the mean. Is either variable important for predicting fox body weight?
(1) Body weight as a linear function of territory size (area). Plot the result of this regression.
```{r}
# Load library and data
library(rethinking)
data(foxes)
d <- foxes
# standardize territory size (predictor)
d$area.s <- (d$area-mean(d$area))/
sd(d$area)
# fit model with map
m5h1.1 <- map(
alist(
d$weight ~ dnorm( mu , sigma ) ,
mu <- a + bTs * area.s ,
a ~ dnorm( 10 , 10 ) ,
bTs ~ dnorm( 0 , 1 ) ,
sigma ~ dunif( 0 , 10 )
) , data = d )
# compute percentile interval of mean
TS.seq <- seq( from=-3 , to=3.5 , length.out=30 )
mu <- link( m5h1.1 , data=data.frame(area.s=TS.seq) )
mu.PI <- apply( mu , 2 , PI, prob=0.95 )
# plot it all
plot( weight ~ area.s , data=d , col=rangi2 )
abline( m5h1.1 )
shade( mu.PI , TS.seq )
```
(2) Body weight as a linear function of groupsize. Plot the result of this regression.
```{r}
# Load library and data
library(rethinking)
data(foxes)
d <- foxes
# standardize group size (predictor)
d$groupsize.s <- (d$groupsize-mean(d$groupsize))/
sd(d$groupsize)
# fit model with map
m5h1.2 <- map(
alist(
d$weight ~ dnorm( mu , sigma ) ,
mu <- a + bGs * groupsize.s ,
a ~ dnorm( 10 , 10 ) ,
bGs ~ dnorm( 0 , 1 ) ,
sigma ~ dunif( 0 , 10 )
) , data = d )
# compute percentile interval of mean
GS.seq <- seq( from=-3 , to=3.5 , length.out=100 )
mu <- link( m5h1.2 , data=data.frame(groupsize.s=GS.seq) )
mu.PI <- apply( mu , 2 , PI, prob=0.95 )
# plot it all
plot( weight ~ groupsize.s , data=d , col=rangi2 )
abline( m5h1.2 )
shade( mu.PI , GS.seq )
```
- Is either variable important for predicting fox body weight?
By looking at the grapichs we observe very weak correlations. Therefore we conclude that either the territory size nor group size are important to predict body weight.
**5H2.** Now fit a multiple linear regression with weight as the outcome and both area and groupsize
as predictor variables. Plot the predictions of the model for each predictor, holding the other predictor
constant at its mean. What does this model say about the importance of each variable? Why do you
get different results than you got in the exercise just above?
- fit a multiple linear regression with weight as the outcome and both area and groupsize
as predictor variables.
```{r}
# Load library and data
library(rethinking)
data(foxes)
d <- foxes
# standardize territory size (predictor)
d$area.s <- (d$area-mean(d$area))/
sd(d$area)
# standardize group size (predictor)
d$groupsize.s <- (d$groupsize-mean(d$groupsize))/
sd(d$groupsize)
# fit model with map
m5h2 <- map(
alist(
d$weight ~ dnorm( mu , sigma ) ,
mu <- a + bTs * area.s + bGs * groupsize.s ,
a ~ dnorm( 10 , 10 ) ,
bTs ~ dnorm( 0, 1 ),
bGs ~ dnorm( 0 , 1 ) ,
sigma ~ dunif( 0 , 10 )
) , data = d )
precis(m5h2)
plot( precis(m5h2))
```
- Plot the predictions of the model for each predictor, holding the other predictor
constant at its mean.
We consider this plot to be *counterfactual* because we are asked to *see* how the prediction changes as we change one predictor at a time while holding the other constant at its mean.
```{r}
# plot predictions of model for territory size keeping group size contant at its mean
#####################################################################################
# prepare new counterfactual data
A.avg <- mean( d$groupsize.s )
R.seq <- seq( from=-3 , to=3.5 , length.out=100 )
pred.data <- data.frame(
area.s=R.seq,
groupsize.s=A.avg
)
# compute counterfactual mean weight (mu)
mu <- link( m5h2 , data=pred.data )
mu.mean <- apply( mu , 2 , mean )
mu.PI <- apply( mu , 2 , PI )
# simulate counterfactual weight outcomes
#R.sim <- sim( m5h2 , data=pred.data , n=1e4 )
#R.PI <- apply( R.sim , 2 , PI )
# display predictions, hiding raw data with type="n"
plot( weight ~ area.s , data=d , type="n" )
mtext( "groupsze.s = 0" )
lines( R.seq , mu.mean )
shade( mu.PI , R.seq )
#shade( R.PI , R.seq )
```
```{r}
# plot predictions of model for group size keeping territory size contant at its mean
#####################################################################################
# prepare new counterfactual data
A.seq <- seq( from=-3 , to=3.5 , length.out=100 )
R.avg <- mean( d$area.s )
pred.data <- data.frame(
area.s=R.avg,
groupsize.s=A.seq
)
# compute counterfactual mean weight (mu)
mu <- link( m5h2 , data=pred.data )
mu.mean <- apply( mu , 2 , mean )
mu.PI <- apply( mu , 2 , PI )
# simulate counterfactual weight outcomes
#R.sim <- sim(m5h2, data=pred.data, n=1e4)
#R.PI <- apply(R.sim, 2, PI)
# display predictions, hiding raw data with type="n"
plot( weight ~ groupsize.s , data=d , type="n" )
mtext( "area.s = 0" )
lines( R.seq , mu.mean )
shade( mu.PI , R.seq )
#shade( R.PI , R.seq )
```
- What does this model say about the importance of each variable?
We observe that the variables are indeed important but one is positively correlated (territory size) and the other negatively correlated (group size).
- Why do you get different results than you got in the exercise just above?
Because the correlation between the variables and the outcome is masked.