Adding week 12 material
eveskew committed Apr 13, 2020
1 parent 0f8535e commit 095a5f2
Showing 4 changed files with 362 additions and 0 deletions.
54 changes: 54 additions & 0 deletions homework/hw_week_12.Rmd
@@ -0,0 +1,54 @@
---
title: "EEEB UN3005/GR5005 \nHomework - Week 12 - Due 23 Apr 2020"
author: "USE THE NUMERIC PORTION OF YOUR UNI HERE"
output: pdf_document
fontsize: 12pt
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
```


**Homework Instructions:** Complete this assignment by writing code in the code chunks provided. If required, provide written explanations **below** the relevant code chunks. Replace "USE THE NUMERIC PORTION OF YOUR UNI HERE" in the document header with the numbers appearing in your UNI. When complete, knit this document within RStudio to generate a PDF file. Please review the resulting PDF to ensure that all content relevant for grading (i.e., code, code output, and written explanations) appears in the document. Rename your PDF document according to the following format: hw_week_12_UNInumbers.pdf. Upload this final homework document to CourseWorks by 5 pm on the due date.


After loading the `rethinking` package, running `data(salamanders)` will give you a dataset of salamander counts (the `SALAMAN` variable) recorded at 47 forest plots. Assuming a common exposure across forest plots (i.e., the data were collected over the same sampling interval at every plot), these count data are well suited to modeling as a Poisson variable.
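
If you'd like to get oriented before modeling, a minimal sketch for importing and inspecting the data is shown below. The chunk is set to `eval = FALSE` so it won't affect your knitted document, and the object name `d` is just a placeholder.

```{r, eval = FALSE}
# Load the salamander data from the rethinking package and take a quick look
data(salamanders)
d <- salamanders

str(d)   # variable names and types, including SALAMAN and PCTCOVER
head(d)  # first few forest plots
```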


## Problem 1 (4 points)

Model the relationship between salamander count (`SALAMAN`) and percentage of vegetation cover on the forest floor (`PCTCOVER`) using a Poisson generalized linear model (GLM). Use priors of `dnorm(0, 10)` and explicit start values of 0 for all model parameters.

Use `precis()` to report the 97% PIs of the fitted model parameters. What does your model suggest about the directional effect of vegetation cover on salamander counts?

```{r}
```
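
As a rough template (one possible way to structure the model, not necessarily the only acceptable specification), something along these lines could work. The chunk below is set to `eval = FALSE` so it won't run when you knit, and the object and parameter names (`m.salamander`, `a`, `bc`) are placeholders.

```{r, eval = FALSE}
# Sketch of a Poisson GLM for salamander counts as a function of cover
data(salamanders)

m.salamander <- map(
  data = salamanders,
  alist(
    SALAMAN ~ dpois(lambda),
    log(lambda) <- a + bc*PCTCOVER,
    a ~ dnorm(0, 10),
    bc ~ dnorm(0, 10)
  ),
  start = list(a = 0, bc = 0)
)

# 97% percentile intervals for the fitted parameters
precis(m.salamander, prob = 0.97)
```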


## Problem 2 (4 points)

Refit the same model as in Problem 1, this time using `map2stan()`, specifying four MCMC chains. Don't worry about the large amount of R console output that will turn up in your knit PDF document.

After you've fit the model, report the 97% HPDIs of model parameters using `precis()`, and use a method of your choice (two were shown in lecture) to display parameter trace plots from the fit model.

```{r}
```
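
A sketch of one possible way to structure the refit is below (again unevaluated, with placeholder object names). Trimming the data frame to just the variables in the model mirrors the approach used in lecture to avoid Stan issues.

```{r, eval = FALSE}
# Trim to the variables used in the model, as in lecture
d.trim <- salamanders[, c("SALAMAN", "PCTCOVER")]

# Refit the Problem 1 model with map2stan(), using four MCMC chains
m.salamander.stan <- map2stan(
  data = d.trim,
  alist(
    SALAMAN ~ dpois(lambda),
    log(lambda) <- a + bc*PCTCOVER,
    a ~ dnorm(0, 10),
    bc ~ dnorm(0, 10)
  ),
  chains = 4
)

# 97% HPDIs for the fitted parameters and parameter trace plots
precis(m.salamander.stan, prob = 0.97)
plot(m.salamander.stan)
```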


## Problem 3 (2 points)

Generate a plot showing the raw salamander count data (against vegetation cover) along with model-based predictions from the Poisson GLM you fit in Problem 1. More specifically, plot a line showing the mean predicted salamander count and a shaded 97% HPDI for the predictions.

As a hint, use `sim()` to generate your predictions. This will require counterfactual data, so start your prediction workflow by defining a sequence of predictor values to generate predictions for. From there, everything should be very similar to examples you've encountered previously in class.

Using your plot to aid interpretation, in what ways does the model perform well, and in what ways does it perform poorly?

```{r}
```
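
One possible workflow is sketched below, assuming the Problem 1 model object and assuming `PCTCOVER` spans 0 to 100 percent cover; the chunk is unevaluated and the object names are placeholders.

```{r, eval = FALSE}
# Counterfactual predictor values spanning the range of percent cover
cover.seq <- seq(from = 0, to = 100, length.out = 50)
pred.data <- data.frame(PCTCOVER = cover.seq)

# Simulate salamander counts from the Problem 1 model
sal.sim <- sim(m.salamander, data = pred.data, n = 5000)

# Summarize the predictions at each counterfactual cover value
sal.mean <- apply(sal.sim, 2, mean)
sal.HPDI <- apply(sal.sim, 2, HPDI, prob = 0.97)

# Plot raw data, mean prediction line, and shaded 97% HPDI
plot(SALAMAN ~ PCTCOVER, data = salamanders,
     xlab = "Percent vegetation cover", ylab = "Salamander count")
lines(cover.seq, sal.mean)
shade(sal.HPDI, cover.seq)
```
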
52 changes: 52 additions & 0 deletions lab/lab_week_12.Rmd
@@ -0,0 +1,52 @@
---
title: "EEEB UN3005/GR5005 \nLab - Week 12 - 13 and 15 April 2020"
author: "USE YOUR NAME HERE"
output: pdf_document
fontsize: 12pt
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
library(dplyr)
library(ggplot2)
```


**Updated Lab Instructions Due to Remote Teaching:** Complete this assignment by writing code in the code chunks provided. If required, provide written explanations below the relevant code chunks. When complete, knit this document within RStudio to generate a PDF, and upload that document to CourseWorks by 5 pm on 16 April.


# Link Functions and Poisson Regression


## Exercise 1: Importing and Visualizing the Kline Dataset

Import the Kline dataset that was shown in the *Statistical Rethinking* text and lecture. Add a log population size variable (`log_pop`) to the data frame for use as a predictor variable. Now, visualize the relationship between `log_pop` and `total_tools` using a scatter plot in `ggplot()`. Make the x-axis limits of your plot span from 0 to 15 and the y-axis limits of your plot span from 0 to 80. In addition, add the layer `geom_smooth(method = "lm", se = FALSE, fullrange = TRUE)` to your plot in order to display a linear trend line on top of the raw data.

```{r}
```
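
A sketch of one possible approach (unevaluated; `d` is a placeholder name):

```{r, eval = FALSE}
# Import the Kline data and add a log population size variable
data(Kline)
d <- Kline
d$log_pop <- log(d$population)

# Scatter plot of total tools vs. log population size, with a linear trend
# line extended across the full x-axis range
ggplot(d, aes(x = log_pop, y = total_tools)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  xlim(0, 15) +
  ylim(0, 80)
```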


## Exercise 2: Fitting a Poisson GLM and a Standard Linear Model

First, fit a Poisson GLM to the Kline data (with `map()`), using `log_pop` as a predictor of total tool count. This model should replicate m10.12 from the *Statistical Rethinking* book, so reference the text if needed. After fitting the model, use `precis()` to display the 97% PIs for all model parameters.

Now, fit a standard linear model (with a Gaussian outcome distribution) to the Kline data, again using `log_pop` as a predictor of total tool count. Use the following priors: `dnorm(-50, 10)` for the intercept parameter, `dnorm(0, 10)` for the beta coefficient for `log_pop`, and `dunif(0, 10)` for the standard deviation parameter. Use `precis()` to display the 97% PIs for all model parameters after you've fit the model.

```{r}
```
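
As a rough template (unevaluated, with placeholder object names), the models could be structured as shown below. The Poisson model's priors here follow the pattern used in this week's lecture script, so double-check them against m10.12 in the text.

```{r, eval = FALSE}
# Poisson GLM: total tool count as a function of log population size
m.pois <- map(
  data = d,
  alist(
    total_tools ~ dpois(lambda),
    log(lambda) <- a + bp*log_pop,
    a ~ dnorm(0, 100),
    bp ~ dnorm(0, 1)
  )
)

# Standard linear model (Gaussian outcome) with the priors given above
m.gauss <- map(
  data = d,
  alist(
    total_tools ~ dnorm(mu, sigma),
    mu <- a + bp*log_pop,
    a ~ dnorm(-50, 10),
    bp ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  )
)

# 97% percentile intervals for both fitted models
precis(m.pois, prob = 0.97)
precis(m.gauss, prob = 0.97)
```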


## Exercise 3: Comparing Model-based Predictions

Imagine we discover a new Oceanic island with a population of 150 people (log population size of 5.01). Using `sim()`, generate predictions for total tool count on this island for both the Poisson GLM and the standard linear model. Output some of the predictions from each model (using `head()`, for example) just to see what they look like, and report the mean value of the predictions generated from both models.

Which model suggests a higher total tool count for this hypothetical island? Do you think both of these models generate sensible predictions for total tool count? Why or why not?

```{r}
```
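
One possible prediction workflow (unevaluated; assumes the model objects from Exercise 2 and uses placeholder names):

```{r, eval = FALSE}
# Counterfactual data for the hypothetical island (log(150) is about 5.01)
new.island <- data.frame(log_pop = 5.01)

# Simulate total tool counts from each model
preds.pois <- sim(m.pois, data = new.island, n = 1e4)
preds.gauss <- sim(m.gauss, data = new.island, n = 1e4)

# Inspect a few predictions from each model
head(preds.pois)
head(preds.gauss)

# Mean predicted total tool count from each model
mean(preds.pois)
mean(preds.gauss)
```
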
256 changes: 256 additions & 0 deletions lecture/lecture_week_12.R
@@ -0,0 +1,256 @@


library(rethinking)
library(dplyr)

#==========================================================


# Import and prep the country data

data(rugged)
d <- rugged

# Make log version of outcome (GDP in year 2000)
d$log_gdp <- log(d$rgdppc_2000)

# Extract countries with GDP data
dd <- d[complete.cases(d$rgdppc_2000), ]

#==========================================================


# Fit a multiple regression with an interaction term

m8.1 <- map(
  data = dd,
  alist(
    log_gdp ~ dnorm(mu, sigma),
    mu <- a + bA*cont_africa + bR*rugged + bAR*rugged*cont_africa,
    a ~ dnorm(0, 100),
    bA ~ dnorm(0, 10),
    bR ~ dnorm(0, 10),
    bAR ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  )
)


# Fit a multiple regression with an interaction term
# using map2stan()

# Trim data to ensure we don't run into any Stan issues
dd.trim <- dd %>%
  select(log_gdp, rugged, cont_africa)

# Fit the model using map2stan()
m8.1stan <- map2stan(
  data = dd.trim,
  alist(
    log_gdp ~ dnorm(mu, sigma),
    mu <- a + bA*cont_africa + bR*rugged + bAR*rugged*cont_africa,
    a ~ dnorm(0, 100),
    bA ~ dnorm(0, 10),
    bR ~ dnorm(0, 10),
    bAR ~ dnorm(0, 10),
    sigma ~ dcauchy(0, 2)
  )
)

# Compare model fits using precis()
precis(m8.1, prob = 0.97)
precis(m8.1stan, prob = 0.97)

# Other Stan fit model summary functions
show(m8.1stan)
summary(m8.1stan)

# To extract the actual samples embedded in the fit model
# object...
e8.1 <- extract.samples(m8.1stan)
summary(e8.1)

#==========================================================


# Plot parameter trace plots

# Using the rethinking plotting method
plot(m8.1stan)
plot(m8.1stan, window = c(1000, 2000)) # modify x-axis

# Using the rstan plotting method
rstan::traceplot(m8.1stan@stanfit,
                 pars = c("a", "bR", "bA", "bAR", "sigma"))

#==========================================================


# Reminders about the Poisson distribution

# To generate random samples from the Poisson:
rpois(n = 10, lambda = 5)

# Lambda can be any positive number:
rpois(n = 10, lambda = 0.536787)
# Or 0:
rpois(n = 10, lambda = 0)
# But it can't be negative:
rpois(n = 10, lambda = -1)

# The Poisson probability mass function:
dpois(0, lambda = 5)
dpois(3, lambda = 5)
dpois(5, lambda = 5)
dpois(20, lambda = 5)
dpois(100, lambda = 5)

#==========================================================


# Fitting a Poisson GLM


# Import the Kline dataset
data(Kline)
d <- Kline
d

# Generate modified predictor variables
d$log_pop <- log(d$population)
d$contact_high <- ifelse(d$contact == "high", 1, 0)
d

# Fit a Poisson GLM with the effects of log population
# size, contact rate, and their interaction
m10.10 <- map(
  data = d,
  alist(
    total_tools ~ dpois(lambda),
    log(lambda) <- a + bp*log_pop + bc*contact_high + bpc*log_pop*contact_high,
    a ~ dnorm(0, 100),
    bp ~ dnorm(0, 1),
    bc ~ dnorm(0, 1),
    bpc ~ dnorm(0, 1)
  )
)

precis(m10.10, prob = 0.97)

#==========================================================


# Generate model-based predictions (the manual way)

# Get 5,000 posterior samples from the fit model
post <- extract.samples(m10.10, n = 5000)

# Generate expected lambda values for an island with a
# log population size of 8 and high contact rate by
# applying the inverse link function to the linear model
# for lambda
lambda.high <-
  exp(post$a + post$bp*8 + post$bc*1 + post$bpc*8*1)

# Generate expected lambda values for an island with a
# log population size of 8 and low contact rate by
# applying the inverse link function to the linear model
# for lambda
lambda.low <-
  exp(post$a + post$bp*8 + post$bc*0 + post$bpc*8*0)

# Remember, these represent the expected values
# for lambda, NOT the actual counts
head(lambda.high)
head(lambda.low)

# To get count predictions, we need to feed these
# lambda values through rpois()
preds.high <- rpois(n = 5000, lambda.high)
preds.low <- rpois(n = 5000, lambda.low)

head(preds.high)
head(preds.low)


# Or we could generate the same predictions by using the
# convenience function sim(), feeding it counterfactual
# data

# Generate counterfactual data for high and low contact
# islands
counterfactual.high <-
  data.frame(log_pop = 8, contact_high = 1)
counterfactual.low <-
  data.frame(log_pop = 8, contact_high = 0)

# Generate predictions using sim()
preds.high.sim <-
  sim(m10.10, n = 5000, data = counterfactual.high)
preds.low.sim <-
  sim(m10.10, n = 5000, data = counterfactual.low)

head(preds.high.sim)
head(preds.low.sim)


# Visualize the predictions

par(mfrow = c(2, 2))

simplehist(preds.high,
           xlab = "", ylab = "",
           xlim = c(0, 60), ylim = c(0, 500),
           col = "red",
           main = "High contact preds, manual")

simplehist(preds.high.sim,
           xlab = "", ylab = "",
           xlim = c(0, 60), ylim = c(0, 500),
           # col.alpha() comes with rethinking; alpha() would need scales/ggplot2
           col = col.alpha("red", 0.5),
           main = "High contact preds, sim")

simplehist(preds.low,
           xlab = "", ylab = "",
           xlim = c(0, 60), ylim = c(0, 500),
           col = "blue",
           main = "Low contact preds, manual")

simplehist(preds.low.sim,
           xlab = "", ylab = "",
           xlim = c(0, 60), ylim = c(0, 500),
           col = col.alpha("blue", 0.5),
           main = "Low contact preds, sim")

#==========================================================


# Fit the same Poisson GLM using map2stan() with
# 4 Markov chains
m10.10stan <- map2stan(
  data = d,
  alist(
    total_tools ~ dpois(lambda),
    log(lambda) <- a + bp*log_pop + bc*contact_high + bpc*log_pop*contact_high,
    a ~ dnorm(0, 100),
    bp ~ dnorm(0, 1),
    bc ~ dnorm(0, 1),
    bpc ~ dnorm(0, 1)
  ),
  chains = 4
)

# Compare model fits using precis()
precis(m10.10, prob = 0.97)
precis(m10.10stan, prob = 0.97)

# Extract samples
e <- extract.samples(m10.10stan)
summary(e)

# Plot parameter trace plots
plot(m10.10stan)
plot(m10.10stan, window = c(500, 2000))
plot(m10.10stan, window = c(1000, 2000))
rstan::traceplot(m10.10stan@stanfit,
                 pars = c("a", "bp", "bc", "bpc"))
Binary file added lecture/lecture_week_12.pptx
