Adding week 12 material
eveskew committed Apr 13, 2020
1 parent 0f8535e commit 095a5f2
Showing 4 changed files with 362 additions and 0 deletions.
54 changes: 54 additions & 0 deletions homework/hw_week_12.Rmd
@@ -0,0 +1,54 @@
---
title: "EEEB UN3005/GR5005 \nHomework - Week 12 - Due 23 Apr 2020"
author: "USE THE NUMERIC PORTION OF YOUR UNI HERE"
output: pdf_document
fontsize: 12pt
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
```


**Homework Instructions:** Complete this assignment by writing code in the code chunks provided. If required, provide written explanations **below** the relevant code chunks. Replace "USE THE NUMERIC PORTION OF YOUR UNI HERE" in the document header with the numbers appearing in your UNI. When complete, knit this document within RStudio to generate a PDF file. Please review the resulting PDF to ensure that all content relevant for grading (i.e., code, code output, and written explanations) appears in the document. Rename your PDF document according to the following format: hw_week_12_UNInumbers.pdf. Upload this final homework document to CourseWorks by 5 pm on the due date.


After loading the `rethinking` package, running `data(salamanders)` will give you a dataset of salamander counts (the `SALAMAN` variable) recorded at 47 forest plots. Assuming a common exposure across forest plots (i.e., the data were collected over the same sampling interval at every plot), these count data are well suited to modeling as a Poisson variable.
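
If you'd like to get oriented before modeling, a minimal sketch for importing and inspecting the data is shown below. The chunk is set to `eval = FALSE` so it won't affect your knitted document, and the object name `d` is just a placeholder.

```{r, eval = FALSE}
# Load the salamander data from the rethinking package and take a quick look
data(salamanders)
d <- salamanders

str(d)   # variable names and types, including SALAMAN and PCTCOVER
head(d)  # first few forest plots
```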


## Problem 1 (4 points)

Model the relationship between salamander count (`SALAMAN`) and percentage of vegetation cover on the forest floor (`PCTCOVER`) using a Poisson generalized linear model (GLM). Use priors of `dnorm(0, 10)` and explicit start values of 0 for all model parameters.

Use `precis()` to report the 97% PIs of the fitted model parameters. What does your model suggest about the directional effect of vegetation cover on salamander counts?

```{r}
```
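
As a rough template (one possible way to structure the model, not necessarily the only acceptable specification), something along these lines could work. The chunk below is set to `eval = FALSE` so it won't run when you knit, and the object and parameter names (`m.salamander`, `a`, `bc`) are placeholders.

```{r, eval = FALSE}
# Sketch of a Poisson GLM for salamander counts as a function of cover
data(salamanders)

m.salamander <- map(
  data = salamanders,
  alist(
    SALAMAN ~ dpois(lambda),
    log(lambda) <- a + bc*PCTCOVER,
    a ~ dnorm(0, 10),
    bc ~ dnorm(0, 10)
  ),
  start = list(a = 0, bc = 0)
)

# 97% percentile intervals for the fitted parameters
precis(m.salamander, prob = 0.97)
```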


## Problem 2 (4 points)

Refit the same model as in Problem 1, this time using `map2stan()`, specifying four MCMC chains. Don't worry about the large amount of R console output that will turn up in your knit PDF document.

After you've fit the model, report the 97% HPDIs of model parameters using `precis()`, and use a method of your choice (two were shown in lecture) to display parameter trace plots from the fit model.

```{r}
```
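
A sketch of one possible way to structure the refit is below (again unevaluated, with placeholder object names). Trimming the data frame to just the variables in the model mirrors the approach used in lecture to avoid Stan issues.

```{r, eval = FALSE}
# Trim to the variables used in the model, as in lecture
d.trim <- salamanders[, c("SALAMAN", "PCTCOVER")]

# Refit the Problem 1 model with map2stan(), using four MCMC chains
m.salamander.stan <- map2stan(
  data = d.trim,
  alist(
    SALAMAN ~ dpois(lambda),
    log(lambda) <- a + bc*PCTCOVER,
    a ~ dnorm(0, 10),
    bc ~ dnorm(0, 10)
  ),
  chains = 4
)

# 97% HPDIs for the fitted parameters and parameter trace plots
precis(m.salamander.stan, prob = 0.97)
plot(m.salamander.stan)
```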


## Problem 3 (2 points)

Generate a plot showing the raw salamander count data (against vegetation cover) along with model-based predictions from the Poisson GLM you fit in Problem 1. More specifically, plot a line showing the mean predicted salamander count and a shaded 97% HPDI for the predictions.

As a hint, use `sim()` to generate your predictions. This will require counterfactual data, so start your prediction workflow by defining a sequence of predictor values to generate predictions for. From there, everything should be very similar to examples you've encountered previously in class.

Using your plot to aid interpretation, in what ways does the model perform well, and in what ways does it perform poorly?

```{r}
```
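
One possible workflow is sketched below, assuming the Problem 1 model object and assuming `PCTCOVER` spans 0 to 100 percent cover; the chunk is unevaluated and the object names are placeholders.

```{r, eval = FALSE}
# Counterfactual predictor values spanning the range of percent cover
cover.seq <- seq(from = 0, to = 100, length.out = 50)
pred.data <- data.frame(PCTCOVER = cover.seq)

# Simulate salamander counts from the Problem 1 model
sal.sim <- sim(m.salamander, data = pred.data, n = 5000)

# Summarize the predictions at each counterfactual cover value
sal.mean <- apply(sal.sim, 2, mean)
sal.HPDI <- apply(sal.sim, 2, HPDI, prob = 0.97)

# Plot raw data, mean prediction line, and shaded 97% HPDI
plot(SALAMAN ~ PCTCOVER, data = salamanders,
     xlab = "Percent vegetation cover", ylab = "Salamander count")
lines(cover.seq, sal.mean)
shade(sal.HPDI, cover.seq)
```
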
52 changes: 52 additions & 0 deletions lab/lab_week_12.Rmd
@@ -0,0 +1,52 @@
---
title: "EEEB UN3005/GR5005 \nLab - Week 12 - 13 and 15 April 2020"
author: "USE YOUR NAME HERE"
output: pdf_document
fontsize: 12pt
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(rethinking)
library(dplyr)
library(ggplot2)
```


**Updated Lab Instructions Due to Remote Teaching:** Complete this assignment by writing code in the code chunks provided. If required, provide written explanations below the relevant code chunks. When complete, knit this document within RStudio to generate a PDF, and upload that document to CourseWorks by 5 pm on 16 April.


# Link Functions and Poisson Regression


## Exercise 1: Importing and Visualizing the Kline Dataset

Import the Kline dataset that was shown in the *Statistical Rethinking* text and lecture. Add a log population size variable (`log_pop`) to the data frame for use as a predictor variable. Now, visualize the relationship between `log_pop` and `total_tools` using a scatter plot in `ggplot()`. Make the x-axis limits of your plot span from 0 to 15 and the y-axis limits of your plot span from 0 to 80. In addition, add the layer `geom_smooth(method = "lm", se = FALSE, fullrange = TRUE)` to your plot in order to display a linear trend line on top of the raw data.

```{r}
```
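
A sketch of one possible approach (unevaluated; `d` is a placeholder name):

```{r, eval = FALSE}
# Import the Kline data and add a log population size variable
data(Kline)
d <- Kline
d$log_pop <- log(d$population)

# Scatter plot of total tools vs. log population size, with a linear trend
# line extended across the full x-axis range
ggplot(d, aes(x = log_pop, y = total_tools)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
  xlim(0, 15) +
  ylim(0, 80)
```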


## Exercise 2: Fitting a Poisson GLM and a Standard Linear Model

First, fit a Poisson GLM to the Kline data (with `map()`), using `log_pop` as a predictor of total tool count. This model should replicate m10.12 from the *Statistical Rethinking* book, so reference the text if needed. After fitting the model, use `precis()` to display the 97% PIs for all model parameters.

Now, fit a standard linear model (with a Gaussian outcome distribution) to the Kline data, again using `log_pop` as a predictor of total tool count. Use the following priors: `dnorm(-50, 10)` for the intercept parameter, `dnorm(0, 10)` for the beta coefficient for `log_pop`, and `dunif(0, 10)` for the standard deviation parameter. Use `precis()` to display the 97% PIs for all model parameters after you've fit the model.

```{r}
```
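
As a rough template (unevaluated, with placeholder object names), the models could be structured as shown below. The Poisson model's priors here follow the pattern used in this week's lecture script, so double-check them against m10.12 in the text.

```{r, eval = FALSE}
# Poisson GLM: total tool count as a function of log population size
m.pois <- map(
  data = d,
  alist(
    total_tools ~ dpois(lambda),
    log(lambda) <- a + bp*log_pop,
    a ~ dnorm(0, 100),
    bp ~ dnorm(0, 1)
  )
)

# Standard linear model (Gaussian outcome) with the priors given above
m.gauss <- map(
  data = d,
  alist(
    total_tools ~ dnorm(mu, sigma),
    mu <- a + bp*log_pop,
    a ~ dnorm(-50, 10),
    bp ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  )
)

# 97% percentile intervals for both fitted models
precis(m.pois, prob = 0.97)
precis(m.gauss, prob = 0.97)
```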


## Exercise 3: Comparing Model-based Predictions

Imagine we discover a new Oceanic island with a population of 150 people (log population size of 5.01). Using `sim()`, generate predictions for total tool count on this island for both the Poisson GLM and the standard linear model. Output some of the predictions from each model (using `head()`, for example) just to see what they look like, and report the mean value of the predictions generated from both models.

Which model suggests a higher total tool count for this hypothetical island? Do you think both of these models generate sensible predictions for total tool count? Why or why not?

```{r}
```
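
One possible prediction workflow (unevaluated; assumes the model objects from Exercise 2 and uses placeholder names):

```{r, eval = FALSE}
# Counterfactual data for the hypothetical island (log(150) is about 5.01)
new.island <- data.frame(log_pop = 5.01)

# Simulate total tool counts from each model
preds.pois <- sim(m.pois, data = new.island, n = 1e4)
preds.gauss <- sim(m.gauss, data = new.island, n = 1e4)

# Inspect a few predictions from each model
head(preds.pois)
head(preds.gauss)

# Mean predicted total tool count from each model
mean(preds.pois)
mean(preds.gauss)
```
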
256 changes: 256 additions & 0 deletions lecture/lecture_week_12.R
@@ -0,0 +1,256 @@


library(rethinking)
library(dplyr)

#==========================================================


# Import and prep the country data

data(rugged)
d <- rugged

# Make log version of outcome (GDP in year 2000)
d$log_gdp <- log(d$rgdppc_2000)

# Extract countries with GDP data
dd <- d[complete.cases(d$rgdppc_2000), ]

#==========================================================


# Fit a multiple regression with an interaction term

m8.1 <- map(
  data = dd,
  alist(
    log_gdp ~ dnorm(mu, sigma),
    mu <- a + bA*cont_africa + bR*rugged + bAR*rugged*cont_africa,
    a ~ dnorm(0, 100),
    bA ~ dnorm(0, 10),
    bR ~ dnorm(0, 10),
    bAR ~ dnorm(0, 10),
    sigma ~ dunif(0, 10)
  )
)


# Fit a multiple regression with an interaction term
# using map2stan()

# Trim data to ensure we don't run into any Stan issues
dd.trim <- dd %>%
  select(log_gdp, rugged, cont_africa)

# Fit the model using map2stan()
m8.1stan <- map2stan(
  data = dd.trim,
  alist(
    log_gdp ~ dnorm(mu, sigma),
    mu <- a + bA*cont_africa + bR*rugged + bAR*rugged*cont_africa,
    a ~ dnorm(0, 100),
    bA ~ dnorm(0, 10),
    bR ~ dnorm(0, 10),
    bAR ~ dnorm(0, 10),
    sigma ~ dcauchy(0, 2)
  )
)

# Compare model fits using precis()
precis(m8.1, prob = 0.97)
precis(m8.1stan, prob = 0.97)

# Other Stan fit model summary functions
show(m8.1stan)
summary(m8.1stan)

# To extract the actual samples embedded in the fit model
# object...
e8.1 <- extract.samples(m8.1stan)
summary(e8.1)

#==========================================================


# Plot parameter trace plots

# Using the rethinking plotting method
plot(m8.1stan)
plot(m8.1stan, window = c(1000, 2000)) # modify x-axis

# Using the rstan plotting method
rstan::traceplot(m8.1stan@stanfit,
                 pars = c("a", "bR", "bA", "bAR", "sigma"))

#==========================================================


# Reminders about the Poisson distribution

# To generate random samples from the Poisson:
rpois(n = 10, lambda = 5)

# Lambda can be any positive number:
rpois(n = 10, lambda = 0.536787)
# Or 0:
rpois(n = 10, lambda = 0)
# But it can't be negative:
rpois(n = 10, lambda = -1)

# The Poisson probability mass function:
dpois(0, lambda = 5)
dpois(3, lambda = 5)
dpois(5, lambda = 5)
dpois(20, lambda = 5)
dpois(100, lambda = 5)

#==========================================================


# Fitting a Poisson GLM


# Import the Kline dataset
data(Kline)
d <- Kline
d

# Generate modified predictor variables
d$log_pop <- log(d$population)
d$contact_high <- ifelse(d$contact == "high", 1, 0)
d

# Fit a Poisson GLM with the effects of log population
# size, contact rate, and their interaction
m10.10 <- map(
  data = d,
  alist(
    total_tools ~ dpois(lambda),
    log(lambda) <- a + bp*log_pop + bc*contact_high + bpc*log_pop*contact_high,
    a ~ dnorm(0, 100),
    bp ~ dnorm(0, 1),
    bc ~ dnorm(0, 1),
    bpc ~ dnorm(0, 1)
  )
)

precis(m10.10, prob = 0.97)

#==========================================================


# Generate model-based predictions (the manual way)

# Get 5,000 posterior samples from the fit model
post <- extract.samples(m10.10, n = 5000)

# Generate expected lambda values for an island with a
# log population size of 8 and high contact rate by
# applying the inverse link function to the linear model
# for lambda
lambda.high <-
  exp(post$a + post$bp*8 + post$bc*1 + post$bpc*8*1)

# Generate expected lambda values for an island with a
# log population size of 8 and low contact rate by
# applying the inverse link function to the linear model
# for lambda
lambda.low <-
  exp(post$a + post$bp*8 + post$bc*0 + post$bpc*8*0)

# Remember, these represent the expected values
# for lambda, NOT the actual counts
head(lambda.high)
head(lambda.low)

# To get count predictions, we need to feed these
# lambda values through rpois()
preds.high <- rpois(n = 5000, lambda.high)
preds.low <- rpois(n = 5000, lambda.low)

head(preds.high)
head(preds.low)


# Or we could generate the same predictions by using the
# convenience function sim(), feeding it counterfactual
# data

# Generate counterfactual data for high and low contact
# islands
counterfactual.high <-
  data.frame(log_pop = 8, contact_high = 1)
counterfactual.low <-
  data.frame(log_pop = 8, contact_high = 0)

# Generate predictions using sim()
preds.high.sim <-
  sim(m10.10, n = 5000, data = counterfactual.high)
preds.low.sim <-
  sim(m10.10, n = 5000, data = counterfactual.low)

head(preds.high.sim)
head(preds.low.sim)


# Visualize the predictions

par(mfrow = c(2, 2))

simplehist(preds.high,
           xlab = "", ylab = "",
           xlim = c(0, 60), ylim = c(0, 500),
           col = "red",
           main = "High contact preds, manual")

simplehist(preds.high.sim,
           xlab = "", ylab = "",
           xlim = c(0, 60), ylim = c(0, 500),
           # col.alpha() comes with rethinking; alpha() would need scales/ggplot2
           col = col.alpha("red", 0.5),
           main = "High contact preds, sim")

simplehist(preds.low,
           xlab = "", ylab = "",
           xlim = c(0, 60), ylim = c(0, 500),
           col = "blue",
           main = "Low contact preds, manual")

simplehist(preds.low.sim,
           xlab = "", ylab = "",
           xlim = c(0, 60), ylim = c(0, 500),
           col = col.alpha("blue", 0.5),
           main = "Low contact preds, sim")

#==========================================================


# Fit the same Poisson GLM using map2stan() with
# 4 Markov chains
m10.10stan <- map2stan(
  data = d,
  alist(
    total_tools ~ dpois(lambda),
    log(lambda) <- a + bp*log_pop + bc*contact_high + bpc*log_pop*contact_high,
    a ~ dnorm(0, 100),
    bp ~ dnorm(0, 1),
    bc ~ dnorm(0, 1),
    bpc ~ dnorm(0, 1)
  ),
  chains = 4
)

# Compare model fits using precis()
precis(m10.10, prob = 0.97)
precis(m10.10stan, prob = 0.97)

# Extract samples
e <- extract.samples(m10.10stan)
summary(e)

# Plot parameter trace plots
plot(m10.10stan)
plot(m10.10stan, window = c(500, 2000))
plot(m10.10stan, window = c(1000, 2000))
rstan::traceplot(m10.10stan@stanfit,
                 pars = c("a", "bp", "bc", "bpc"))
Binary file added lecture/lecture_week_12.pptx
