Skip to content

Commit a360b04

Browse files
change to simplystats blog format
1 parent 43c2de7 commit a360b04

File tree

3,808 files changed

+40599
-346556
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

3,808 files changed

+40599
-346556
lines changed

404.html

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
---
2+
layout: default
3+
title: "404: Page not found"
4+
permalink: 404.html
5+
---
6+
7+
<div class="page">
8+
<h1 class="page-title">404: Page not found</h1>
9+
<p class="lead">Sorry, we've misplaced that URL or it's pointing to something that doesn't exist. <a href="{{ site.baseurl }}/">Head back home</a> to try finding it again.</p>
10+
</div>

404.md

Lines changed: 0 additions & 9 deletions
This file was deleted.

CNAME

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
simplystatistics.org

LICENSE

Lines changed: 0 additions & 22 deletions
This file was deleted.

README.md

Lines changed: 0 additions & 6 deletions
This file was deleted.
Lines changed: 151 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,151 @@
1+
---
2+
title: Time Series Analysis in Biomedical Science - What You Really Need to Know
3+
author: roger
4+
comments: true
5+
layout: post
6+
output:
7+
html_document:
8+
keep_md: yes
9+
---
10+
11+
For a few years now I have given a guest lecture on time series analysis in our School's *Environmental Epidemiology* course. The basic thrust of this lecture is that you should generally ignore what you read about time series modeling, either in papers or in books. The reason is because I find much of the time series literature is not particularly helpful when doing analyses in a biomedical or population health context, which is what I do almost all the time.
12+
13+
## Prediction vs. Inference
14+
15+
First, most of the literature on time series models tends to assume that you are interested in doing prediction---forecasting future values in a time series. I almost am never doing this. In my work looking at air pollution and mortality, the goal is never to find the best model that predicts mortality. In particular, if our goal were to predict mortality, we would probably *never include air pollution as a predictor*. This is because air pollution has an inherently weak association with mortality at the population, whereas things like temperature and other seasonal factors tend to have a much stronger association.
16+
17+
What I *am* interested in doing is estimating an association between changes in air pollution levels and mortality and making some sort of inference about that association, either to a broader population or to other time periods. The challenges in these types of analyses include estimating weak associations in the presence of many stronger signals and appropriately adjusting for any potential confounding variables that similarly vary over time.
18+
19+
The reason the distinction between prediction and inference is important is that focusing on one vs. the other can lead you to very different model building strategies. Prediction modeling strategies will always want you to include into the model factors that are strongly correlated with the outcome and explain a lot of the outcome's variation. If you're trying to do inference and use a prediction modeling strategy, you may make at least two errors:
20+
21+
1. You may conclude that your key predictor of interest (e.g. air pollution) is not important because the modeling strategy didn't deem to include it
22+
2. You may omit important potential confounders because they have a weak releationship with the outcome (but maybe have a strong relationship with your key predictor). For example, one class of potential confounders in air pollution studies is other pollutants, which tend to be weakly associated with mortality but may be strongly associated with your pollutant of interest.
23+
24+
## Random vs. Fixed
25+
26+
Another area where I feel much time series literature differs from my practice is on the whether to focus on fixed effects or random effects. Most of what you might think of when you think of time series models (i.e. AR models, MA models, GARCH, etc.) focuses on modeling the *random* part of the model. Often, you end up treating time series data as random because you simply do not have any other data. But the reality is that in many biomedical and public health applications, patterns in time series data can be explained by clearly understood fixed patterns.
27+
28+
For example, take this time series here. It is lower at the beginning and at the end of the series, with higher level sin the middle of the period.
29+
30+
![Time series with seasonal pattern 1](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_fixed.png)
31+
32+
It's possible that this time series could be modeled with an auto-regressive (AR) model or maybe an auto-regressive moving average (ARMA) model. Or it's possible that the data are exhibiting a seasonal pattern. It's impossible to tell from the data whether this is a random formulation of this pattern or whether it's something you'd expect every time. The problem is that we usually onl have *one observation* from teh time series. That is, we observe the entire series only once.
33+
34+
Now take a look at this time series. It exhibits some of the same properties as the first series: it's low at the beginning and end and high in the middle.
35+
36+
![Time series with seasonal pattern 2](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_random.png)
37+
38+
Should we model this as a random process or as a process with a fixed pattern? That ultimately will depend on the what type of data this is and what we know about it. If it's air pollution data, we might do one thing, but if it's stock market data, we might do a totally different thing.
39+
40+
If one were to see replicates of the time series, we'd be able to resolve the fixed vs. random question. For example, because I simulated the data above, I can simulate another replicate and see what happens. In the plot below I show two replications from each of the processes.
41+
42+
![Fixed and random time series patterns](https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_both.png)
43+
44+
It's clear now that the time series on the top row has a fixed "seasonal" pattern while the time series on the bottom row is random (in fact it is simulated from an AR(1) model).
45+
46+
The point here is that I think very often we know things about the time series that we're modeling that we know introduced fixed variation into the data: seasonal patterns, day-of-week effects, and long-term trends. Furthermore, there may be other time-varying covariates that can help predict whatever time series we're modeling and can be put into the fixed part of the model (a.k.a regression modeling). Ultimately, when many of these fixed components are accounted for, there's relatively little of interest left in the residuals.
47+
48+
49+
## What to Model?
50+
51+
```{r,echo=FALSE}
52+
setwd("~/projects/nmmaps1987_2005")
53+
library(dplyr, warn.conflicts = FALSE)
54+
knitr::opts_chunk$set(echo = FALSE, comment = NA)
55+
det <- readRDS("det.rds")
56+
ds <- select(det, death, pm10tmean, date) %>% group_by(date) %>%
57+
summarize(death = sum(death), pm10 = sum(pm10tmean)) %>%
58+
filter(date <= "2000-12-31")
59+
ds$pm10[is.na(ds$pm10)] <- mean(ds$pm10, na.rm = TRUE)
60+
```
61+
62+
63+
So the question remains: What should I do? The short answer is to try to incorporate everything that you know about the data into the fixed/regression part of the model. Then take a look at the residuals and see if you still care.
64+
65+
Here's a quick example from my work in air pollution and mortality. The data are all-cause mortality and PM10 pollution from Detroit for the years 1987--2000. The question is whether daily mortaliy is associated with daily changes in ambient PM10 levels. We can try to answer that with a simple linear regression model:
66+
67+
68+
```{r}
69+
fit <- lm(death ~ pm10, data = ds)
70+
summary(fit)
71+
```
72+
73+
PM10 appears to be positively associated with mortality, but when we look at the autocorrelation function of the residuals, we see
74+
75+
76+
```{r}
77+
acf(resid(fit), lag.max = 365)
78+
```
79+
80+
If we see a seasonal-like pattern in the auto-correlation function, then that means there's a seasonal pattern in the residuals as well. Not good.
81+
82+
But okay, we can just model the seasonal component with an indicator of the season.
83+
84+
```{r}
85+
ds <- mutate(ds, season = factor(quarters(date)))
86+
fit <- lm(death ~ season + pm10, data = ds)
87+
summary(fit)
88+
```
89+
90+
Note that the coefficient for PM10, the coefficient of real interest, gets a little bigger when we add the seasonal component.
91+
92+
When we look at the residuals now, we see
93+
94+
```{r}
95+
acf(resid(fit), lag.max = 365)
96+
```
97+
98+
The seasonal pattern is gone, but we see that there's positive autocorrelation at seemingly long distances (~100s of days). This is usually an indicator that there's some sort of long-term trend in the data. Since we only care about the day-to-day changes in PM10 and mortality, it would make sense to remove any such long-term trend. I can do that by just including the date as a linear predictor.
99+
100+
101+
```{r}
102+
fit.reg <- lm(death ~ season + date + pm10, data = ds)
103+
summary(fit.reg)
104+
```
105+
106+
Now we can look at the autocorrelation function one last time.
107+
108+
```{r}
109+
acf(resid(fit.reg), lag.max = 40)
110+
```
111+
112+
113+
The ACF trails to zero reasonably quickly now, but there's still some autocorrelation at short lags up to about 15 days or so.
114+
115+
Now we can engage in some traditional time series modeling. We might want to model the residuals with an auto-regressive model over order *p*. What should *p* be? We can check by looking at the partial autocorrelation function (PACF).
116+
117+
```{r}
118+
pacf(resid(fit.reg))
119+
```
120+
121+
The PACF seems to suggest we should fit an AR(6) or AR(7) model. Let's use an AR(6) model and see how things look. We can use the `arima()` function for that.
122+
123+
```{r}
124+
m <- model.matrix(~ season + date + pm10, data = ds)
125+
y <- ds$death
126+
fit.ar <- arima(y, c(6, 0, 0), xreg = m, include.mean = FALSE)
127+
fit.ar
128+
```
129+
130+
Note that the coefficient for PM10 hasn't changed much from our initial models. The usual concern with not accounting for residual autocorrelation is that the variance/standard error of the coefficient of interest will be affected. In this case, there does not appear to be much of a difference between using the AR(6) to account for the residual autocorrelation and ignoring it altogether. Here's a comparison of the standard errors for each coefficient.
131+
132+
```{r}
133+
cm <- matrix(nrow = length(coef(fit.reg)), ncol = 2)
134+
rownames(cm) <- names(coef(fit.reg))
135+
colnames(cm) <- c("Naive", "AR model")
136+
cm[, 1] <- vcov(fit.reg) %>% diag %>% sqrt
137+
cm[, 2] <- sqrt(diag(fit.ar$var.coef))[names(coef(fit.reg))]
138+
round(cm, 6)
139+
```
140+
141+
The standard errors for the `pm10` variable are almost identical, while the standard errors for the other variables are somewhat bigger in the AR model.
142+
143+
## Conclusion
144+
145+
Ultimately, I've found that in many biomedical and public health applications, time series modeling is very different from what I read in the textbooks. The key takeaways are:
146+
147+
1. Make sure you know if you're doing **prediction** or **inference**. Most often you will be doing inference, in which case your modeling strategies will be quite different.
148+
149+
2. Focus separately on the **fixed** and **random** parts of the model. In particular, work with the fixed part of the model first, incorporating as much information as you can that will explain variability in your outcome.
150+
151+
3. Model the random part appropriately, after incorporating as much as you can into the fixed part of the model. Classical time series models may be of use here, but also simple robust variance estimators may be sufficient.

_config.yml

100644100755
Lines changed: 16 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,17 @@
1-
# Site settings
2-
title: R Programming @ UC
3-
4-
description: > # this means to ignore newlines until "baseurl:"
5-
6-
baseurl: "http://r-uc.github.io/" # the subpath of your site, e.g. /blog/
7-
url: "http://r-uc.github.io/" # the base hostname & protocol for your site
8-
twitter_username: bradleyboehmke
9-
github_username: uc-r
10-
permalink: /:year/:month/:title.html
11-
excerpt_separator: <!--more-->
1+
# Dependencies
2+
gems: [jekyll-paginate]
3+
paginate: 5
4+
paginate_path: "page:num"
5+
6+
# Permalinks
7+
#
8+
# Use of `relative_permalinks` ensures post links from the index work properly.
9+
permalink: pretty
10+
11+
# Setup
12+
title: Simply Statistics
13+
tagline: 'A statistics blog by <a href="https://twitter.com/rafalab">Rafa Irizarry<a>, <a href="https://twitter.com/rdpeng">Roger Peng<a>, and <a href="https://twitter.com/jtleek">Jeff Leek<a>'
14+
description: '<img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/public/sslogo.png" height=60 </img> Follow us on twitter <a href="https://twitter.com/simplystats">@simplystats</a>'
15+
url: http://simplystats.github.io
16+
baseurl: ''
1217

13-
# Build settings
14-
markdown: kramdown

_data/authors.yml

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
# Author details.
2+
jeff:
3+
name: Jeff Leek
4+
web: http://twitter.com/jtleek
5+
roger:
6+
name: Roger Peng
7+
web: http://twitter.com/rdpeng
8+
rafa:
9+
name: Rafa Irizarry
10+
web: http://twitter.com/rafalab
647 KB
Loading
120 KB
Loading

0 commit comments

Comments
 (0)