-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
202 lines (147 loc) · 7.44 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
---
title: "drconfseq"
output:
md_document:
variant: gfm
---
\newcommand{\EE}{\mathbb E}
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# drconfseq: Doubly robust confidence sequences
This package implements doubly robust confidence sequences which provide safe,
anytime-valid inference for the average treatment effect (ATE) from sequential data.
These allow you to
1. Update inferences about the ATE as new data become available;
2. Continuously monitor studies and adaptively stop or continue them for any reason; and
3. Control the type-I error at all stopping times, including random or data-dependent times.
Furthermore, these confidence sequences can be used in both randomized sequential experiments and observational studies with no unmeasured confounding.
The main methodological reference is
- Waudby-Smith, I., Arbour, D., Sinha, R., Kennedy, E. H., & Ramdas, A. (2021). Doubly robust confidence sequences for sequential causal inference. arXiv preprint [arXiv:2103.06476](https://arxiv.org/pdf/2103.06476.pdf).
## Installation
You can install the package using `devtools`:
```{r, eval=FALSE}
# Install devtools if you do not have it:
install.packages("devtools")
# Install the package using devtools::install_github
devtools::install_github("wannabesmith/drconfseq")
```
## Example usage
Let's jump into a simple example taken from ["Doubly-robust confidence sequences for sequential causal inference"](https://arxiv.org/pdf/2103.06476.pdf). Suppose that we wish to estimate the ATE in an observational setting with no unmeasured confounding.
```{r, warning=FALSE, message=FALSE}
library(drconfseq)
# Optional, can be used to speed up computations
library(parallel)
```
First, we will generate $n = 10^4$ observations each with 3 real-valued covariates from a trivariate Gaussian.
```{r}
n <- 10000
d <- 3
X <- cbind(1, matrix(rnorm(n*d), nrow = n))
```
Randomly assign subjects to treatment or control groups with equal probability.
```{r}
treatment <- rbinom(n = n, size = 1, prob = 1 / 2)
```
Define the regression function,
\begin{equation}
\label{eq:simulationRegressionFunction}
f^\star(x_1, x_2, x_3) := 1 - x_1^2 - 2\sin(x_2) + 3|x_3|,
\end{equation}
and the target parameter (which we will ensure is the average treatment effect by design),
\[ \mathrm{ATE} := 1. \]
```{r}
beta_mu <- c(1, -1, -2, 3)
mu <- function(x)
{
beta_mu %*% c(1, x[1]^2, sin(x[2]), abs(x[3]))
}
ATE <- 1
```
Finally, generate outcomes $y_1, \dots, y_n$ as
\[ y_i := f^\star(x_{i, 1}, x_{i, 2}, x_{i,3}) + \mathrm{ATE}\cdot \mathrm{treatment}_i + \epsilon_i, \]
where $\epsilon_i \sim t_{5}$ are drawn from a $t$-distribution with 5 degrees of freedom.
```{r}
y <- apply(X, MARGIN=1, FUN=mu) + treatment*ATE + rt(n, df=5)
```
We will build three estimators for the ATE with increasing degrees of complexity.
1. **Unadjusted**: uses the constant function $0$ for the outcome regression. This is equivalent to an inverse-probability-weighted estimator.
2. **Parametric**: estimates the outcome regression with a linear model.
3. **Super Learner**: estimates the outcome regression with a "Super Learner" weighted average of linear models and other machine learning algorithms including regression splines, random forests, and generalized additive models.
Since we are in a randomized experiments, all three estimators will consistently estimate and provide valid sequential inference for the ATE. However, since the underlying regression function is nonlinear, we expect the Super Learner to outperform the Parametric to outperform the Unadjusted.
```{r, cache = TRUE}
# Get SuperLearner prediction function for $\mu^1$.
sl_reg_1 <- get_SL_fn(SL.library = c("SL.earth", "SL.gam", "SL.glm", "SL.ranger"))
# Do the same for $\mu^0$.
sl_reg_0 <- sl_reg_1
# Get Parametric regression function for $\mu^1$.
parametric_reg_1 = get_SL_fn(SL.library = "SL.glm")
# Do the same for $\mu^0$.
parametric_reg_0 = get_SL_fn(SL.library = "SL.glm")
# Compute the confidence sequence at logarithmically-spaced time points
times <- unique(round(logseq(500, 10000, n = 10)))
alpha <- 0.1
# Set n_cores to 1 if you are not using the parallel package
n_cores <- detectCores()
confseq_SL <- confseq_ate(y, X, treatment, regression_fn_1 = sl_reg_1,
regression_fn_0 = sl_reg_0,
propensity_score_fn = function(y, X, newX){1/2},
t_opt = 1000, alpha=alpha,
times=times, n_cores = n_cores, cross_fit = TRUE)
confseq_glm <- confseq_ate(y, X, treatment, regression_fn_1 = parametric_reg_1,
regression_fn_0 = parametric_reg_0,
propensity_score_fn = function(y, X, newX){1/2},
t_opt = 1000, alpha=alpha,
times=times, n_cores = n_cores, cross_fit = TRUE)
confseq_unadj <- confseq_ate_unadjusted(y = y, treatment = treatment,
propensity_score = 1/2,
t_opt = 1000, alpha = alpha,
times = times)
```
Let us now plot the confidence sequence.
```{r, message = FALSE, fig.width=8, fig.height=4}
library(ggplot2)
plot_df <- data.frame(t = rep(times, 6),
bound = c(confseq_SL$l, confseq_SL$u,
confseq_glm$l, confseq_glm$u,
confseq_unadj$l, confseq_unadj$u),
upper_lower = c(rep('l', length(confseq_SL$l)),
rep('u', length(confseq_SL$u)),
rep('l', length(confseq_glm$l)),
rep('u', length(confseq_glm$u)),
rep('l', length(confseq_unadj$u)),
rep('u', length(confseq_unadj$u))),
Model = c(rep('Super Learner', 2*length(confseq_SL$l)),
rep('Parametric', 2*length(confseq_glm$l)),
rep('Unadjusted', 2*length(confseq_unadj$l))))
plt_cs <-
ggplot() +
geom_line(aes(x=t, y=bound, color=Model, linetype=Model),
data=plot_df[plot_df$upper_lower=='l',]) +
geom_line(aes(x=t, y=bound, color=Model, linetype=Model),
data=plot_df[plot_df$upper_lower=='u',]) +
geom_hline(yintercept=ATE, linetype='dashed', color='gray') +
ylab('Confidence sequences for the\naverage treatment effect') + xlab('Time') +
scale_x_log10(breaks = scales::trans_breaks("log10",
function(x) 10^x),
labels = scales::trans_format("log10",
scales::math_format(10^.x))) +
annotation_logticks(colour = "grey", side = "b") +
theme_minimal() +
theme(text = element_text(family="serif")) +
scale_color_brewer(palette="Set2")
plt_cs
```
## Run unit tests
First, clone the repository:
```{zsh, eval=FALSE}
git clone [email protected]:WannabeSmith/drconfseq.git
```
Then in an R console, run
```{r, eval=FALSE}
devtools::check("path/to/drconfseq")
```
## Reproduce the paper's plots
See the README in [this folder](paper_plots).
## Credits
The methods implemented in this package were developed by [Ian Waudby-Smith](https://ian.waudbysmith.com), [David Arbour](https://darbour.github.io/), [Ritwik Sinha](https://research.adobe.com/person/ritwik-sinha/), [Edward H. Kennedy](http://ehkennedy.com), and [Aaditya Ramdas](http://stat.cmu.edu/~aramdas).