README.Rmd

RDDtools: an R package for Regression Discontinuity Design
========================================================

**RDDtools** is a new R package under development, designed to offer a set of tools to run all the steps required for a Regression Discontinuity Design (RDD) Analysis, from primary data visualisation to discontinuity estimation, sensitivity and placebo testing. 


Installing **RDDtools**
-----------------------

This github website hosts the source code. One of the easiest ways to install the package from github is by using the R package **devtools**:

```{r eval=FALSE}
library(devtools)
install_github(repo="RDDtools", username="MatthieuStigler", subdir="RDDtools")
```

Note however the latest version of RDDtools only works with R 3.0, and that you might need to install  [Rtools](http://stat.ethz.ch/CRAN/bin/windows/Rtools/) if on Windows. 


Documentation
-----------------------
The (preliminary) documentation is available in the help files directly, as well as in the *vignette*. The vignette can be accessed from R with vignette("RDDtools"), or by accessing the [pdf](https://github.com/MatthieuStigler/RDDtools/raw/master/RDDtools/inst/doc/RDDtools.pdf) stored on this github. 

RDDtools: main features
-----------------------


+  Simple visualisation of the data using binned-plot: **plot()**

+ Bandwidth selection:
  + MSE-RDD bandwidth procedure of [Imbens and Kalyanaraman 2012]: **RDDbw_IK()**
  + MSE global bandwidth procedure of [Ruppert et al 1995]: **RDDbw_RSW()**
+ Estimation:
  +  RDD parametric estimation: **RDDreg_lm()** This includes specifying the polynomial order, including covariates with various specifications as advocated in [Imbens and Lemieux 2008].
  +  RDD local non-parametric estimation: **RDDreg_np()**. Can also include covariates, and allows different types of inference (fully non-parametric, or parametric approximation). 
  +  RDD generalised estimation: allows to use custom estimating functions to get the RDD coefficient. Could allow for example a probit RDD, or quantile regression.
+ Post-Estimation tools:
  + Various tools, to obtain predictions at given covariate values ( **RDDpred()** ), or to convert to other classes, to lm ( **as.lm()** ), or to the package *np* ( **as.npreg()** ). 
  + Function to do inference with clustered data: **clusterInf()** either using a cluster covariance matrix ( **vcovCluster()** ) or by a degrees of freedom correction (as in [Cameron et al. 2008]).
+ Regression sensitivity analysis:
  + Plot the sensitivity of the coefficient with respect to the bandwith: **plotSensi()**
  + *Placebo plot* using different cutpoints: **plotPlacebo()**
+ Design sensitivity analysis:
  + McCrary test of manipulation of the forcing variable: wrapper **dens_test()** to the function **DCdensity()** from package **rdd**. 
  + Test of equal means of covariates: **covarTest_mean()**
  + Test of equal density of covariates: **covarTest_dens()**
+ Datasets
  + Contains the seminal dataset of [Lee 2008]: **Lee2008**
  + Contains functions to replicate the Monte-Carlo simulations of [Imbens and Kalyanaraman 2012]: **gen_MC_IK()**

Using RDDtools: a quick example
-----------------------
**RDDtools** works in an object-oriented way: the user has to define once the characteristic of the data, creating a *RDDdata* object, on which different anaylsis tools can be applied. 

### Data preparation and visualisation
Load the package, and load the built-in dataset from [Lee 2008]:

```{r options, echo=FALSE}
opts_chunk$set(warning= FALSE, message=FALSE, fig.align="center", fig.path='figuresREADME/')
```


```{r}
library(RDDtools)
data(Lee2008)
```

Declare the data to be a *RDDdata* object:

```{r}
Lee2008_rdd <- RDDdata(y=Lee2008$y, x=Lee2008$x, cutpoint=0)
```


You can now directly summarise and visualise this data:

```{r dataPlot}
summary(Lee2008_rdd)
plot(Lee2008_rdd)
```

### Estimation

#### Parametric

Estimate parametrically, by fitting a 4th order polynomial:
```{r reg_para}
reg_para <- RDDreg_lm(RDDobject=Lee2008_rdd, order=4)
reg_para

plot(reg_para)
```


#### Non-parametric
As well as run a simple local regression, using the [Imbens and Kalyanaraman 2012] bandwidth:
```{r RegPlot}
bw_ik <- RDDbw_IK(Lee2008_rdd)
reg_nonpara <- RDDreg_np(RDDobject=Lee2008_rdd, bw=bw_ik)
print(reg_nonpara)
plot(x=reg_nonpara)

```

### Regression Sensitivity tests:

One can easily check the sensitivity of the estimate to different bandwidths:
```{r SensiPlot}
plotSensi(reg_nonpara, from=0.05, to=1, by=0.1)
```

Or run the Placebo test, estimating the RDD effect based on fake cutpoints:
```{r placeboPlot}
plotPlacebo(reg_nonpara)
```

### Design Sensitivity tests:

Design sensitivity tests check whether the discontinuity found can actually be attributed ot other causes. Two types of tests are available:

+ Discontinuity comes from manipulation: test whether there is possible manipulation around the cutoff, McCrary 2008 test: **dens_test()**
+ Discontinuity comes from other variables: should test whether discontinuity arises with covariates. Currently, only simple tests of equality of covariates around the threshold are available: 

#### Discontinuity comes from manipulation: McCrary test

use simply the function **dens_test()**, on either the raw data, or the regression output:
```{r DensPlot}
dens_test(reg_nonpara)
```

#### Discontinuity comes from covariates: covariates balance tests

Two tests available:
+ equal means of covariates: **covarTest_mean()**
+ equal density of covariates: **covarTest_dens()**


We need here to simulate some data, given that the Lee (2008) dataset contains no covariates.
We here simulate three variables, with the second having a different mean on the left and the right. 

```{r}
set.seed(123)
n_Lee <- nrow(Lee2008)
Z <- data.frame(z1 = rnorm(n_Lee, sd=2), 
                z2 = rnorm(n_Lee, mean = ifelse(Lee2008<0, 5, 8)), 
                z3 = sample(letters, size = n_Lee, replace = TRUE))
Lee2008_rdd_Z <- RDDdata(y = Lee2008$y, x = Lee2008$x, covar = Z, cutpoint = 0)
```


Run the tests:
```{r}
## test for equality of means around cutoff:
covarTest_mean(Lee2008_rdd_Z, bw=0.3)

## Can also use function covarTest_dis() for Kolmogorov-Smirnov test:
covarTest_dis(Lee2008_rdd_Z, bw=0.3)
```

Tests correctly reject equality of the second, and correctly do not reject equality for the first and third. 

  [Imbens and Kalyanaraman 2012]: http://ideas.repec.org/a/oup/restud/v79y2012i3p933-959.html "Imbens, G. & Kalyanaraman, K. (2012) Optimal Bandwidth Choice for the Regression Discontinuity Estimator, Review of Economic Studies, 79, 933-959"
  
  [Lee 2008]: http://ideas.repec.org/a/eee/econom/v142y2008i2p675-697.html "Lee, D. S. (2008) Randomized experiments from non-random selection in U.S. House elections, Journal of Econometrics, 142, 675-697"
  
  [Imbens and Lemieux 2008]: http://ideas.repec.org/a/eee/econom/v142y2008i2p615-635.html "Imbens, G. & Lemieux, T. (2008) Regression discontinuity designs: A guide to practice, Journal of Econometrics, Vol. 142(2), pages 615-635"
  
  [Cameron et al. 2008]: http://ideas.repec.org/a/tpr/restat/v90y2008i3p414-427.html "Cameron, Gelbach and Miller (2008) Bootstrap-Based Improvements for Inference with Clustered Errors, The Review of Economics and Statistics, Vol. 90(3), pages 414-427"
  
  [Ruppert et al 1995]: http://www.jstor.org/stable/2291516 "Ruppert, D., Sheather, S. J. and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90, 1257–1270."