Open Policing Project (OPP)

Simple Bulk Downloads

Install Python 3 if you do not have it, then run the following from the command line:

git clone https://github.com/stanford-policylab/opp.git # clone the repo
cd opp

# Example uses:
./download                            # download all locations with csv files to /tmp/opp_data
./download -h                         # see help for all commands
./download -t csv -l                  # list all locations with csv files
./download -t csv                     # download all locations with csv files to /tmp/opp_data
./download -t shapefiles -l           # list all locations with shapefiles
./download -t shapefiles              # download all locations with shapefiles to /tmp/opp_data
./download -t rds -l                  # list all locations with rds files
./download -t rds                     # download all locations with rds files to /tmp/opp_data
./download -t csv -o ~/Documents/opp  # download all locations with csv files to ~/Documents/opp
./download -t rds -o ~/Documents/opp  # download all locations with rds files to ~/Documents/opp
./download -t csv -s CA               # download all California csvs (state + city) to /tmp/opp_data
./download -c '.*beach.*'             # download all locations that have 'beach' in the city's name to /tmp/opp_data
./download -s CA -c '.*beach.*'       # download all locations in CA that have 'beach' in the city's name
./download -t rds -s CA -c 'Long Beach' -o ~/research/opp # will download the rds of Long Beach, CA data to ~/research/opp
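
When the same download has to be repeated for several states, the flags above can be combined in a small loop. The sketch below is illustrative and not part of the repository: the function name `opp_download_states` and the `OPP_DRY_RUN` variable are hypothetical, and it assumes it is run from the repository root, where `./download` lives. It uses only the `-t`, `-s`, and `-o` flags shown in the examples above.

```shell
# Hypothetical helper, not part of the opp repository.
# Usage: opp_download_states <type> <output_dir> <state>...
opp_download_states() {
  kind="$1"
  out_dir="$2"
  shift 2
  for state in "$@"; do
    if [ "${OPP_DRY_RUN:-0}" = "1" ]; then
      # Dry run: print the command instead of executing it.
      echo "./download -t $kind -s $state -o $out_dir"
    else
      # Stop at the first failed download.
      ./download -t "$kind" -s "$state" -o "$out_dir" || return 1
    fi
  done
}
```

For instance, `opp_download_states csv ~/Documents/opp CA WA` would fetch the csv files for California and Washington in turn; setting `OPP_DRY_RUN=1` first prints the commands without running them.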

Getting Started

Install R and clone the repository

git clone https://github.com/stanford-policylab/opp.git

Change into the repository's lib directory

cd opp/lib

Start R. The renv package should be automatically installed if not already available. Then, install the required packages using renv:

renv::restore(rebuild = TRUE)

This may take some time, as all packages must be rebuilt. For more details, see the renv package. (Note that using renv requires overriding your local .Rprofile.)

All these packages must successfully install in order to load the following main library:

source("opp.R")

Set download directory (optional); if you don't set this, it will default to /tmp/opp_data.

opp_set_download_directory('/my/data/directory')

Download some clean data

opp_download_clean_data("wa", "seattle")

Load the clean data

d <- opp_load_clean_data("wa", "seattle")

Explore!

Recreating Analyses

The easiest way to rerun all analyses from the command line is the following:

./run.R --paper

However, for this to work, all the data must be downloaded and available locally. We recommend setting the data directory to a location with sufficient space and ensuring a stable internet connection, since up to 10 GB of data will be downloaded. From within R, this can be done with the following:

source('opp.R')
opp_set_download_directory('/my/data/directory')
opp_download_all_clean_data()
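
Before starting the full download, it may help to confirm from the command line that the target directory has enough free space. The following is a minimal sketch, not something the repository provides: `has_free_space` is a hypothetical helper, and the 10 GB threshold in the usage note simply mirrors the estimate above.

```shell
# Hypothetical pre-flight check, not part of the opp repository.
# Usage: has_free_space <dir> <required_gb>
# Succeeds if <dir> has at least <required_gb> gigabytes available.
has_free_space() {
  dir="$1"
  required_kb=$(( $2 * 1024 * 1024 ))
  # df -P prints one line per filesystem in the POSIX format and
  # -k reports sizes in 1024-byte blocks; column 4 is the available space.
  available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$available_kb" -ge "$required_kb" ]
}
```

A call such as `has_free_space /my/data/directory 10 || echo "not enough space"` would flag a directory that is too small before any data is fetched.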

Each analysis can also be run independently from the command line:

./run.R --{disparity,marijuana,veil_of_darkness,prima_facie_stats}

They can also be run from within R code:

source('opp.R')
opp_run_{paper_analyses,disparity,marijuana_legalization_analysis,veil_of_darkness,prima_facie_stats}

Each of these effectively loads and runs the corresponding analysis script(s): disparity.R, veil_of_darkness.R, marijuana_legalization_analysis.R, or prima_facie_stats.R. disparity.R contains both the outcome and threshold tests, which are also available as independent scripts in outcome_test.R and threshold_test.R. After each analysis runs, its results are saved in the opp/results directory. Individual analyses take anywhere from ~20 minutes to several hours. Running all of them takes about a day on a modern server.

Each of these analyses requires a different subset of the clean data and loads it using the load function defined in eligibility.R. The eligibility script contains all the data filters for each of the analyses. By default, the load function applies all the filters and builds the filtered dataset from scratch, then automatically saves the result to the opp/cache directory. On later runs, you can call load(<analysis_name>, use_cache = T) to speed up loading, as it reuses the filtered dataset from the previous run.

Reprocessing Data

Each location has its own processing script, located at opp/lib/states/<state>/<city>.R. Each script conforms to a contract that defines two methods: load_raw and clean. load_raw loads and joins all the data while making minimal changes to the raw data, while clean processes and standardizes the data to bring it into compliance with our schema defined in standards.R.

Many convenience functions are defined in opp.R, utils.R, standardize.R, or sanitizers.R. Most cleaning scripts end with a call to a standardize function that adds calculated columns, selects only the columns in the schema (including those prefixed with raw_*), enforces data types (as defined in standards.R), and corrects predicates. For example, if contraband found was true but search conducted was false, contraband found is coerced to false, since nothing should be found if a search wasn't conducted. All of these choices are listed at the bottom of standards.R in the predicated_columns list.
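
To make the predicate rule concrete, here is a small command-line illustration of the same logic applied to a toy CSV. It is only a sketch: the actual correction is done in the repository's R code, and the function name `correct_predicate`, the two-column layout, the column names `search_conducted` and `contraband_found`, and the TRUE/FALSE encoding are all assumptions made for this example.

```shell
# Illustration only; the real correction lives in the repository's R code.
# Reads a CSV with the header search_conducted,contraband_found on stdin and
# forces contraband_found to FALSE wherever search_conducted is FALSE.
correct_predicate() {
  awk -F, -v OFS=, '
    NR == 1 { print; next }                          # header passes through
    $1 == "FALSE" && $2 == "TRUE" { $2 = "FALSE" }   # nothing found without a search
    { print }
  '
}
```

Piping a file through it, for example `correct_predicate < stops.csv` with a hypothetical `stops.csv`, leaves rows with a conducted search untouched and rewrites only the inconsistent ones.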

If you have access to the raw data, you can modify the script associated with a location and run ./run.R --process --state <state> --city <city> to reprocess that location using the updated script.

Raw data is available upon request.