GerryChain
GerryChain is a Python library for using Markov chain Monte Carlo (MCMC) methods to study the problem of political redistricting. PGP has so far used GerryChain to analyze plans in Missouri.
The basic workflow is to start with the geometry of an initial plan (aka the "initial partition") and generate a large collection of sample plans for comparison. We constrain these sampled plans based on contiguity, compactness, county splits, and/or population deviation. Comparing the initial plan to the ensemble provides quantitative tools for measuring whether it is an outlier among the sampled plans.
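As a toy illustration of that outlier comparison, here is a plain-Python sketch (not the GerryChain API) that computes the percentile rank of an enacted plan's statistic within an ensemble. The statistic values are made up for illustration.

```python
import random

def percentile_rank(ensemble_values, enacted_value):
    """Fraction of sampled plans whose statistic is <= the enacted plan's."""
    below = sum(1 for v in ensemble_values if v <= enacted_value)
    return below / len(ensemble_values)

# Hypothetical efficiency-gap values from 1,000 sampled plans.
random.seed(0)
ensemble = [random.gauss(0.02, 0.01) for _ in range(1000)]

rank = percentile_rank(ensemble, enacted_value=0.06)
# A rank very close to 0 or 1 flags the enacted plan as an outlier
# relative to the sampled plans.
```

In a real run, `ensemble` would hold one statistic per step of the chain rather than simulated values.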
The GitHub repository for GerryChain is here, and the API reference is here with the most detailed information.
In April 2020, during the course of running GerryChain for Missouri, Hope had a series of conversations with researchers from MGGG about the library. This document synthesizes what was discussed in those conversations so that Hope+PGP doesn't forget the details.
There are two ways to run a chain: from a neutral map or from an enacted map. The goal is to explore the whole space of plans, so from the Markov chain perspective, starting from the enacted plan may not be great if it lies far from the typical region of the state space: if the chain needs to take lots of steps to get one good sample, you waste a lot of time. Starting from an enacted plan and starting from a neutral map should both work; it's a modeling question which one makes the most sense.
To start from a neutral map, GerryChain includes a tool for building new seed plans. Seed creation can sometimes run forever: a bug can leave the process stuck in a partial partition that cannot be completed. To avoid this, create a seed plan and give it about 10 minutes; if it finishes and works, save the seed plan.
Another option is to try a bunch of different random seeds in parallel, see which ones work, and save those out. I think this can be done with the recursive_tree_part function in the spanning-tree methods.
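A minimal sketch of the retry idea in plain Python, with a toy partitioner standing in for recursive_tree_part (the helper name and toy populations are made up; a real version would call GerryChain on the state's dual graph):

```python
import random

def try_build_seed(populations, n_districts, epsilon, seed):
    """Toy stand-in for recursive_tree_part: randomly assign units to
    districts and accept the plan only if every district lands within
    epsilon of the ideal population. Returns the assignment or None."""
    rng = random.Random(seed)
    ideal = sum(populations) / n_districts
    assignment = [rng.randrange(n_districts) for _ in populations]
    totals = [0] * n_districts
    for pop, district in zip(populations, assignment):
        totals[district] += pop
    if all(abs(t - ideal) / ideal <= epsilon for t in totals):
        return assignment
    return None

# Try a bunch of different random seeds and keep the first that works.
populations = [10, 12, 9, 11, 10, 8, 12, 10]
good_seed = None
for seed in range(1000):
    plan = try_build_seed(populations, n_districts=2, epsilon=0.1, seed=seed)
    if plan is not None:
        good_seed = seed
        break
```

The same loop shape works with the real seed-building function: record which seeds succeed, and save those plans out.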
When the proposal is recom, the resulting plans are always contiguous, so there is no need to add a contiguity check to the validator. Other constraints, like population deviation and county splits, still need to be programmed. We should be using recom all the time, based on MGGG's Virginia research.
Note that the more strictly balanced the population constraint, the longer the chain will take to run. If there are many districts (such as the House of Representatives in a large state) and a small population deviation (such as 1%), the chain can take many weeks to run.
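For reference, "within X% deviation" here means each district's population falls within X% of the ideal (total population divided by the number of districts). A quick plain-Python check of that definition, with made-up district populations:

```python
def within_percent_of_ideal(district_pops, epsilon):
    """True if every district's population is within epsilon
    (as a fraction, e.g. 0.05 for 5%) of the ideal population."""
    ideal = sum(district_pops) / len(district_pops)
    return all(abs(p - ideal) / ideal <= epsilon for p in district_pops)

# Made-up district populations; ideal here is 100,000.
pops = [100_500, 99_800, 100_200, 99_500]
loose = within_percent_of_ideal(pops, 0.05)    # 5% deviation
tight = within_percent_of_ideal(pops, 0.001)   # 0.1% deviation
```

The tighter the epsilon, the fewer proposed plans pass this check, which is why tightly balanced chains take so much longer.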
To check for convergence, look at the number of seats from each plan and other statistics like EG, mean-median difference, partisan bias, etc. at 100 steps, 1,000 steps, 10,000 steps, and so on. By creating a histogram of those statistics at varying numbers of steps you can check how the space has been explored.
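As an illustration of one such statistic and the snapshot idea, here is a plain-Python mean-median difference computed over a simulated chain (the vote shares are random stand-ins, not real chain output):

```python
import random
from statistics import mean, median

def mean_median(district_shares):
    """Mean-median difference: median district vote share minus the
    mean district vote share (one common sign convention)."""
    return median(district_shares) - mean(district_shares)

# Simulate a statistic trace over a fake 10,000-step chain.
random.seed(1)
chain_stats = []
for _ in range(10_000):
    shares = [random.uniform(0.3, 0.7) for _ in range(8)]
    chain_stats.append(mean_median(shares))

# Snapshot the trace at increasing step counts; if histograms of these
# snapshots stop changing shape, the space is (heuristically) well explored.
snapshots = {n: chain_stats[:n] for n in (100, 1_000, 10_000)}
```

In practice you would plot a histogram of each snapshot (e.g. with matplotlib) and compare them by eye, and do the same for seats, EG, and partisan bias.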
If running the chain locally, it might take a while to run, even at 10,000 steps (~12 hours). Running it on the Princeton cluster is probably a good idea.
I recommend storing the outputs as JSON or Feather files in order to best persist the data. JSON is human-readable, which is handy, and Feather files preserve variable typing. Pickle is problematic because the pickle version used to write and read the file must match; otherwise pickle may not load your file properly.
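A minimal sketch of the JSON route using only the standard library (the per-step field names here are made up, not GerryChain output):

```python
import json
from pathlib import Path

# Made-up per-step statistics from a chain run.
results = [
    {"step": 100, "seats": 3, "efficiency_gap": 0.04, "mean_median": 0.01},
    {"step": 200, "seats": 4, "efficiency_gap": 0.02, "mean_median": 0.00},
]

# Write the results out as indented JSON so the file is easy to inspect.
path = Path("chain_results.json")
path.write_text(json.dumps(results, indent=2))

# JSON round-trips cleanly across Python versions, unlike pickle.
loaded = json.loads(path.read_text())
```

For the Feather route, pandas' `DataFrame.to_feather` / `read_feather` play the analogous role and keep column dtypes intact.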
If looking into minority representation in a year that doesn't end in "0", use block groups rather than census blocks, and do the analysis on block groups rather than precincts. This avoids the error-prone task of assigning census blocks to block groups, then assigning precincts to blocks and disaggregating.
Remember that sampling is more of an art than a science.