Skip to content

stpierre/crashes

Repository files navigation

Bike Collision Data Analytics

This repository contains a set of Python modules that process and analyze public accident report data from the Lincoln (Nebraska) Police Department in order to gather statistics about collisions that involve a bicycle.

This README documents the code. The results of the data can be found at http://stpierre.github.io/crashes/.

The CLI is divided into several subcommands that should be run in order:

  • fetch downloads the raw PDF accident reports from LPD;
  • jsonify extracts key data from the PDFs and generates a single (large) JSON document containing those data;
  • curate assists in manually categorizing all of the potentially-relevant accident reports;
  • geocode assists in manually geocoding (i.e., determining the exact location) of curated accident reports;
  • graph generates useful graphs of the curated data;
  • results generates the input data for the explanation of results.

Running any command before its "parent" commands have been run will automatically invoke the prerequisite commands.

Each command is described in more detail below.

fetch

fetch downloads the raw PDF accident reports from LPD. It does so by searching for reports by date, then screen-scraping the search results and downloading each report. It sleeps randomly between requests to avoid a DoS, or even the appearance of a DoS.

PDFs are saved to datadir.

jsonify

jsonify extracts data that we care about from the PDFs so that the data are more usable. It is very computationally expensive, and so runs multiple processes. (By default, it spawns one process per CPU.) jsonify extracts four data:

  • location: The location of the accident. Often this is just a street name, and the accident report must be read to find a specific location.
  • date: The date of the accident.
  • time: The time of the accident
  • report: The full text of the accident description.
  • injury_severity: The severity of the injury to the cyclist using LPD's bespoke 1-5 scale, where 1 is "Killed" and 5 is "No injury." 5's are rarely (if ever) reported.
  • injury_region: Body region of primary injury to the cyclist.
  • cyclist_dob: Date of birth of the cyclist.

Additional data can be added and (fairly) easily added to the data set with the --reparse-curated flag.

As it turns out, PDFs in general and the accident report PDFs specifically are an appalling disaster, so this parsing is decidedly crufty. RTFS at your peril.

jsonify takes a few(optional) arguments:

  • --processes can be used to specify the number of processes to spawn, in case you don't want to melt your CPU.
  • --reparse-curated tells jsonify to only parse those accident reports that have already been curated and identified as bike-car collisions.
  • Any additional arguments are filenames to parse, which will be used instead of trying to parse all of the PDFs in the datadir.

If filenames are supplied to jsonify, the results are printed to stdout instead of added to reports.json. This is mostly useful for testing changes to the jsonify code.

curate

Once the data have been extracted, we must find collisions that involved a bicycle. This is, unfortunately, a manual process. curate iterates over every accident report that includes the word bicycle, bike, cyclist, or bicyclist. Each report is then manually assigned one of five statuses:

  • crosswalk (C): Collision happened while a person on a bicycle was using a crosswalk.
  • sidewalk (S): Collision happened while a person on a bicycle was riding on a sidewalk. For instance, a car entering or leaving a private driveway or, in extreme situations, a car that jumps the curb.
  • road (R): Collision happened while a person on a bicycle was riding on the road, excluding intersections.
  • intersection (I): Collision happened while a person on a bicycle was riding through an intersection on the road, not using a crosswalk.
  • elsewhere (E): Collision happened elsewhere. This also includes collisions that happened on the road, but where the cyclist was not riding on the road as such. (E.g., the cyclist was crossing the street away from a crosswalk.)
  • not_involved (N): Bicycle was not involved in the collision. A cyclist may have been a witness, or a bike rack damaged, etc.

geocode

After the data has been curated, we want to geocode the bike-related collisions in order to map them; this command assists with that semi-manual process. The "Location" field on accident reports is frequently ambiguous or incomplete, so geocode iterates over each bike-related accident and attempts to use the "Location" field as provided on the report, plus any user input necessary, to look up the exact location of the accident (using the Google Geocoding API) and output GeoJSON to be used in mapping.

locate

After the data has been geocoded, this command can help find particularly interesting collisions by location. Currently it's hardcoded to find sidewalk and crosswalk collisions near bike paths. Bikeway data is taken from https://github.com/stpierre/lincoln-bike-routes

xform

Transform data so that we can produce pretty graphs of the data.

results

Render a template that includes an explanation of the results in long form. Currently that template is a Jinja2 template, so Jinja2 must be run to generate the final site.

Configuration

The following configuration options (in crashes.conf) are recognized:

Section Name Description Default
form url The POST URL of LPD's accident report search form. HTTP://CJIS.LINCOLN.NE.GOV/HTBIN/CGI.COM
form token The POST token to include in accident report search POSTs. DISK0:[020020.WWW]ACCDESK.COM
form sleep_min Minimum time, in seconds, to sleep between requests to LPD's website. 5
form sleep_max Maximum time, in seconds, to sleep between requests to LPD's website. 30
fetch days Days of accident report data to download. 365
fetch start Date (in YYYY-MM-DD format) from which to download collision data. If start is given, it takes precedence over days. None
fetch retries Number of times to retry an HTTP request to LPD's website, either for submitting the search form or for downloading a report. 3
files datadir Base directory to use for persistent data storage. ./data
files pdfdir Directory, relative to datadir, where accident report PDFs will be stored. pdfs
files all_reports File, relative to datadir, where the results of the jsonify command will be stored. reports.json
files curation_results File, relative to datadir, where the results of the curate command will be stored. curation.json
files geocoding Directory, relative to datadir, where output from the geocode command will be stored. geojson
files imagedir Directory, relative to datadir, where graph images will be stored. images
files template Jinja2 template for results. ./results.html
files results_output Filename to write results output to. ./index.html
files bike_route_geojson Path to a GeoJSON file containing all known bikeways. None
files lb716_results File, relative to datadir, Where the results of the locate command will be stored. lb716.json

About

Bicycle Crash Analytics (Lincoln, NE)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published