Parses San Francisco BART hourly origin-destination data into graph files.
For a Shiny application that is heavily based on this script, see bart-passenger-heatmap.
This script parses BART hourly origin-destination files and outputs weighted graph files. Because each edge carries a passenger count as its weight, the resulting graphs are useful for visualizing traffic across the BART network. BART provides the origin-destination data on its website, with records starting from 2011-01-01.
The script does the following:
- Parses an input graph to set up the network topology.
- Reads in the hourly data line by line.
- For each origin-destination pair, calculates the shortest path between the two stations.
  - Due to the current topology of the network, there will only ever be one shortest path.
- Adds the number of passengers to each edge on the shortest path.
- Outputs the graph file in the desired format.
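The steps above can be sketched on a toy network. This is a minimal illustration, not the script's own code: the station names, the `TOPOLOGY` dictionary, and the `shortest_path` and `accumulate` helpers are all hypothetical, and a plain breadth-first search stands in for the NetworkX shortest-path routine so that the example runs without extra packages.

```python
from collections import defaultdict, deque

# Toy topology with hypothetical station names, stored as an adjacency list.
TOPOLOGY = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B"],
    "D": ["B"],
}


def shortest_path(graph, origin, destination):
    """Return the list of stations on the shortest path, found by BFS."""
    previous = {origin: None}
    queue = deque([origin])
    while queue:
        station = queue.popleft()
        if station == destination:
            break
        for neighbor in graph[station]:
            if neighbor not in previous:
                previous[neighbor] = station
                queue.append(neighbor)
    if destination not in previous:
        raise ValueError(f"no path from {origin} to {destination}")
    # Walk back from the destination to rebuild the path.
    path = []
    station = destination
    while station is not None:
        path.append(station)
        station = previous[station]
    return path[::-1]


def accumulate(graph, trips):
    """Add each trip's passenger count to every edge on its shortest path."""
    weights = defaultdict(int)
    for origin, destination, passengers in trips:
        path = shortest_path(graph, origin, destination)
        for u, v in zip(path, path[1:]):
            # Undirected edge: sort the endpoints so A-B and B-A share a key.
            weights[tuple(sorted((u, v)))] += passengers
    return dict(weights)


# (origin, destination, passengers) rows, as they might come from the CSV.
trips = [("A", "C", 10), ("D", "A", 5), ("C", "D", 2)]
edge_weights = accumulate(TOPOLOGY, trips)
print(edge_weights)
```

Every trip contributes its passenger count to each edge it crosses, so an edge near the middle of the network ends up with the combined traffic of all trips that pass through it.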
This is a Python 3 script.
The script uses the following packages that aren't provided in base Python:
- networkx, for parsing the input graph.
- pynumparser, for range parsing the hour and weekday arguments.
Both of these can be installed with pip.
If you'd like to set up a virtual environment, the typical code to do so is below:
python3 -m venv env
source env/bin/activate
python3 -m pip install -r requirements.txt
python3 bart-hourly-dataset-parser.py [-h] [--flags] input.graph bartdata.csv output.graph
The following flags are available.
For more information, use the -h flag when running the script.
--hour allows you to subset the input data and choose the hours that you want to keep.
This uses 24-hour format, to match the input data style.
Multiple ranges are allowed.
Here is an example of a valid subset: 0-12,15,19-21,23.
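To make the range syntax concrete, here is a small sketch of how such a string could be expanded into a set of hours. The script itself relies on pynumparser for this; the `parse_ranges` helper below is a hypothetical stand-in written with the standard library only, and its bounds checking is an assumption about sensible behavior.

```python
def parse_ranges(spec, low, high):
    """Expand a string such as "0-12,15,19-21,23" into a set of integers.

    Hypothetical helper: every value must lie within [low, high].
    """
    values = set()
    for part in spec.split(","):
        start, sep, end = part.strip().partition("-")
        first = int(start)
        # A part without a dash is a single value, not a range.
        last = int(end) if sep else first
        if first > last or first < low or last > high:
            raise ValueError(f"invalid range: {part!r}")
        values.update(range(first, last + 1))
    return values


hours = parse_ranges("0-12,15,19-21,23", 0, 23)
print(sorted(hours))
```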
--weekday allows you to subset the input data and choose only the weekdays that you want to keep.
The numbering matches the system used by weekday() in Python's datetime module.
Monday corresponds to 0, and Sunday corresponds to 6.
Multiple ranges are allowed.
Here is an example of a valid subset: 0-2,4,6.
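As an illustration of the numbering, the sketch below checks a few dates against the subset 0-2,4,6. The `keep_day` helper and the `kept_weekdays` set are hypothetical names used only for this example; `date.weekday()` is the standard library method that returns 0 for Monday through 6 for Sunday.

```python
from datetime import date

# The subset 0-2,4,6 written out: Monday to Wednesday, Friday, and Sunday.
kept_weekdays = {0, 1, 2, 4, 6}


def keep_day(day, weekdays):
    """Return True if the date falls on one of the selected weekdays."""
    return day.weekday() in weekdays


monday = date(2016, 1, 25)
thursday = date(2016, 1, 28)
sunday = date(2016, 1, 31)

for day in (monday, thursday, sunday):
    print(day.isoformat(), day.weekday(), keep_day(day, kept_weekdays))
```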
--startdate and --enddate allow you to subset the input data and choose a range of dates that you want to keep.
If neither flag is provided, the script will parse all data in the input CSV.
The script expects ISO 8601-style dates.
Therefore, 25 January 2016 corresponds to 2016-01-25.
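A date filter of this kind could look like the sketch below. The `in_date_range` helper is hypothetical, and treating both bounds as inclusive is an assumption rather than documented behavior of the script. `date.fromisoformat` is available in Python 3.7 and later.

```python
from datetime import date


def in_date_range(day, start=None, end=None):
    """Return True if day lies within [start, end].

    A bound of None means that side is unbounded. Inclusive bounds are an
    assumption made for this sketch.
    """
    if start is not None and day < start:
        return False
    if end is not None and day > end:
        return False
    return True


start = date.fromisoformat("2016-01-25")
end = date.fromisoformat("2016-02-05")

print(in_date_range(date(2016, 1, 25), start, end))
print(in_date_range(date(2016, 2, 6), start, end))
print(in_date_range(date(2011, 1, 1)))
```

With no bounds given, every date passes, which mirrors parsing all data in the input CSV.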
-d and --directed allow you to specify that a directed graph should be output.
By default, the script outputs an undirected graph.
-k and --keepweights allow you to keep the original weights from the input graph file.
The script will then add on the weight from the input CSV.
This is useful if you are, say, trying to grab data from multiple years, and don't want to combine the data files for each year by hand.
This flag shouldn't be used for the first run, when initially generating data from the basic topological graph.
The edge weights will be slightly wrong in that case.
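The effect of keeping weights can be sketched with plain dictionaries. Everything here is illustrative: the `combine_weights` helper and the edge names are hypothetical, and the placeholder weight of 1 on each edge of the topology file is an assumption about why a first run with this flag would be slightly off.

```python
def combine_weights(input_weights, csv_weights, keep_weights=False):
    """Return per-edge totals, optionally starting from the input weights."""
    combined = {}
    for edge, original in input_weights.items():
        base = original if keep_weights else 0
        combined[edge] = base + csv_weights.get(edge, 0)
    return combined


# Assumed placeholder weights in the basic topological graph.
topology = {("A", "B"): 1, ("B", "C"): 1}
year_one = {("A", "B"): 100, ("B", "C"): 40}
year_two = {("A", "B"): 150, ("B", "C"): 60}

# First run: start from zero so the placeholder weights are discarded.
first_run = combine_weights(topology, year_one)
# Second run: feed the first output back in and keep its weights.
second_run = combine_weights(first_run, year_two, keep_weights=True)
# Keeping weights on the first run would carry the placeholders along.
skewed = combine_weights(topology, year_one, keep_weights=True)

print(first_run, second_run, skewed)
```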
input.graph is the representation of the BART network that serves as the basis for calculating shortest paths.
An example file has already been provided at data/bart.net.
Note that this file will need to be modified to keep up with expansions in the BART network.
See the Dataset Caveats section below.
bartdata.csv is one of the CSV files provided by BART on their website. The script offers basic options for subsetting the data. If you need more detailed subsetting, I recommend doing so with another program.
output.graph is the output graph file. Currently, the following file formats are supported:
- GEXF
- Pajek NET
NetworkX supports many other graph file formats. Adding these would be fairly simple.
The following sections were accurate when written, on 2021-07-31. Later changes to the BART network may make parts of them out of date.
BART has been changing since the start of their data set collection. In particular, the following changes have been made since 2011-01-01, the first day available in the dataset:
The West Dublin / Pleasanton station opened for revenue service on 2011-02-19. Dates before this (of which there are not that many) will have traffic identical on both sides of the station, as this is an infill station.
The Oakland International Airport (OAK) station opened for revenue service on 2014-11-22. Dates before this will have no traffic to this node, as this is a terminal station. This may pose a problem for certain programs, such as Gephi.
The Warm Springs / South Fremont station opened for revenue service on 2017-03-25. Dates before this will have no traffic to this node, as this is a terminal station. This may pose a problem for certain programs, such as Gephi.
The eBART extension to Pittsburg Center and Antioch opened for revenue service on 2018-05-26. Dates before this will have no traffic to either of these stations, as these nodes extend past a terminal station. This may pose a problem for certain programs, such as Gephi.
The Milpitas and Berryessa stations opened for revenue service on 2020-06-13. Dates before this will have no traffic to these nodes, as Berryessa is a terminal station. This may pose a problem for certain programs, such as Gephi.
There are also changes that have been made or planned since the last day available in the dataset.
The Silicon Valley BART extension to the following stations is currently planned to open in 2030:
- Alum Rock
- Downtown San Jose
- Diridon / Arena
- Santa Clara
The input network file and the station_names dictionary in utils/input.py will need to be modified to support these stations when they open.
The BART network changes at certain hours of the day. The following lines run to Millbrae and San Francisco International Airport (SFO):
- Antioch - SFO / Millbrae
- Richmond - Daly City / Millbrae
On weekdays before 21:00, the Richmond - Daly City / Millbrae line continues on from Daly City to Millbrae, skipping SFO. After 21:00 on weekdays, and on weekends, that line terminates at Daly City, and a separate line (the SFO-Millbrae Shuttle) runs instead.
In order to be completely correct, the script should take this into account. Traffic that has Millbrae as its origin or destination should be checked to see on which day and at what time it occurs. If the traffic occurs before 21:00 on a weekday, only the San Bruno - Millbrae edge should have weight added to it. At all other times, both the San Bruno - SFO and SFO - Millbrae edges should have weight added to them.
This is not a significant change, but I have not added it to the script. Thus, results involving those stations are slightly incorrect.
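For reference, the rule described above could be sketched as follows. This is not part of the script: the `millbrae_edges` function is hypothetical, and treating the hour 21 bucket as "after 21:00" is an assumption about how the hourly data would map onto the schedule.

```python
def millbrae_edges(weekday, hour):
    """Return the edges that Millbrae traffic should add weight to.

    weekday follows Python's datetime numbering (0 is Monday, 6 is Sunday)
    and hour is in 24-hour format (0 to 23).
    """
    direct_service = weekday <= 4 and hour < 21
    if direct_service:
        # The Richmond line runs through to Millbrae and skips SFO.
        return [("San Bruno", "Millbrae")]
    # Evenings and weekends: traffic is routed via SFO.
    return [("San Bruno", "SFO"), ("SFO", "Millbrae")]


print(millbrae_edges(weekday=2, hour=8))
print(millbrae_edges(weekday=2, hour=22))
print(millbrae_edges(weekday=6, hour=8))
```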
I have a few changes in mind for the future:
- Add new BART expansions to the script
- Update the script logic to account for the Millbrae and SFO edge change, described above
- Add the ability to read and write the rest of the graph file formats that NetworkX supports