Code originally in support of this post: "Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance"
This repo provides scripts to download, process, and analyze data for billions of taxi and for-hire vehicle (Uber, Lyft, etc.) trips originating in New York City since 2009. Most of the raw data comes from the NYC Taxi & Limousine Commission.
The data is stored in a PostgreSQL database, and uses PostGIS for spatial calculations.
Statistics through June 30, 2018:
- 2.07 billion total trips
- 1.55 billion taxi
- 519 million for-hire vehicle
- 253 GB of raw data
- Database takes up 347 GB on disk with minimal indexes
1. Install PostgreSQL and PostGIS
Both are available via Homebrew on Mac
./download_raw_data.sh && ./remove_bad_rows.sh
The remove_bad_rows.sh
script fixes two particular files that have a few rows with too many columns. See the "data issues" section below for more.
Note that the raw data is hundreds of GB, so it will take a while to download.
./initialize_database.sh
./import_trip_data.sh
./import_fhv_trip_data.sh
The full import process takes ~36 hours on a 2013 MacBook Pro with 16 GB of RAM.
The FiveThirtyEight Uber dataset contains Uber trip records from Apr–Sep 2014. Uber and other FHV (Lyft, Juno, Via, etc.) data is available since Jan 2015 in the TLC's data.
./download_raw_2014_uber_data.sh
./import_2014_uber_trip_data.sh
Additional Postgres and R scripts for analysis are in the analysis/
folder, or you can do your own!
trips
table contains all yellow and green taxi trips. Each trip has acab_type_id
, which references thecab_types
table and refers to one ofyellow
orgreen
fhv_trips
table contains all for-hire vehicle trip records, including ride-hailing apps Uber, Lyft, Via, and Junofhv_bases
mapsfhv_trips
to base names and "doing business as" labels, which include ride-hailing app namesnyct2010
table contains NYC census tracts plus the Newark Airport. It also maps census tracts to NYC's official neighborhood tabulation areastaxi_zones
table contains the TLC's official taxi zone boundaries. Starting in July 2016, the TLC no longer provides pickup and dropoff coordinates. Instead, each trip comes with taxi zone pickup and dropoff location IDscentral_park_weather_observations
has summary weather data by date
These are bundled with the repository, so no need to download separately, but:
- Shapefile for NYC census tracts and neighborhood tabulation areas comes from Bytes of the Big Apple
- Shapefile for taxi zone locations comes from the TLC
- Mapping of FHV base numbers to names comes from the TLC
- Central Park weather data comes from the National Climatic Data Center
- Remove carriage returns and empty lines from TLC data before passing to Postgres
COPY
command - Some raw data files have extra columns with empty data, had to create dummy columns
junk1
andjunk2
to absorb them - Two of the
yellow
taxi raw data files had a small number of rows containing extra columns. I discarded these rows - The official NYC neighborhood tabulation areas (NTAs) included in the census tracts shapefile are not exactly what I would have expected. Some of them are bizarrely large and contain more than one neighborhood, e.g. "Hudson Yards-Chelsea-Flat Iron-Union Square", while others are confusingly named, e.g. "North Side-South Side" for what I'd call "Williamsburg", and "Williamsburg" for what I'd call "South Williamsburg". In a few instances I modified NTA names, but I kept the NTA geographic definitions
- The shapefile includes only NYC census tracts. Trips to New Jersey, Long Island, Westchester, and Connecticut are not mapped to census tracts, with the exception of the Newark Airport
Google BigQuery and Amazon Redshift would probably provide significant performance improvements over PostgreSQL. A lot of the data is already available on BigQuery, but in scattered tables, and each trip has only latitude and longitude coordinates, not census tracts and neighborhoods. PostGIS seemed like the easiest way to map coordinates to census tracts. Once the mapping is complete, it might make sense to load the data back into BigQuery or Redshift to make the analysis faster. Note that BigQuery and Redshift cost some amount of money, while PostgreSQL and PostGIS are free.
Mark Litwintschik has used the taxi dataset to benchmark performance of many different technology stacks, his summary is here: http://tech.marksblogg.com/benchmarks.html
There's a Ruby script in the tlc_statistics/
folder to import data from the TLC's summary statistics reports:
ruby import_statistics_data.rb
Code in support of the post "When Are Citi Bikes Faster Than Taxis in New York City?" lives in the citibike_comparison/
folder
Code in support of the 2017 update to the original post lives in the analysis/2017_update/
folder
[email protected], or open a GitHub issue