Analyzing TTC bus, subway, and streetcar delay data to identify delay hotspots and root causes by building an ETL (Extract, Transform, Load) pipeline and visualizing the results in Looker Studio dashboards.
Data source: TTC bus, subway & streetcar delay data from the Open Data Catalogue - City of Toronto Open Data Portal
Subway
Field Name | Description | Example |
---|---|---|
Date | Date of the incident | 12/31/2016 |
Time | Time (24h clock) | 1:59 |
Day | Name of the day of the week | Saturday |
Station | TTC subway station name | Rosedale Station |
Code | TTC delay code | MUIS |
Min Delay | Delay (in minutes) to subway service | 5 |
Min Gap | Time length (in minutes) between trains | 9 |
Bound | Direction of the train, dependent on the line | N |
Line | TTC subway line i.e. YU, BD, SHP, and SRT | YU |
Vehicle | TTC train number | 5961 |
Bus & Streetcar
Field Name | Description | Example |
---|---|---|
Report Date | The date when the delay-causing incident occurred | 6/20/2017 |
Route | The number of the bus route | 51 |
Time | The time (hh:mm:ss AM/PM) when the delay-causing incident occurred | 12:35:00 AM |
Day | The name of the day | Monday |
Location | The location of the delay-causing incident | York Mills Station |
Incident | The description of the delay-causing incident | Mechanical |
Min Delay | The delay, in minutes, to the schedule for the following bus | 10 |
Min Gap | The total scheduled time, in minutes, between the bus ahead and the following bus | 20 |
Direction | The direction of the bus route, where B, b, or BW indicates both ways (on an east-west route, both east and west); NB - northbound, SB - southbound, EB - eastbound, WB - westbound | N |
Vehicle | Vehicle number | 1057 |
This setup creates a VM instance on GCP (Google Cloud Platform), and all tasks are performed on the VM.
(Note) Terraform must be run from a local machine, since the VM itself is created by Terraform.
Architecture Diagram (ETL pipeline) :
- We will use Terraform to set up the infrastructure
- We will use a VM instance to run everything
- We will use Docker to run Airflow inside the VM
- We will use Airflow to orchestrate the entire pipeline
- We will use GCS as our data lake and BigQuery as our data warehouse
- We will use Spark to process, transform and clean the data
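The Terraform step above can be sketched as a minimal configuration. This is an illustrative fragment, not the project's actual code: the project ID, VM name, machine type, zone, and bucket name are all placeholder assumptions.

```hcl
# Minimal sketch -- all names and values below are hypothetical placeholders.
provider "google" {
  project = "my-ttc-project"   # hypothetical GCP project ID
  region  = "us-central1"
}

# VM instance that will run Docker + Airflow
resource "google_compute_instance" "etl_vm" {
  name         = "ttc-etl-vm"
  machine_type = "e2-standard-4"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network       = "default"
    access_config {}   # attaches an ephemeral external IP
  }
}

# GCS bucket used as the data lake
resource "google_storage_bucket" "data_lake" {
  name     = "ttc-delay-data-lake"   # hypothetical bucket name
  location = "US"
}
```

Running `terraform apply` from the local machine would then provision the VM and bucket before any pipeline work begins.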
Processing: Batch, using Spark
Frequency: Yearly (scheduled with Airflow)
The frequency can be increased to monthly, but that is not cost-effective.
Airflow DAGs:
src_to_gcs_dag.py downloads the data from the Toronto Open Data Portal, converts it to Parquet, loads it into a GCS bucket, and removes it from the local folder once it is available in the cloud.
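The steps src_to_gcs_dag.py performs can be sketched as plain Python functions. This is a simplified outline, not the real DAG: the function names and path layout are assumptions, and the actual file wraps these steps in Airflow tasks.

```python
import os
import urllib.request


def gcs_object_name(dataset: str, year: int) -> str:
    """Build the object path used inside the GCS bucket (layout is an assumption),
    e.g. 'bus/2023/bus-delay-2023.parquet'."""
    return f"{dataset}/{year}/{dataset}-delay-{year}.parquet"


def download(url: str, local_path: str) -> str:
    """Fetch the source file from the open data portal (network call)."""
    urllib.request.urlretrieve(url, local_path)
    return local_path


def csv_to_parquet(csv_path: str) -> str:
    """Convert the downloaded file to Parquet; pandas + pyarrow are assumed
    to be available in the Airflow image."""
    import pandas as pd  # deferred import: only needed when the task runs
    parquet_path = os.path.splitext(csv_path)[0] + ".parquet"
    pd.read_csv(csv_path).to_parquet(parquet_path)
    return parquet_path


def upload_and_cleanup(bucket, parquet_path: str, object_name: str) -> None:
    """Upload to GCS (google-cloud-storage Bucket object), then delete the
    local copy once it is safely in the cloud."""
    bucket.blob(object_name).upload_from_filename(parquet_path)
    os.remove(parquet_path)
```

In the real DAG, each of these would be a task (e.g. a PythonOperator), chained in this order so the local file is only removed after the upload succeeds.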
dataproc_job_dag.py uploads the main Python file (spark_job.py) to the GCS bucket so Dataproc can fetch it, sets the service account that authorizes the Airflow container to use GCP services, creates a Dataproc cluster, runs the Spark job, and then deletes the cluster (to save cost).
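The create-run-delete cluster lifecycle can be sketched as the equivalent `gcloud` command sequence. The real DAG presumably uses Airflow's Dataproc operators instead; the cluster name, region, flags, and job URI below are illustrative assumptions.

```python
import subprocess


def dataproc_lifecycle(cluster: str, region: str, job_uri: str) -> list[list[str]]:
    """Return the ordered gcloud commands: create cluster, submit the PySpark
    job, then delete the cluster so it does not keep billing."""
    return [
        # Create a small (single-node) cluster to keep cost down -- size is an assumption
        ["gcloud", "dataproc", "clusters", "create", cluster,
         f"--region={region}", "--single-node"],
        # Submit spark_job.py, which Dataproc fetches from the GCS bucket
        ["gcloud", "dataproc", "jobs", "submit", "pyspark", job_uri,
         f"--cluster={cluster}", f"--region={region}"],
        # Tear the cluster down as soon as the job finishes
        ["gcloud", "dataproc", "clusters", "delete", cluster,
         f"--region={region}", "--quiet"],
    ]


def run_pipeline(cluster: str, region: str, job_uri: str) -> None:
    """Execute the three steps in order, stopping on the first failure."""
    for cmd in dataproc_lifecycle(cluster, region, job_uri):
        subprocess.run(cmd, check=True)
```

Deleting the cluster immediately after the job is what makes the yearly batch schedule cheap: compute is only paid for during the run itself.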
There are no delay records for July to December 2023 because this image was taken in August 2023.