Skip to content

Analyzing TTC bus, subway & streetcar delay data to identify delay hotspots, root causes, and find valuable insights by creating an ETL pipeline and a dashboard

Notifications You must be signed in to change notification settings

LaeekAhmed/TTC-delay-analytics

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TTC Delay Analytics

Analyzing TTC bus, subway & streetcar delay data to identify delay hotspots, root causes, and find valuable insights by creating an ETL (Extract, Transform, Load) pipeline and using Looker Studio to create dashboards.

diagram

Source data

Data source - TTC bus, subway. streetcar delay data : Open Data Catalogue - City of Toronto Open Data Portal

Understanding the source data

Subway

Field Name Description Example
Date Date (YYYY/MM/DD) 12/31/2016
Time Time (24h clock) 1:59
Day Name of the day of the week Saturday
Station TTC subway station name Rosedale Station
Code TTC delay code MUIS
Min Delay Delay (in minutes) to subway service 5
Min Gap Time length (in minutes) between trains 9
Bound Direction of train dependant on the line N
Line TTC subway line i.e. YU, BD, SHP, and SRT YU
Vehicle TTC train number 5961

Bus & Streetcar

Field Name Description Example
Report Date The date (YYYY/MM/DD) when the delay-causing incident occurred 6/20/2017
Route The number of the bus route 51
Time The time (hh:mm:ss AM/PM) when the delay-causing incident occurred 12:35:00 AM
Day The name of the day Monday
Location The location of the delay-causing incident York Mills Station
Incident The description of the delay-causing incident Mechanical
Min Delay The delay, in minutes, to the schedule for the following bus 10
Min Gap The total scheduled time, in minutes, from the bus ahead of the following bus 20
Direction The direction of the bus route where B,b or BW indicates both ways.
(On an east-west route, it includes both east and west)
NB - northbound, SB - southbound, EB - eastbound, WB - westbound
N
Vehicle Vehicle number 1057

Infrastructure

This setup creates a VM instance using gcp and will perform all tasks using the VM.

(Note) Terraform has to be used in a local setup as the VM will be created using Terraform

Architecture Diagram (ETL pipeline) :

diagram

  • We will use Terraform to set up the infrastructure
  • We will use a VM instance to run everything
  • We will use Docker to run Airflow inside the VM
  • We will use Airflow to orchestrate the entire pipeline
  • We will gcs as our data lake and BigQuery as our data warehouse
  • We will use Spark to process, transform and clean the data

Processing: Batch using spark
Frequency: Every year (scheduled using Airflow)
Can also be increased to every month but that is not cost-effective

Airflow Dags :

Alt text

src_to_gcs_dag.py downloads data from the Toronto open portal website, converts it to parquet form, loads it into gcs bucket and removes it from local folder once it is available on the cloud.

Alt text

dataproc_job_dag.py uploads the main python file spark_job.py for running the spark job to gcs bucket so dataproc can fetch it, set the service account which authorizes the airflow container to work with gcp services, create a cluster, run the job and then delete it (saves cost)

Dashboard

looker studio dashboard ↗

Alt text

There are no delay records for July - December 2023 as this image was taken in August 2023

Alt text

About

Analyzing TTC bus, subway & streetcar delay data to identify delay hotspots, root causes, and find valuable insights by creating an ETL pipeline and a dashboard

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published