Analyzing TTC bus, subway, and streetcar delay data to identify delay hotspots and root causes by building an ETL (Extract, Transform, Load) pipeline and visualizing the results in Looker Studio dashboards.
Data source: TTC bus, subway & streetcar delay data from the Open Data Catalogue - City of Toronto Open Data Portal
Subway
Field Name | Description | Example |
---|---|---|
Date | Date of the incident | 12/31/2016 |
Time | Time (24h clock) | 1:59 |
Day | Name of the day of the week | Saturday |
Station | TTC subway station name | Rosedale Station |
Code | TTC delay code | MUIS |
Min Delay | Delay (in minutes) to subway service | 5 |
Min Gap | Time length (in minutes) between trains | 9 |
Bound | Direction of the train, dependent on the line | N |
Line | TTC subway line i.e. YU, BD, SHP, and SRT | YU |
Vehicle | TTC train number | 5961 |
Bus & Streetcar
Field Name | Description | Example |
---|---|---|
Report Date | The date when the delay-causing incident occurred | 6/20/2017 |
Route | The number of the bus route | 51 |
Time | The time (hh:mm:ss AM/PM) when the delay-causing incident occurred | 12:35:00 AM |
Day | The name of the day | Monday |
Location | The location of the delay-causing incident | York Mills Station |
Incident | The description of the delay-causing incident | Mechanical |
Min Delay | The delay, in minutes, to the schedule for the following bus | 10 |
Min Gap | The total scheduled time, in minutes, between the bus ahead and the following bus | 20 |
Direction | The direction of the bus route, where B, b, or BW indicates both ways (on an east-west route, both east and west); NB - northbound, SB - southbound, EB - eastbound, WB - westbound | N |
Vehicle | Vehicle number | 1057 |
This setup creates a VM instance on GCP (Google Cloud Platform), and all tasks are performed on the VM.
(Note) Terraform must be run from a local machine, since the VM itself is created by Terraform.
Architecture Diagram (ETL pipeline) :
- We will use Terraform to set up the infrastructure
- We will use a VM instance to run everything
- We will use Docker to run Airflow inside the VM
- We will use Airflow to orchestrate the entire pipeline
- We will use GCS as our data lake and BigQuery as our data warehouse
- We will use Spark to process, transform and clean the data
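The Terraform step above can be sketched as a minimal configuration. This is an illustrative fragment, not the project's actual code: the project ID, VM name, machine type, zone, and bucket name are all placeholder assumptions.

```hcl
# Minimal sketch -- all names and values below are hypothetical placeholders.
provider "google" {
  project = "my-ttc-project"   # hypothetical GCP project ID
  region  = "us-central1"
}

# VM instance that will run Docker + Airflow
resource "google_compute_instance" "etl_vm" {
  name         = "ttc-etl-vm"
  machine_type = "e2-standard-4"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network       = "default"
    access_config {}   # attaches an ephemeral external IP
  }
}

# GCS bucket used as the data lake
resource "google_storage_bucket" "data_lake" {
  name     = "ttc-delay-data-lake"   # hypothetical bucket name
  location = "US"
}
```

Running `terraform apply` from the local machine would then provision the VM and bucket before any pipeline work begins.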
Processing: Batch, using Spark
Frequency: Yearly (scheduled with Airflow)
The frequency can be increased to monthly, but that is not cost-effective.
Airflow DAGs:
src_to_gcs_dag.py downloads the data from the Toronto Open Data Portal, converts it to Parquet, loads it into a GCS bucket, and removes it from the local folder once it is available in the cloud.
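The steps src_to_gcs_dag.py performs can be sketched as plain Python functions. This is a simplified outline, not the real DAG: the function names and path layout are assumptions, and the actual file wraps these steps in Airflow tasks.

```python
import os
import urllib.request


def gcs_object_name(dataset: str, year: int) -> str:
    """Build the object path used inside the GCS bucket (layout is an assumption),
    e.g. 'bus/2023/bus-delay-2023.parquet'."""
    return f"{dataset}/{year}/{dataset}-delay-{year}.parquet"


def download(url: str, local_path: str) -> str:
    """Fetch the source file from the open data portal (network call)."""
    urllib.request.urlretrieve(url, local_path)
    return local_path


def csv_to_parquet(csv_path: str) -> str:
    """Convert the downloaded file to Parquet; pandas + pyarrow are assumed
    to be available in the Airflow image."""
    import pandas as pd  # deferred import: only needed when the task runs
    parquet_path = os.path.splitext(csv_path)[0] + ".parquet"
    pd.read_csv(csv_path).to_parquet(parquet_path)
    return parquet_path


def upload_and_cleanup(bucket, parquet_path: str, object_name: str) -> None:
    """Upload to GCS (google-cloud-storage Bucket object), then delete the
    local copy once it is safely in the cloud."""
    bucket.blob(object_name).upload_from_filename(parquet_path)
    os.remove(parquet_path)
```

In the real DAG, each of these would be a task (e.g. a PythonOperator), chained in this order so the local file is only removed after the upload succeeds.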
dataproc_job_dag.py uploads the main Python file (spark_job.py) to the GCS bucket so Dataproc can fetch it, sets the service account that authorizes the Airflow container to use GCP services, creates a Dataproc cluster, runs the Spark job, and then deletes the cluster (to save cost).
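The create-run-delete cluster lifecycle can be sketched as the equivalent `gcloud` command sequence. The real DAG presumably uses Airflow's Dataproc operators instead; the cluster name, region, flags, and job URI below are illustrative assumptions.

```python
import subprocess


def dataproc_lifecycle(cluster: str, region: str, job_uri: str) -> list[list[str]]:
    """Return the ordered gcloud commands: create cluster, submit the PySpark
    job, then delete the cluster so it does not keep billing."""
    return [
        # Create a small (single-node) cluster to keep cost down -- size is an assumption
        ["gcloud", "dataproc", "clusters", "create", cluster,
         f"--region={region}", "--single-node"],
        # Submit spark_job.py, which Dataproc fetches from the GCS bucket
        ["gcloud", "dataproc", "jobs", "submit", "pyspark", job_uri,
         f"--cluster={cluster}", f"--region={region}"],
        # Tear the cluster down as soon as the job finishes
        ["gcloud", "dataproc", "clusters", "delete", cluster,
         f"--region={region}", "--quiet"],
    ]


def run_pipeline(cluster: str, region: str, job_uri: str) -> None:
    """Execute the three steps in order, stopping on the first failure."""
    for cmd in dataproc_lifecycle(cluster, region, job_uri):
        subprocess.run(cmd, check=True)
```

Deleting the cluster immediately after the job is what makes the yearly batch schedule cheap: compute is only paid for during the run itself.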
There are no delay records for July to December 2023 because this image was taken in August 2023.