For this homework, you will need the following datasets:
- Green Taxi dataset (2019 and 2020)
- Yellow Taxi dataset (2019 and 2020)
- For Hire Vehicle dataset (2019)
Before you start,
- Make sure you, at least, have them in GCS with an External Table OR a Native Table - use whichever method you prefer to load them (Workflow Orchestration with pandas-gbq, dlt for GCS, dlt for BigQuery, gsutil, etc.); a BigQuery external-table sketch follows the note below
- You should have exactly 7,778,101 records in your Green Taxi table
- You should have exactly 109,047,518 records in your Yellow Taxi table
- You should have exactly 43,244,696 records in your FHV table
- Build the staging models for green/yellow as shown here
Note: If you don't have access to GCP, you can spin up a local Postgres instance and ingest the datasets above
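If you go the BigQuery route, a minimal sketch of an external table over the GCS files might look like this (the project, dataset, bucket name, and file layout here are assumptions, not part of the homework):

```sql
-- Hypothetical names throughout; adjust to your own project/bucket.
CREATE OR REPLACE EXTERNAL TABLE `myproject.raw_nyc_tripdata.ext_green_taxi`
OPTIONS (
  format = 'PARQUET',
  uris = [
    'gs://my-nyc-tripdata-bucket/green/green_tripdata_2019-*.parquet',
    'gs://my-nyc-tripdata-bucket/green/green_tripdata_2020-*.parquet'
  ]
);
```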
Provided you've got the following `sources.yaml`:

```yaml
version: 2

sources:
  - name: raw_nyc_tripdata
    database: "{{ env_var('DBT_BIGQUERY_PROJECT', 'dtc_zoomcamp_2025') }}"
    schema: "{{ env_var('DBT_BIGQUERY_SOURCE_DATASET', 'raw_nyc_tripdata') }}"
    tables:
      - name: ext_green_taxi
      - name: ext_yellow_taxi
```
with the following env variables set up where dbt runs:

```bash
export DBT_BIGQUERY_PROJECT=myproject
export DBT_BIGQUERY_DATASET=my_nyc_tripdata
```

What does this .sql model compile to?

```sql
select *
from {{ source('raw_nyc_tripdata', 'ext_green_taxi') }}
```
- `select * from dtc_zoomcamp_2025.raw_nyc_tripdata.ext_green_taxi`
- `select * from dtc_zoomcamp_2025.my_nyc_tripdata.ext_green_taxi`
- `select * from myproject.raw_nyc_tripdata.ext_green_taxi`
- `select * from myproject.my_nyc_tripdata.ext_green_taxi`
- `select * from dtc_zoomcamp_2025.raw_nyc_tripdata.green_taxi`
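As a generic reminder (independent of the options above): dbt's `env_var` returns the environment variable's value when it is set, falls back to the second argument when one is provided, and fails compilation when neither exists. A toy illustration with a made-up variable name:

```sql
-- MY_VAR is hypothetical: renders its value if exported, else 'fallback'
select '{{ env_var("MY_VAR", "fallback") }}' as resolved_value
```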
Say you have to modify the following dbt model (`fct_recent_taxi_trips.sql`) to enable Analytics Engineers to dynamically control the date range:

- In development, you want to process only the last 7 days of trips
- In production, you need to process the last 30 days for analytics

```sql
select *
from {{ ref('fact_taxi_trips') }}
where pickup_datetime >= CURRENT_DATE - INTERVAL '30 days'
```
What would you change to accomplish that in such a way that command-line arguments take precedence over ENV_VARs, which in turn take precedence over the DEFAULT value?
- Add `ORDER BY pickup_datetime DESC` and `LIMIT {{ var("days_back", 30) }}`
- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ var("days_back", 30) }}' DAY`
- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ env_var("DAYS_BACK", "30") }}' DAY`
- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ var("days_back", env_var("DAYS_BACK", "30")) }}' DAY`
- Update the WHERE clause to `pickup_datetime >= CURRENT_DATE - INTERVAL '{{ env_var("DAYS_BACK", var("days_back", "30")) }}' DAY`
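For reference on the precedence mechanics (not an answer key): values passed via `--vars` on the command line override whatever default a model's `var()` call would otherwise fall back to. A hypothetical dev invocation:

```bash
# Shrink the window to 7 days from the CLI; names taken from the question above
dbt run --select fct_recent_taxi_trips --vars '{"days_back": 7}'
```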
Considering the data lineage below, and that `taxi_zone_lookup` is the only materialization built from a `.csv` seed file:

Select the option that does NOT apply for materializing `fct_taxi_monthly_zone_revenue`:

- `dbt run`
- `dbt run --select +models/core/dim_taxi_trips.sql+ --target prod`
- `dbt run --select +models/core/fct_taxi_monthly_zone_revenue.sql`
- `dbt run --select +models/core/`
- `dbt run --select models/staging/+`
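As a refresher on dbt's graph operators (generic, not tied to the lineage above):

```bash
dbt run --select +my_model     # my_model plus everything upstream of it
dbt run --select my_model+     # my_model plus everything downstream of it
dbt run --select +my_model+    # upstream deps, my_model, and downstream deps
```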
Consider you're dealing with sensitive data (e.g.: PII) that is only available to your team and a very select few individuals, in the raw layer of your DWH (e.g.: a specific BigQuery dataset or PostgreSQL schema).

- Among other things, you decide to obfuscate/mask that data through your staging models, and make it available in a different schema (a staging layer) for other Data/Analytics Engineers to explore
- And optionally, yet another layer (a service layer), where you'll build your dimension (`dim_`) and fact (`fct_`) tables (assuming Star Schema dimensional modeling) for dashboarding and for Tech Product Owners/Managers
You decide to write a macro to wrap that logic:

```sql
{% macro resolve_schema_for(model_type) -%}

    {%- set target_env_var = 'DBT_BIGQUERY_TARGET_DATASET'  -%}
    {%- set stging_env_var = 'DBT_BIGQUERY_STAGING_DATASET' -%}

    {%- if model_type == 'core' -%} {{- env_var(target_env_var) -}}
    {%- else -%}                    {{- env_var(stging_env_var, env_var(target_env_var)) -}}
    {%- endif -%}

{%- endmacro %}
```

And use it in your staging, `dim_`, and `fct_` models as:

```sql
{{ config(
    schema=resolve_schema_for('core'),
) }}
```
That all being said, regarding the macro above, select all statements that are true for the models using it:

- Setting a value for the `DBT_BIGQUERY_TARGET_DATASET` env var is mandatory, or it'll fail to compile
- Setting a value for the `DBT_BIGQUERY_STAGING_DATASET` env var is mandatory, or it'll fail to compile
- When using `core`, it materializes in the dataset defined in `DBT_BIGQUERY_TARGET_DATASET`
- When using `stg`, it materializes in the dataset defined in `DBT_BIGQUERY_STAGING_DATASET`, or defaults to `DBT_BIGQUERY_TARGET_DATASET`
- When using `staging`, it materializes in the dataset defined in `DBT_BIGQUERY_STAGING_DATASET`, or defaults to `DBT_BIGQUERY_TARGET_DATASET`
Alright, in module 1 you had a SQL refresher, so now let's build on top of that with some serious SQL.
These are not meant to be easy, but they'll boost your SQL and Analytics skills to the next level.
So, without further ado, let's get started...
You might want to add some new dimensions `year` (e.g.: 2019, 2020), `quarter` (1, 2, 3, 4), `year_quarter` (e.g.: `2019/Q1`, `2019-Q2`), and `month` (e.g.: 1, 2, ..., 12) to your `fct_taxi_trips` or `dim_taxi_trips.sql` models to facilitate filtering your queries; a minimal sketch follows.
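A minimal sketch of those dimensions, assuming BigQuery SQL and a `pickup_datetime` timestamp column:

```sql
select
    *,
    extract(year from pickup_datetime) as year,
    extract(quarter from pickup_datetime) as quarter,
    extract(month from pickup_datetime) as month,
    -- e.g.: '2019/Q1'
    concat(
        cast(extract(year from pickup_datetime) as string),
        '/Q',
        cast(extract(quarter from pickup_datetime) as string)
    ) as year_quarter
from {{ ref('fct_taxi_trips') }}
```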
- Create a new model `fct_taxi_trips_quarterly_revenue.sql`
- Compute the Quarterly Revenues for each year, based on `total_amount`
- Compute the Quarterly YoY (Year-over-Year) revenue growth (see the sketch after this list)
  - e.g.: In 2020/Q1, Green Taxi had -12.34% revenue growth compared to 2019/Q1
  - e.g.: In 2020/Q4, Yellow Taxi had +34.56% revenue growth compared to 2019/Q4
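A minimal sketch of the YoY computation, assuming the year/quarter dimensions suggested earlier exist on `fct_taxi_trips`:

```sql
with quarterly as (
    select
        service_type,
        year,
        quarter,
        sum(total_amount) as revenue
    from {{ ref('fct_taxi_trips') }}
    group by 1, 2, 3
)

select
    *,
    -- growth vs. the same quarter of the previous year
    round(
        100 * (revenue / lag(revenue) over (
            partition by service_type, quarter
            order by year
        ) - 1),
        2
    ) as yoy_growth_pct
from quarterly
```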
Considering the YoY Growth in 2020, which were the yearly quarters with the best (or least bad) and worst results for green and yellow?
- green: {best: 2020/Q2, worst: 2020/Q1}, yellow: {best: 2020/Q2, worst: 2020/Q1}
- green: {best: 2020/Q2, worst: 2020/Q1}, yellow: {best: 2020/Q3, worst: 2020/Q4}
- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q2, worst: 2020/Q1}
- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q1, worst: 2020/Q2}
- green: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q3, worst: 2020/Q4}
- Create a new model `fct_taxi_trips_monthly_fare_p95.sql`
- Filter out invalid entries (keep only `fare_amount > 0`, `trip_distance > 0`, and `payment_type_description in ('Cash', 'Credit Card')`)
- Compute the continuous percentile of `fare_amount` partitioning by service_type, year, and month (see the sketch below)
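A minimal sketch, assuming BigQuery's `percentile_cont` window function and the columns named in the tasks above:

```sql
select distinct
    service_type,
    year,
    month,
    -- continuous percentiles over each (service_type, year, month) partition
    percentile_cont(fare_amount, 0.97) over (partition by service_type, year, month) as p97,
    percentile_cont(fare_amount, 0.95) over (partition by service_type, year, month) as p95,
    percentile_cont(fare_amount, 0.90) over (partition by service_type, year, month) as p90
from {{ ref('fct_taxi_trips') }}
where fare_amount > 0
  and trip_distance > 0
  and payment_type_description in ('Cash', 'Credit Card')
```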
Now, what are the values of `p97`, `p95`, and `p90` for Green Taxi and Yellow Taxi, in April 2020?
- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 52.0, p95: 37.0, p90: 25.5}
- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 31.5, p95: 25.5, p90: 19.0}
- green: {p97: 40.0, p95: 33.0, p90: 24.5}, yellow: {p97: 52.0, p95: 37.0, p90: 25.5}
- green: {p97: 40.0, p95: 33.0, p90: 24.5}, yellow: {p97: 31.5, p95: 25.5, p90: 19.0}
- green: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 52.0, p95: 25.5, p90: 19.0}
Prerequisites:

- Create a staging model for the FHV Data (2019). DO NOT add a deduplication step; just filter to the entries where `dispatching_base_num is not null` (a sketch follows this list)
- Create a core model for the FHV Data (`dim_fhv_trips.sql`) joining with `dim_zones`, similar to what has been done here
- Add some new dimensions `year` (e.g.: 2019) and `month` (e.g.: 1, 2, ..., 12), based on `pickup_datetime`, to the core model to facilitate filtering in your queries
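A minimal sketch of that staging filter (the source and table names are assumptions; match them to your own sources):

```sql
-- stg_fhv_tripdata.sql (hypothetical file name)
select *
from {{ source('raw_nyc_tripdata', 'ext_fhv') }}  -- source table name assumed
where dispatching_base_num is not null
```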
Now...

- Create a new model `fct_fhv_monthly_zone_traveltime_p90.sql`
- For each record in `dim_fhv_trips.sql`, compute the time difference, in seconds, between dropoff_datetime and pickup_datetime; we'll call it `trip_duration` for this exercise
- Compute the continuous `p90` of `trip_duration` partitioning by year, month, and pickup_location (see the sketch below)
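A minimal sketch of the `trip_duration` and `p90` computation, assuming BigQuery and that columns like `pickup_locationid` exist in `dim_fhv_trips`:

```sql
with trips as (
    select
        year,
        month,
        pickup_locationid,
        -- seconds between dropoff and pickup
        timestamp_diff(dropoff_datetime, pickup_datetime, second) as trip_duration
    from {{ ref('dim_fhv_trips') }}
)

select distinct
    year,
    month,
    pickup_locationid,
    percentile_cont(trip_duration, 0.90) over (
        partition by year, month, pickup_locationid
    ) as p90
from trips
```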
For the trips that respectively started from `Newark Airport`, `SoHo`, and `Yorkville East`, in November 2019, what are the dropoff zones with the 2nd longest p90 `trip_duration`?
- East Village, Chinatown, Garment District
- East Village, Park Slope, Clinton East
- East Village, Saint Albans, Howard Beach
- East Village, Rosedale, Bath Beach
- East Village, Yorkville East, Greenpoint
- Form for submitting: https://courses.datatalks.club/de-zoomcamp-2025/homework/hw4
- To be published after deadline