@Hui-Cheng-AirBnb Hui-Cheng-AirBnb commented Sep 29, 2025

Summary

Add and validate an offline GroupBy option for external sources. This is the first part of the work to support backfill for external sources. This PR:

  1. Adds an optional offline GroupBy to ExternalSource.
  2. Validates that the GroupBy schema is compatible with the external source.
  3. Adds the offline GroupBy to the join parts used for offline backfill.
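
As a rough illustration of the schema check in step 2, here is a minimal Python sketch. It is not the validation code in this PR: the function names, the flat name-to-type dict representation, and the rule that every external-source field must appear in the GroupBy with an identical type are all assumptions made for the example.

```python
from typing import Dict, List


def schema_mismatches(external: Dict[str, str],
                      group_by: Dict[str, str],
                      label: str) -> List[str]:
    """Return one message per external-source field that the GroupBy
    does not provide with the same type (assumed compatibility rule)."""
    errors = []
    for name, dtype in external.items():
        if name not in group_by:
            errors.append(f"{label} field '{name}' is missing from the offline GroupBy")
        elif group_by[name] != dtype:
            errors.append(
                f"{label} field '{name}' has type {group_by[name]} "
                f"in the offline GroupBy, expected {dtype}"
            )
    return errors


def validate_offline_group_by(ext_keys: Dict[str, str],
                              ext_values: Dict[str, str],
                              gb_keys: Dict[str, str],
                              gb_values: Dict[str, str]) -> List[str]:
    """Hypothetical helper: check key and value schemas separately."""
    return (schema_mismatches(ext_keys, gb_keys, "key")
            + schema_mismatches(ext_values, gb_values, "value"))


# Example: the value column has a different type in the GroupBy.
errors = validate_offline_group_by(
    ext_keys={"user_id": "string"},
    ext_values={"score": "double"},
    gb_keys={"user_id": "string"},
    gb_values={"score": "long"},
)
```

An empty list means the two schemas line up under this assumed rule; the real check may be looser (for example, allowing extra GroupBy columns is already permitted here, but type widening is not).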

Why / Goal

We want to introduce a first-class abstraction that lets an external source support offline backfill. An ExternalSource can now take an optional GroupBy definition; during offline backfill, Chronon dispatches to that GroupBy.
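
The dispatch idea can be sketched roughly as follows. This is an illustrative Python model only, not Chronon's API: `ExternalPart`, `offline_group_by`, and `backfill_join_parts` are hypothetical names, and the sketch assumes that external parts without an offline GroupBy simply contribute nothing to the offline backfill.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExternalPart:
    """Hypothetical stand-in for an external part of a join."""
    name: str
    # Name of the GroupBy to use offline; None means no offline backfill.
    offline_group_by: Optional[str] = None


def backfill_join_parts(join_parts: List[str],
                        external_parts: List[ExternalPart]) -> List[str]:
    """Return the parts to compute during offline backfill: the regular
    join parts plus every offline GroupBy attached to an external part."""
    offline = [p.offline_group_by for p in external_parts
               if p.offline_group_by is not None]
    return list(join_parts) + offline


parts = backfill_join_parts(
    ["user_features"],
    [ExternalPart("ext_with_gb", "ext_offline_gb"), ExternalPart("ext_online_only")],
)
```

Here `parts` contains the regular join part followed by the one offline GroupBy, while the online-only external part is left out of the backfill.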

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested

========= Test configs =======
Test configs are at https://git.musta.ch/airbnb/ml_models/pull/27812

========= Export schema with External offline groupby =========

python \
    run_on_bigair.py run \
    production/joins/zipline_test/test_online_join_external.v2 \
    --envs "VERSION:hui-cheng-ext-gb-1-0.0.110-SNAPSHOT" \
    -- \
    --ds=2025-10-12 \
    --mode=analyze \
    --skip-timestamp-check \
    --skip-table-permission-check \
    --export-schema

========= Backfill join with External offline groupby ===========

python \
    run_on_bigair.py run \
    production/joins/zipline_test/test_online_join_external.v2 \
    --envs "VERSION:hui-cheng-ext-gb-1-0.0.110-SNAPSHOT" \
    -- \
    --ds=2025-10-12 \
    --mode=backfill

https://superset.a.musta.ch/sqllab/p/Oa8yamlbN8K/

======== Backfill join WITHOUT External offline groupby ========

python \
    run_on_bigair.py run \
    production/joins/zipline_test/test_online_join_small.v2 \
    --envs "VERSION:hui-cheng-ext-gb-1-0.0.110-SNAPSHOT" \
    -- \
    --ds=2025-10-12 \
    --mode=backfill

https://superset.a.musta.ch/sqllab/p/gKx6rAQ2kxz/

Compare with the table backfilled by the official Chronon build:
https://superset.a.musta.ch/sqllab/p/4RJ9e9zBwJA/

========= Export schema WITHOUT External offline groupby =====

python \
    run_on_bigair.py run \
    production/joins/zipline_test/test_online_join_small.v2 \
    --envs "VERSION:hui-cheng-ext-gb-1-0.0.110-SNAPSHOT" \
    -- \
    --ds=2025-10-12 \
    --mode=analyze \
    --skip-timestamp-check \
    --skip-table-permission-check \
    --export-schema

Check that the schema matches the schema exported from the official package:
https://superset.a.musta.ch/sqllab/p/6B3NRO5A13R/

Checklist

  • Documentation update

Reviewers

@Hui-Cheng-AirBnb changed the title from "Add and validate offline groupby option for external source" to "Add offlineGroupBy option for external source" on Oct 1, 2025

hzding621 commented Oct 17, 2025

We may also need to update:

This logic is used in join backfill DAGs, where the join_parts run in parallel.

Today, production runs use https://github.com/airbnb/chronon/blob/main/spark/src/main/scala/ai/chronon/spark/JoinBase.scala#L477, which this PR already handles. For dev runs, we use join_backfill.py to orchestrate a fine-grained DAG in which computeLeft, computeJoinPartOpt, and computeFinalJoin run as separate Spark jobs. We have a unit test for this scenario in JoinFlowTest.scala.
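
To make the dev-run shape concrete, here is a small Python sketch of such a DAG. It is not the code in join_backfill.py: `build_backfill_dag` and the task-name format are hypothetical, and the sketch assumes each offline GroupBy from an external source gets its own computeJoinPartOpt task alongside the regular join parts.

```python
from typing import Dict, List


def build_backfill_dag(join_parts: List[str],
                       offline_group_bys: List[str]) -> Dict[str, List[str]]:
    """Map each task to its upstream tasks.

    computeLeft runs first, one computeJoinPartOpt task per part can then
    run in parallel, and computeFinalJoin waits for all of them.
    """
    dag: Dict[str, List[str]] = {"computeLeft": []}
    part_tasks = []
    for part in list(join_parts) + list(offline_group_bys):
        task = f"computeJoinPartOpt:{part}"
        dag[task] = ["computeLeft"]
        part_tasks.append(task)
    # With no parts at all, the final join only depends on the left side.
    dag["computeFinalJoin"] = part_tasks or ["computeLeft"]
    return dag


dag = build_backfill_dag(["gb_a"], ["ext_offline_gb"])
```

In this model, forgetting to pass the offline GroupBys would drop their tasks from the DAG, which is the kind of gap the dev-run path needs to cover.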

Hui-Cheng-AirBnb commented Oct 17, 2025

Thanks @hzding621, I've made a change and added unit tests. Let me know if you think we should do a test dev run (if so, please share instructions for the dev run).
