Replies: 6 comments 11 replies
-
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; no need to wait for approval.
-
Is this not currently possible?

Outlet DAG:

```python
from pendulum import datetime

from airflow.decorators import dag, task
from airflow.datasets import Dataset


@dag(
    schedule=None,
    start_date=datetime(2023, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,  # If a task fails, it will retry 2 times.
    },
    tags=["example"],  # If set, this tag is shown in the DAG view of the Airflow UI.
)
def outlet():
    @task(outlets=[Dataset("outlet")])
    def write_xcom(ti=None):
        ti.xcom_push(key="outlet_xcom", value="xyz")

    write_xcom()


outlet()
```

Inlet DAG:

```python
from pendulum import datetime

from airflow.decorators import dag, task
from airflow.datasets import Dataset


@dag(
    schedule=[Dataset("outlet")],
    start_date=datetime(2023, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,  # If a task fails, it will retry 2 times.
    },
    tags=["example"],
)
def inlet():
    @task()
    def read_xcom(ti=None):
        xcoms = ti.xcom_pull(
            dag_id="outlet",
            task_ids="write_xcom",
            key="outlet_xcom",
            include_prior_dates=True,
        )
        print(f"xcoms: {xcoms}")

    read_xcom()


inlet()
```
-
I never tested this myself, but the code looks good. If this really is the way it works, then I still see high demand for adding this to the docs as an example, and maybe the example dataset DAGs shipped with Airflow should include it in their code :-D
-
Well… this does seem like a reasonable approach, but it tightly couples the two DAGs. Some colleagues are looking to version their DAGs by essentially adjusting the name of the file during deployment, and using the approach you suggested would not allow decoupling/loose-coupling of the producer and consumer DAG through the dataset.

But it DOES make me think that maybe the feature would be to carry the "triggering information" along with dataset triggers. If the consumer DAG (inlet, in your case) could be told the name of the DAG and/or dataset (if multiple) that triggered it, then it could ask for XCom information by dag_id.

However, it makes me wonder: what if the producer ran multiple times before the downstream got its trigger? Not sure how often that would be the case, but would the above solution allow multiple separate triggers to carry their own XCom information related to their work? Conceptually, you want to get an event about each dataset write describing what happened, and I feel like using XCom has a chance of race conditions or skipped events: if you go read it, does it just always get the "last value"?
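To make the "last value" concern concrete, here is a toy, non-Airflow sketch. `XcomStore` and its methods are invented for illustration; the point is that a pull keyed only by dag/task/key sees whatever was pushed last, so intermediate producer runs are effectively invisible:

```python
# Toy model (NOT Airflow API): pushes for the same (dag_id, task_id, key)
# slot overwrite each other, so a consumer that pulls later sees only the
# most recent value, not one value per producer run.

class XcomStore:
    def __init__(self):
        self._store = {}  # (dag_id, task_id, key) -> value

    def push(self, dag_id, task_id, key, value):
        # Each new push replaces the previous value for this slot.
        self._store[(dag_id, task_id, key)] = value

    def pull(self, dag_id, task_id, key):
        return self._store.get((dag_id, task_id, key))


store = XcomStore()

# The producer ("outlet") runs twice before the consumer is triggered.
store.push("outlet", "write_xcom", "outlet_xcom", "run-1-output")
store.push("outlet", "write_xcom", "outlet_xcom", "run-2-output")

# The consumer ("inlet") only ever sees the latest value; the first
# run's output is gone, i.e. that event was effectively skipped.
print(store.pull("outlet", "write_xcom", "outlet_xcom"))  # run-2-output
```

Under this model there is no race in the strict sense, but any producer run that is overwritten before the consumer reads is silently lost.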
-
I just tried this. The outlet task pushes its XCom, but I get varying results across several repeats. So this workaround mechanism does not seem very predictable as an "event" mechanism to tie two DAGs together via a dataset + XCom.
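One plausible explanation for the varying results: with dataset scheduling, multiple producer updates that land before the consumer actually runs get coalesced into a single consumer run. A toy sketch of that coalescing (all names here are invented, not Airflow API):

```python
# Toy model (NOT Airflow code): producer updates to a dataset accumulate
# in a pending queue, and the consumer gets ONE run that covers however
# many updates are pending, draining the queue in a single shot.

class DatasetQueue:
    def __init__(self):
        self.pending = []

    def producer_update(self, payload):
        self.pending.append(payload)

    def maybe_trigger_consumer(self):
        # A single consumer run is created for ALL pending updates.
        if self.pending:
            events, self.pending = self.pending, []
            return events
        return None  # nothing pending -> no new consumer run


q = DatasetQueue()
q.producer_update("2023-01-01")
q.producer_update("2023-01-02")  # second update lands before the consumer ran

run_events = q.maybe_trigger_consumer()
print(run_events)  # both updates collapsed into one consumer run
print(q.maybe_trigger_consumer())  # no further run until the next update
```

Depending on how the timing falls, the consumer's XCom pull can therefore see one value, the latest of several, or none that match the run it "expected", which would explain the unpredictable repeats.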
-
I just tried another tack, but found another shortcoming when setting up a consumer DAG.

I mean… since this is now a discussion, I really just want to learn/decide what a good, efficient, connected mechanism/pattern would be for having daily batch jobs as DAGs that can have multiple downstream things connected by some fixed/constant thing (a la dataset string), but that also only do "their work", i.e. the work related to the upstream job that wrote the dataset/outlet.
-
Description
Provide a mechanism to pass data (XCom?) so that downstream DAGs could know more context about how/why/when/by-what they were triggered.
Use case/motivation
In order to avoid writing monolithic DAGs, it would seem useful to have separate DAGs focused on discrete input and output transforms, which would also allow them to be retried/rescheduled as needed. One could imagine daily batch processing composed of several DAGs, using the dataset mechanism as a way to trigger them efficiently. However, it seems that no information comes along with a dataset passed in each DAG's "schedule". If several days of daily tasks are (re-)scheduled, the outlet of a dataset would not be able to communicate to downstream DAGs which "datestamp" they should process.
As of now the dataset is just a string and, when loosely coupling a producer/consumer via the Dataset, there is no way to communicate specific information about the producer's exact output. There also doesn't appear to be a way to mix and match scheduling based on a dataset as well as `@daily`, e.g., so there's no way to connect a particular day's producer DAG with a consumer DAG.
If a task could query its lineage and specifically get data / XCom information from the DAG/task/Dataset that triggered it, then it could take efficient actions based on the previous task's specific output location (i.e. its datestamp directory if that's the convention, but it could be anything, really, if a general way of passing/receiving data were provided).
Related issues
No response