Clearing Dag Runs and its Impact on Reliability Metrics #33677

sungwy · 2023-08-23T21:24:10Z

sungwy
Aug 23, 2023

There have been some amazing efforts from the community recently on improving Airflow Metrics, including the introduction of metric tags.

These changes let users collect metrics in a meaningful way, so they can keep track of the health and status of the cluster, and perhaps even attach alarm rules to certain reliability metrics.

Here are some example reliability metrics that can be used as meaningful indicators of the Airflow cluster health, when monitoring periodic / scheduled dags:

Each of these metrics helps engineers understand whether Airflow is scheduling dagruns and tasks when they are meant to be scheduled. And if these metrics are published in a reliable way, we empower engineers to set up alarms when these metrics spike.

Unfortunately, we face an interesting problem when dag runs are cleared.

Dag Runs can be cleared on demand in order to initiate a re-run of the scheduled workflow, and engineers may opt to do so for various valid reasons. Maybe the source data was corrected and the workflow needs to be relaunched. Maybe the initial run failed, and the engineer simply wants to relaunch the whole workflow to avoid any issues. Regardless of the reason, the current implementation leads to these reliability metrics spiking, since there is no way for Airflow to know that a dag run was cleared manually and hence is not the initial scheduled dagrun that was launched without intervention.
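To make the spike concrete, here is a minimal sketch, assuming a delay-style metric computed as the gap between when a run was expected to start and when it actually started. The function name, the dates and the formula are illustrative assumptions, not Airflow's implementation.

```python
from datetime import datetime, timedelta, timezone


def schedule_delay(expected_start: datetime, actual_start: datetime) -> timedelta:
    """Hypothetical delay metric: how late a run started relative to its schedule."""
    return actual_start - expected_start


# A daily run that was expected to start at midnight UTC.
expected = datetime(2023, 8, 20, 0, 0, tzinfo=timezone.utc)

# Original scheduled run: picked up a few seconds after it was due.
original_start = expected + timedelta(seconds=5)

# The same run, cleared by an engineer three days later and started again.
rerun_start = expected + timedelta(days=3, seconds=5)

original_delay = schedule_delay(expected, original_start)
rerun_delay = schedule_delay(expected, rerun_start)

print(original_delay.total_seconds())  # 5.0
print(rerun_delay.total_seconds())     # 259205.0
```

Both values are computed against the same expected start, so an alarm threshold of a few minutes would fire on the re-run even though the scheduler behaved correctly.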

The inability to categorize these two cases separately (or to simply skip publishing these metrics for cleared dag runs) drastically reduces the value of these reliability metrics. When a reliability metric rule is triggered, it's important that it unambiguously means something is wrong and needs to be looked at. Having a caveat that says: 'the metric rule alarm means we need to take a look at the cluster, but it could also just be a false alarm because a dag run could have been cleared' makes it a much less reliable alarm.
As the old saying goes, users are less likely to pay attention to The Boy Who Cried Wolf. A metric that produces a lot of noise and false positives may be just as bad as, or worse than, not having the metric at all.

I have a couple of ideas that I think would address this issue and allow these reliability metrics to remain 'reliable'.

1. We could introduce a dagrun-specific parameter that disables periodic / reliability metric alarms. In conjunction, we could also introduce an optional configuration in airflow.cfg that lets engineers decide whether clearing dagruns should set this flag, and so disable the publication of periodic reliability metrics the next time the dagrun is updated to the RUNNING state.
2. Remove the unique constraint on dag_id, execution_date. Then, we could recommend that users who want a manual rerun submit a new MANUAL triggered dagrun for the same execution_date instead of clearing an existing dag_run. This would let us rely on the MANUAL run_type of the dagrun to avoid publishing the reliability metrics. However, removing a unique constraint is serious business, especially if Airflow or its users make assumptions based on it.
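The first idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `MetricsSettings`, `flag_cleared_runs`, `skip_reliability_metrics`, `on_clear` and `should_publish` are hypothetical names invented here, not existing Airflow options or APIs. The `run_type` check hints at how the second idea would exclude manual runs.

```python
from dataclasses import dataclass


@dataclass
class MetricsSettings:
    # Hypothetical airflow.cfg option: should clearing a run set the flag?
    flag_cleared_runs: bool = True


@dataclass
class Run:
    dag_id: str
    run_type: str = "scheduled"
    # Hypothetical per-run parameter that disables reliability metrics.
    skip_reliability_metrics: bool = False


def on_clear(run: Run, settings: MetricsSettings) -> None:
    """Called when a user clears the run; sets the flag only if configured to."""
    if settings.flag_cleared_runs:
        run.skip_reliability_metrics = True


def should_publish(run: Run) -> bool:
    """Decide whether reliability metrics are published when the run starts."""
    return run.run_type == "scheduled" and not run.skip_reliability_metrics


run = Run(dag_id="daily_etl")
print(should_publish(run))  # True: untouched scheduled run

on_clear(run, MetricsSettings(flag_cleared_runs=True))
print(should_publish(run))  # False: cleared run is excluded
```

Keeping the decision in one predicate means the opt-in configuration only affects what happens at clear time, not every place a metric is emitted.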

Answered by potiuk

potiuk · 2023-08-24T05:02:07Z

potiuk
Aug 24, 2023
Collaborator

I think adding a tag or otherwise separating a cleared run from a regular run is a good idea. PRs are always welcome, and if you and your team would like to contribute it, you are most welcome. Airflow is created by 2500 users, and metrics is one of the areas where I think having more people contribute is a good idea, especially if they are vitally interested in improving it and can reason about and justify their proposal by checking the effect it would have on their metrics.

Note the OpenTelemetry work in AIP-49 (https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow). There are plans to introduce OpenTelemetry traces in addition to metrics, so what I strongly suggest is basing your idea on the OpenTelemetry implementation of metrics rather than the statsd one. This gives much more flexibility and adds many more capabilities for labelling, distinguishing and otherwise grouping various metrics, so your idea about adding ways of distinguishing cleared from regular runs in metrics likely fits very well with the overall direction AIP-49 has set for Airflow's metrics.

sungwy · 2023-08-24T16:07:07Z

sungwy
Aug 24, 2023
Author

@potiuk thank you for the thoughtful answer, and glad to hear that you are on board with enabling Airflow and its users to distinguish the original scheduled run from cleared dag runs.

As always, I'm more than happy to contribute 👍

I have some follow up thoughts on how we could implement Idea (1) to add a tag. Currently, Airflow supports clearing Dag runs in one of three ways:

REST API endpoint clear
UI button clearExistingTasks
Task Clear cli command

All three of these methods ultimately invoke the dag.clear method and put the Dag Run back in the DagRunState.QUEUED state. There is currently no way for the scheduler to know whether a queued Dag Run is its original scheduled run or has been cleared.
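A toy model of that flow, assuming nothing beyond what is described above: the classes and the three entry-point functions are hypothetical stand-ins for the REST API, the UI and the CLI, not Airflow code. It shows why a cleared run is indistinguishable from one that has never run.

```python
from dataclasses import dataclass
from enum import Enum


class RunState(str, Enum):
    QUEUED = "queued"
    SUCCESS = "success"


@dataclass
class Run:
    run_id: str
    state: RunState = RunState.QUEUED


def clear(run: Run) -> None:
    """Stand-in for the shared clear path: the run just goes back to QUEUED."""
    run.state = RunState.QUEUED


# Hypothetical entry points standing in for the REST API, the UI and the CLI.
def clear_via_rest_api(run: Run) -> None:
    clear(run)


def clear_via_ui(run: Run) -> None:
    clear(run)


def clear_via_cli(run: Run) -> None:
    clear(run)


never_run = Run("scheduled__2023-08-20")
rerun = Run("scheduled__2023-08-20", state=RunState.SUCCESS)
clear_via_ui(rerun)

# Nothing on the record distinguishes the cleared run from one that never ran.
print(rerun == never_run)  # True
```

Because every entry point funnels into the same function, that shared path is also the natural single place to record that a clear happened.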

For the scheduler that picks up the newly queued Dag Run to know that it was cleared, I think we will need to persist that information as a boolean flag in the database. I think we have two options here:

1. We could introduce a new boolean flag on Dag Run named 'cleared'. If a user clears the entire dagrun (rather than a subset of it), we can set this flag and use it to tag the metric, or to disable the metric completely.
2. There is an existing boolean flag named 'external_trigger' that I feel is a bit redundant, given that we have the DATASET_TRIGGERED and MANUAL values of DagRun.run_type. Maybe we could repurpose this flag to mean that a human has intervened on the state of this dagrun? This would let us avoid introducing another flag, but I understand that 'trigger' is already a very loaded term in Airflow, and I wonder whether it would be wise to have it mean something else.
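A minimal sketch of the first option, assuming the flag is set only when every task in the run is cleared and is then surfaced as a metric tag. The `Run` class, `clear_tasks` and `metric_tags` are hypothetical illustrations rather than Airflow's model or metrics API.

```python
from dataclasses import dataclass


@dataclass
class Run:
    dag_id: str
    task_ids: frozenset
    cleared: bool = False  # the proposed new flag


def clear_tasks(run: Run, task_ids: set) -> None:
    """Mark the run as cleared only when every task in it is being cleared."""
    if set(task_ids) >= run.task_ids:
        run.cleared = True


def metric_tags(run: Run) -> dict:
    """Tags to attach to a reliability metric emitted for this run."""
    return {"dag_id": run.dag_id, "cleared": str(run.cleared).lower()}


run = Run(dag_id="daily_etl", task_ids=frozenset({"extract", "load"}))

clear_tasks(run, {"extract"})  # partial clear: flag stays unset
print(metric_tags(run))  # {'dag_id': 'daily_etl', 'cleared': 'false'}

clear_tasks(run, {"extract", "load"})  # whole run cleared
print(metric_tags(run))  # {'dag_id': 'daily_etl', 'cleared': 'true'}
```

Tagging rather than suppressing keeps the data point available, so an alarm rule can filter on the tag while dashboards can still show re-runs.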

Do either of these sound like good options to you?

potiuk Aug 24, 2023
Collaborator

I'd say a new boolean flag is good. And before diving too deep into a discussion, I think this is a small enough feature that the best way to discuss it is to start a PR with draft changes implementing it, and ask for review and opinions there. Discussing over code is always better than "in a void" - except for big feature discussions, which should happen on the devlist and later as Airflow Improvement Proposals. Here I think the investment in making a draft based on your proposal 1 would be well spent, as you will get more familiar with the Airflow code and further discussions will be more productive, even if the discussion shows that we need to change the approach. Often discussions before writing a draft PR/code are counter-productive for such changes, because after you start implementing it, it will turn out that we have not thought about something, or that something is not feasible.

I suggest going that route.

sungwy Aug 24, 2023
Author

Sounds good 💯 Thank you for the engagement @potiuk. I will put up a PR some time this week.
