-
There have been some amazing efforts from the community recently to improve Airflow metrics, including the introduction of metric tags. These changes let users collect metrics in a meaningful way, keep track of the health and status of the cluster, and even attach alarm rules to certain reliability metrics.

Here are some example reliability metrics that can be used as meaningful indicators of Airflow cluster health when monitoring periodic / scheduled DAGs:

Each of these metrics helps engineers understand whether Airflow is scheduling DAG runs and tasks when they are meant to be scheduled. If these metrics are published reliably, engineers can set up alarms that fire when the metrics spike.

Unfortunately, we face an interesting problem when DAG runs are cleared. DAG runs can be cleared on demand to initiate a re-run of the scheduled workflow, and engineers may do so for various valid reasons. Maybe the source data was corrected and the workflow needs to be relaunched. Maybe the initial run failed, and the engineer simply wants to relaunch the whole workflow to avoid any issues. Regardless of the reason, the current implementation causes these reliability metrics to spike, because Airflow has no way to know that a DAG run was cleared manually and is therefore not the initial scheduled run that was launched without intervention.

The inability to categorize these two cases separately (or simply to skip publishing these metrics for cleared DAG runs) drastically reduces the value of these reliability metrics. When a reliability metric rule is triggered, it must positively mean that something is wrong and needs to be looked at. A caveat such as 'the alarm means we need to take a look at the cluster, but it could also just be a false alarm because a DAG run could have been cleared' makes it a much less reliable alarm.

I have a couple of ideas that I think would address this issue and allow these reliability metrics to remain 'reliable'.
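To make the false-alarm problem concrete, here is a minimal sketch in plain Python. It does not use Airflow's metrics API. The `MetricPoint` class, the `cleared` tag, the `should_alarm` helper, the metric name, and the 300-second threshold are all assumptions made up for illustration. It only shows how a tag would let an alarm rule ignore data points that come from cleared DAG runs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MetricPoint:
    """One published data point (hypothetical stand-in, not an Airflow class)."""
    name: str
    value: float
    tags: dict[str, str] = field(default_factory=dict)


def should_alarm(points, name, threshold_seconds):
    """Return True if any non-cleared data point for `name` exceeds the threshold."""
    return any(
        p.value > threshold_seconds
        for p in points
        if p.name == name and p.tags.get("cleared") != "true"
    )


points = [
    # Original scheduled run: started 4 seconds after it was due.
    MetricPoint("dagrun.schedule_delay", 4.0, {"dag_id": "etl", "cleared": "false"}),
    # The same run cleared two days later: the measured delay is huge but harmless.
    MetricPoint("dagrun.schedule_delay", 172800.0, {"dag_id": "etl", "cleared": "true"}),
]

# Without a tag, the alarm rule cannot tell the two cases apart.
untagged_alarm = any(p.value > 300 for p in points)
# With the hypothetical `cleared` tag, the rule can skip cleared runs.
tagged_alarm = should_alarm(points, "dagrun.schedule_delay", 300)

print(untagged_alarm, tagged_alarm)
```

In this sketch the untagged rule fires on the cleared run, while the tagged rule stays quiet. Skipping publication entirely for cleared runs would have the same effect on the alarm, at the cost of losing visibility into re-runs.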
-
I think adding a tag, or otherwise separating a cleared run from a regular run, is a good idea. PRs are always welcome, and if you and your team would like to contribute it, you are most welcome. Airflow is created by 2500 users, and metrics is one of the areas where I think having more people contribute is a good idea, especially if they are vitally interested in improving it and can reason about and justify their proposal by checking the effect it would have on their metrics.

Note the OpenTelemetry work: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow. There are plans to introduce OpenTelemetry traces in addition to metrics, so what I strongly suggest is basing your idea on the OpenTelemetry implementation of metrics rather than the statsd one. This gives much more flexibility and adds many more capabilities for labelling, distinguishing, and otherwise grouping metrics, so your idea of adding a way to distinguish cleared runs from regular runs in metrics likely fits very well with the overall direction AIP-49 has set for Airflow's metrics.
-
@potiuk thank you for the thoughtful answer. I'm glad to hear that you are on board with enabling Airflow and its users to distinguish the original scheduled run from cleared DAG runs. As always, I'm more than happy to contribute 👍

I have some follow-up thoughts on how we could implement Idea (1), adding a tag. Currently, Airflow supports clearing DAG runs in one of three ways:

All three of these methods ultimately invoke the dag.clear method and put the DAG run back into the DagRunState.QUEUED state. There is currently no way for the scheduler to know whether a queued DAG run is its original scheduled run or has been cleared. For the scheduler that picks up the newly queued DAG run to know that it was cleared, I think we will need to persist that information as a boolean flag in the database. I think we have two options here:

Does either of these sound like a good option to you?
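For illustration only, the persisted-flag idea could look roughly like the sketch below. It is not Airflow code. `RunState`, `DagRunRow`, `clear_run`, `metric_tags`, and the `cleared` column are hypothetical names, and a plain dataclass stands in for a database row. The point is just that the clear path sets the flag when it re-queues the run, and whoever picks the run up later can turn the flag into a metric tag.

```python
from dataclasses import dataclass
from enum import Enum


class RunState(str, Enum):
    """Simplified run states for this sketch (not Airflow's DagRunState)."""
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"


@dataclass
class DagRunRow:
    """Stand-in for one persisted DAG run record (not Airflow's real model)."""
    dag_id: str
    run_id: str
    state: RunState = RunState.QUEUED
    cleared: bool = False  # hypothetical persisted boolean flag


def clear_run(row: DagRunRow) -> None:
    """Re-queue a run and record that it was re-queued by a manual clear."""
    row.state = RunState.QUEUED
    row.cleared = True


def metric_tags(row: DagRunRow) -> dict:
    """Tags a scheduler could attach when it picks up a queued run."""
    return {"dag_id": row.dag_id, "cleared": str(row.cleared).lower()}


run = DagRunRow("etl", "scheduled__2023-01-01")
first_tags = metric_tags(run)   # original scheduled run

run.state = RunState.FAILED     # the first attempt fails
clear_run(run)                  # an engineer clears it to re-run
second_tags = metric_tags(run)  # same run, now marked as cleared

print(first_tags, second_tags)
```

Because the flag lives with the run record rather than in scheduler memory, any scheduler that picks up the queued run would see the same value.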