Exception in _schedule_dag_run() due to misconfigured task instance can completely crash scheduling loop #47042
Unanswered
karenbraganz asked this question in General
One of Astronomer's customers experienced an issue in which the scheduling loop crashed because of the way a single task was configured. This affected all running DAGs, since the scheduler kept crashing and could not complete its normal functions. A misconfigured DAG or task instance should not be able to break the scheduling loop and impact other DAGs. Instead, such exceptions should be caught, allowing the scheduling loop to continue for other DAGs.
I have not been able to reproduce the issue yet, but I am starting this discussion to document the details and discuss them with the Airflow community.
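To make the proposal concrete, here is a minimal sketch of per-run exception isolation. It is not Airflow's actual scheduler code: `schedule_all`, `schedule_one`, and the `(dag_id, run_id)` pairs are hypothetical stand-ins, and a real fix would still have to decide how to mark or surface the failing DAG run.

```python
import logging

log = logging.getLogger(__name__)


def schedule_all(dag_runs, schedule_one):
    """Schedule every run, isolating a failure to the run that raised it.

    `dag_runs` is any iterable of (dag_id, run_id) pairs, and `schedule_one`
    is a hypothetical stand-in for the per-run scheduling step.
    Returns the runs that were skipped because scheduling raised.
    """
    skipped = []
    for dag_id, run_id in dag_runs:
        try:
            schedule_one(dag_id, run_id)
        except Exception:
            # Log the full traceback, then carry on with the remaining runs.
            log.exception("Failed to schedule dag_id=%s run_id=%s", dag_id, run_id)
            skipped.append((dag_id, run_id))
    return skipped
```

With this shape, one bad run costs a log entry and a skipped run rather than the whole loop.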
When the issue occurred, we noticed that the scheduler was crashlooping with the below traceback logged:
The `_schedule_dag_run` method in the traceback suggested that this originated from the scheduling of a specific DAG run.
To isolate which DAG run was causing this, we ran the following code in the standalone DAG processor, which was not crashlooping:
This code loops through all running DAGs and calls the `task_instance_scheduling_decisions()` method (which is downstream of `_schedule_dag_run` in the traceback) for each DAG run. It also prints the DAG ID and run ID in each iteration of the loop so that we can identify which DAG is responsible when the issue arises during the loop.
When we ran this code in the DAG processor, we saw the same traceback, and the print statement helped us identify which DAG was causing it. As soon as we manually marked this running DAG as failed and paused the DAG, the issue was resolved (the scheduler stopped crashing and was able to schedule other DAG runs), confirming that it was indeed that DAG that was breaking the scheduling loop.
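The snippet itself did not survive in this copy of the post, so what follows is only a sketch of a loop of the kind described, not the code that was actually run. The helper names `find_failing_dag_runs` and `check_running_dag_runs` are hypothetical. The Airflow wiring assumes an Airflow 2.x environment with `DagBag(read_dags_from_db=True)`, `DagRun.find(state=...)`, and `DagRun.task_instance_scheduling_decisions()`, and has not been verified against the affected deployment.

```python
import traceback


def find_failing_dag_runs(dag_runs, get_dag):
    """Call task_instance_scheduling_decisions() on each run and collect failures."""
    failures = []
    for dag_run in dag_runs:
        # Print before the call so the last line printed names the culprit
        # even if the process dies instead of raising.
        print(f"Checking dag_id={dag_run.dag_id} run_id={dag_run.run_id}", flush=True)
        try:
            # The DAG object must be attached before scheduling decisions are made.
            dag_run.dag = get_dag(dag_run.dag_id)
            dag_run.task_instance_scheduling_decisions()
        except Exception:
            failures.append((dag_run.dag_id, dag_run.run_id, traceback.format_exc()))
    return failures


def check_running_dag_runs():
    """Airflow wiring (assumed Airflow 2.x API; imports kept local on purpose)."""
    from airflow.models import DagBag, DagRun
    from airflow.utils.state import DagRunState

    dagbag = DagBag(read_dags_from_db=True)
    running = DagRun.find(state=DagRunState.RUNNING)
    return find_failing_dag_runs(running, dagbag.get_dag)
```

Unlike a bare loop, this version catches the exception and keeps going, so it reports every offending DAG run in one pass instead of stopping at the first.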
We were also able to isolate the task instance causing this with the below code:
The task instance is configured like below (I have swapped out the task ID and some of the other names):
We suspect it is either the `python_callable` or `executor_config` that is causing this but have not found the root cause. The `executor_config` returns the deepcopy of a dictionary that contains several keys besides the `pod_override` key. This is not how `executor_config` should be defined. However, simply defining `executor_config` in this manner does not trigger the issue (according to tests that I ran).
These are all the details we have so far. Please respond if you have any thoughts or questions, or have experienced a similar issue.
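For anyone who wants to look for similar configurations in their own DAGs, a small check could flag top-level `executor_config` keys other than `pod_override`. This is a hypothetical helper, not part of Airflow, and the example keys are made up. It assumes `pod_override` is the only top-level key expected, and since extra keys alone did not reproduce the crash, it would only narrow down candidates.

```python
def unexpected_executor_config_keys(executor_config, allowed=("pod_override",)):
    """Return the top-level keys of executor_config that are not in `allowed`, sorted."""
    return sorted(set(executor_config) - set(allowed))


# Hypothetical config resembling the one described: extra keys next to pod_override.
suspicious = {"pod_override": object(), "labels": {}, "namespace": "example"}
print(unexpected_executor_config_keys(suspicious))  # ['labels', 'namespace']
```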