Replies: 14 comments
-
Please provide some evidence - excerpts of the log files, DAG examples - something that gives a minimal reproducible example of what you observe. Without it, you are asking others to spend a lot of time trying to guess and reproduce what you see. By providing those details you increase the chances that someone will actually take a look, spend time, and attempt to diagnose it. Without them, the chances are slim.
-
Well, it is pretty difficult to do so. I can provide logs. First task try:
Second task try:
-
It is even more difficult to help without it. But keep trying: maybe it will be enough for someone to decide to spend their time on it and see if there is enough information for them to help.
-
If you describe the circumstances and provide more detailed information (DAGs etc.), your chances go up.
-
DAG default arguments:

```python
default_args = {
    'owner': 'some-team',
    'depends_on_past': False,
    'start_date': datetime.datetime(2023, 3, 31),
    'sla': datetime.timedelta(hours=3, minutes=30),
    'execution_timeout': datetime.timedelta(hours=3),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': datetime.timedelta(minutes=1),
    'retry_exponential_backoff': True,
    'max_retry_delay': datetime.timedelta(minutes=10),
    'max_active_tis_per_dag': 8,
    'pool': 'SomePool',
}
```

Using a regular PythonOperator with a function that uses …
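For reference, a minimal sketch of how a DAG like this might be assembled with these defaults, assuming a long-running callable inside a TaskGroup (matching the reproduction notes later in the thread); the dag_id, group, task, and function names are invented for illustration, and only a subset of the quoted default_args is repeated here:

```python
import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup


def long_running_work():
    # Placeholder for the real callable; assumed to run for a long time
    # (the reported first attempt ran for roughly two hours).
    ...


# Subset of the default_args quoted above that matter most for retries/timeouts;
# the full dict from the comment is assumed in the real DAG.
default_args = {
    'start_date': datetime.datetime(2023, 3, 31),
    'execution_timeout': datetime.timedelta(hours=3),
    'retries': 3,
    'retry_delay': datetime.timedelta(minutes=1),
    'retry_exponential_backoff': True,
    'max_retry_delay': datetime.timedelta(minutes=10),
}

with DAG(
    dag_id="example_dag",      # hypothetical name
    default_args=default_args,
    schedule=None,             # the real schedule is not shown in the report
    catchup=False,
) as dag:
    with TaskGroup(group_id="example_group") as example_group:
        PythonOperator(
            task_id="long_task",
            python_callable=long_running_work,
        )
```

Nothing in a layout like this should, by itself, allow a second try to start while the first is still running; the retry-related defaults only come into play after the first try is recorded as failed.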
-
From the logs themselves, reading the timestamps, I can confirm that the tries did run in parallel, which is not desired. But with this information alone it is not possible to find the root cause. As I was also debugging deep in the scheduler code, I very much assume some side effect is causing this. Without understanding the root cause it is hard to think about how to fix such concurrency in a distributed system without a central lock (which is a design feature, not a flaw). I have some more questions; the answers might help find the root cause:
-
Never, ever, under ANY circumstances use the SequentialExecutor for anything other than quick testing / working with SQLite. Please DO NOT open issues saying that your tasks do not work with it for anything that even remotely resembles production. You have enough warnings about it in the documentation, in the logs, and even a warning displayed prominently in the Airflow UI telling you not to do it. You are shooting yourself in the foot by ignoring all those warnings, AND you are completely wasting the time of people who want to help others here with their real issues. Please stop doing it and ignoring the very clear warnings you have.
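If there is any doubt about which executor a deployment is actually using, a quick way to check from the environment itself is the standard configuration API; a minimal sketch:

```python
# Prints the configured executor, e.g. "LocalExecutor", "CeleryExecutor"
# or "SequentialExecutor".
from airflow.configuration import conf

print(conf.get("core", "executor"))
```

The `airflow config get-value core executor` CLI command gives the same answer without writing any Python.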
-
Sorry, my bad. We have …
-
Please then be careful when you provide answers. The accuracy of your answers determines whether the people trying to help you will spend more or less of their free time actually trying to help you. Please take that into account, and don't be surprised when wrong answers cut that time investment short. Also, this very much seems like an environment problem on your side (that's why I converted it into a discussion). Your answers still seem quite inconsistent even after you corrected the Sequential executor part. You said you have 3 schedulers and 1 worker, but then you write that you don't use Celery but the Sequential (now Local) executor. That does not add up: there are no workers when there is no Celery executor. Maybe your configuration description and answers contain other flaws? For now, can you confirm that you are using 3 schedulers and the LocalExecutor? Can you check other traces and see whether the potential hypotheses might be right:
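One way to verify how many schedulers are really active is to look at the `job` table in the metadata database, where every scheduler records heartbeats. A rough sketch, assuming a Python environment with access to the metadata DB (column contents such as `executor_class` can vary between Airflow versions):

```python
from airflow.utils.session import create_session
from sqlalchemy import text

with create_session() as session:
    rows = session.execute(
        text(
            "SELECT hostname, executor_class, latest_heartbeat "
            "FROM job "
            "WHERE job_type = 'SchedulerJob' AND state = 'running'"
        )
    ).fetchall()

print(f"{len(rows)} scheduler(s) currently reporting as running:")
for hostname, executor_class, latest_heartbeat in rows:
    # executor_class may be empty depending on the Airflow version.
    print(f"  {hostname}  executor={executor_class}  last heartbeat={latest_heartbeat}")
```

The `airflow jobs check --job-type SchedulerJob --allow-multiple` CLI provides a similar health view, if it is available in your version.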
-
What I am not sure about and never thought about: when we set up multiple schedulers and workers, I always think of using Celery or something similar, which provides a unique work queue. Do I understand it right that there are 3 schedulers involved with local executors, so we have 3 schedulers and each carries its own LocalExecutor?
-
Generally I don't see why it should not be supported. I've always "toyed" with the idea that this is a nice way of having scalable, distributed scheduling without all the complexity of the Celery/K8s executor. There is nothing "essentially" preventing it, though there might be some cases where it shows problems. This is currently one of my hypotheses, and this is why I asked questions 2) and 3): to get more evidence of what is happening. And this again underlines how important it is to provide "proper" answers to our questions: when they are misleading, we lose time going into dead ends when we try to reason about the problem.
-
@potiuk and @jens-scheffler-bosch I'll provide some details on our setup, hopefully it can help!

Environment info:
- Airflow version: 2.5.0

Answers to the questions above, in order:
- 1
- Redis and Celery
- Yes
- I do not see any restarts during the runtime of the task
- I do not know what is causing this :(
- We had this happen across multiple DAGs
- I can; I've saved the logs out of Azure (about 15k lines in CSV format for a 15-minute period)
- orphaned_tasks_check_interval=300.0
- The first attempt ran for 2 hours and was timed out. The second attempt was started about 35 minutes in and ran for 1h24m before being killed due to the first attempt setting the task as failed
- No

Here are the task attempt logs related to this:

Attempt 1: there was nothing special around the time of the second attempt starting.

Attempt 2:

I looked through the logs, and in the scheduler I noticed this:

It looks to me that, in my case at least, the cause was a failed heartbeat for the task job. I'm unsure what caused the heart to stop beating, but I did see a lot of the following in the logs for worker-2, where the 1st attempt was run, and none in the logs for worker-3, where the 2nd attempt was run. According to the logs, for the specific example shown, it was run successfully on worker-2 40 seconds later, and ran for 23 seconds. Also, even though these events occurred repeatedly from about 16:05 to 16:15, many task instances ran successfully on worker-2 during that time.

Worker-2 logs:

I hope this is all helpful! Let me know if there's anything I missed that would also be useful to add. This happened in our production Airflow, where we currently have 122 active DAGs with dozens to hundreds of task instances per DAG run, thanks to task mapping!
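Since heartbeats and the orphaned-task check keep coming up, it may be worth dumping the relevant `[scheduler]` settings from each environment where this happens; a small sketch using the standard config API (option names as they exist in Airflow 2.5/2.6):

```python
from airflow.configuration import conf

# Settings that influence how quickly a missing heartbeat turns into a task
# being treated as failed/orphaned and rescheduled.
for option in (
    "job_heartbeat_sec",
    "scheduler_heartbeat_sec",
    "scheduler_health_check_threshold",
    "scheduler_zombie_task_threshold",
    "orphaned_tasks_check_interval",
):
    print(option, "=", conf.getfloat("scheduler", option))
```

Comparing these values against the gaps visible in the worker-2 heartbeat errors above might show whether the zombie/orphan thresholds were simply shorter than the heartbeat outage.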
-
Any update, @tarper24? Did increasing the various heartbeat intervals solve the problem for you? It seems very similar issues are still happening on …
-
Apache Airflow version
2.6.3
What happened
We are experiencing an issue where Airflow schedules another task try before the previous (first) task try has even ended. E.g. the first task try ended at 2:41:46 (SUCCESSfully!) and Airflow had already started another task try at 2:41:06.
What you think should happen instead
No scheduling of another try before the previous one has ended
How to reproduce
Use PythonOperator inside TaskGroup with Airflow 2.6.3
Operating System
Debian GNU/Linux 10 (buster)
Versions of Apache Airflow Providers
Deployment
Other Docker-based deployment
Deployment details
Anything else
No response
Are you willing to submit PR?
Code of Conduct