Scheduler was dead when database service failover. #38279
-
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
-
@robintian001 could you share some logs for this? Also, if I understand correctly, the scheduler exits when your MySQL service fails over. If so, that seems reasonable, since the scheduler expects an active database connection to be available in order to run. cc: @potiuk
-
The scheduler's logs:

2024-03-16 17:11:08,194 INFO - Adopting or resetting orphaned tasks for active dag runs
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):

This is the message from the Microsoft support team:

It can be seen from the background connection log that your standby server restarts at 17:00 UTC on 3/16 for regular maintenance, and restarts at 17:07 UTC on 3/16 for about 7 minutes. In this case, your standby server will be failover to the new primary server for you to connect. At the same time, your original primary server restarts at 17:11 UTC on 3/16 and completes the restart at 17:19 UTC on 3/16, which takes about 8 minutes.

From the connection log, your initial standby server restarts before the initial primary server, so your standby server that has restarted can connect before your initial primary server restarts. From the figure below about the conversion of HA server, you can see that at 17:12 UTC on 3/16, your HA server has been switched, which means that at this time your original standby server has become your new primary server available for connection.

As you can see from the following figure, your total failover time is about 2 minutes. If you feel that you are not connected for about 20 minutes, it may be related to the retry mechanism of your client.
-
Yes. It's entirely expected for Airflow to go down in such a case. We have no resilience against the database being unavailable. If someone (perhaps @robintian001) would like to add such resilience, that would be an interesting AIP (Airflow Improvement Proposal) to write.
-
Converted to discussion.
-
Hi @robintian001, do you have a solution for this case now? We are encountering the same issue, where our internal MySQL might fail over and cause the scheduler job to exit. Would it be possible to introduce a DAO layer or something like a state backend to provide a unified interface for DB access? This layer could handle various scenarios, such as switchover, etc. cc @potiuk
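To make the idea concrete, here is a minimal sketch of the kind of retry logic such a layer might wrap around each database call. It is not how Airflow works today, and nothing here is an Airflow API: `TransientDBError` and `call_with_retry` are hypothetical names, and the attempt count and delays are arbitrary assumptions that would need to be sized to the real failover window.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientDBError(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' error."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on TransientDBError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDBError:
            if attempt == max_attempts:
                # Give up: the outage outlasted the retry budget.
                raise
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            sleep(delay)
    # Only reachable when max_attempts < 1, so the loop never ran.
    raise ValueError("max_attempts must be at least 1")
```

With these assumed defaults the caller waits at most 1 + 2 + 4 + 8 = 15 seconds before the error propagates, so a failover of about 2 minutes would need a larger budget. A real layer would also have to decide which operations are safe to repeat after a dropped connection.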
-
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.7.1
What happened?
The scheduler died when the database service failed over.
What you think should happen instead?
No response
How to reproduce
Manually trigger a failover of the MySQL service.
Operating System
Linux (Red Hat Enterprise Linux 9.2)
Versions of Apache Airflow Providers
Airflow 2.7.1
Deployment
Other
Deployment details
Deployment on an Azure VM; the database is Azure's MySQL Flexible Server.
Anything else?
No response
Are you willing to submit PR?
Code of Conduct