Scheduler was dead when database service failover. #38279

robintian001 · 2024-03-18T09:32:00Z

robintian001
Mar 18, 2024

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.7.1

What happened?

The scheduler died when the database service failed over.

What you think should happen instead?

No response

How to reproduce

Manually trigger a failover of the MySQL service.

Operating System

Linux (rhel 9.2) RedHat

Versions of Apache Airflow Providers

Airflow 2.7.1

Deployment

Other

Deployment details

Deployed on an Azure VM; the database is Azure MySQL Flexible Server.

Anything else?

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

2024-03-18T09:32:04Z

boring-cyborg[bot]
bot Mar 18, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

aritra24 · 2024-03-19T07:33:55Z

aritra24
Mar 19, 2024
Collaborator

@robintian001 could you share some logs for this? Also, if I understand correctly, the scheduler exits when your MySQL service fails over. If so, that seems reasonable, since the scheduler expects an active database connection to be available in order to run. cc: @potiuk

robintian001 · 2024-03-19T07:53:09Z

robintian001
Mar 19, 2024
Author

The Scheduler's logs

2024-03-16 17:11:08,194 INFO - Adopting or resetting orphaned tasks for active dag runs
2024-03-16 17:11:22,864 ERROR - Exception when executing SchedulerJob._run_scheduler_loop
Traceback (most recent call last):
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3371, in _wrap_pool_connect
return fn()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 327, in connect
return _ConnectionFairy._checkout(self)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 894, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 498, in checkout
rec.checkin_failed(err, fairy_was_created=False)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
dbapi_connection = rec.get_connection()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 630, in get_connection
self.__connect()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 691, in __connect
pool.logger.debug("Error on connect(): %s", e)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 686, in __connect
self.dbapi_connection = connection = pool._invoke_creator(self)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 574, in connect
return dialect.connect(*cargs, **cparams)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 598, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/home/airflow-user/.local/lib/python3.9/site-packages/MySQLdb/__init__.py", line 121, in Connect
return Connection(*args, **kwargs)
File "/home/airflow-user/.local/lib/python3.9/site-packages/MySQLdb/connections.py", line 193, in __init__
super().__init__(*args, **kwargs2)
MySQLdb.OperationalError: (2003, "Can't connect to MySQL server on 'apec-mysql-flexible.mysql.database.azure.com:3306' (111)")

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job_runner.py", line 845, in _execute
self._run_scheduler_loop()
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job_runner.py", line 979, in _run_scheduler_loop
num_queued_tis = self._do_scheduling(session)
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job_runner.py", line 1053, in _do_scheduling
self._create_dagruns_for_dags(guard, session)
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/utils/retries.py", line 91, in wrapped_function
for attempt in run_with_db_retries(max_retries=retries, logger=logger, **retry_kwargs):
File "/home/airflow-user/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 347, in __iter__
do = self.iter(retry_state=retry_state)
File "/home/airflow-user/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 325, in iter
raise retry_exc.reraise()
File "/home/airflow-user/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 158, in reraise
raise self.last_attempt.result()
File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/utils/retries.py", line 100, in wrapped_function
return func(*args, **kwargs)
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job_runner.py", line 1120, in _create_dagruns_for_dags
query, dataset_triggered_dag_info = DagModel.dags_needing_dagruns(session)
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/models/dag.py", line 3615, in dags_needing_dagruns
for x in session.execute(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1716, in execute
conn = self._connection_for_bind(bind)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1555, in _connection_for_bind
return self._transaction._connection_for_bind(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 750, in _connection_for_bind
conn = bind.connect()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/future/engine.py", line 406, in connect
return super(Engine, self).connect()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3325, in connect
return self._connection_cls(self, close_with_result=close_with_result)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
else engine.raw_connection()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3404, in raw_connection
return self._wrap_pool_connect(self.pool.connect, _connection)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3374, in _wrap_pool_connect
Connection._handle_dbapi_exception_noconnection(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2208, in _handle_dbapi_exception_noconnection
util.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3371, in _wrap_pool_connect
return fn()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 327, in connect
return _ConnectionFairy._checkout(self)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 894, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 498, in checkout
rec.checkin_failed(err, fairy_was_created=False)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
dbapi_connection = rec.get_connection()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 630, in get_connection
self.__connect()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 691, in __connect
pool.logger.debug("Error on connect(): %s", e)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 686, in __connect
self.dbapi_connection = connection = pool._invoke_creator(self)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 574, in connect
return dialect.connect(*cargs, **cparams)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 598, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/home/airflow-user/.local/lib/python3.9/site-packages/MySQLdb/__init__.py", line 121, in Connect
return Connection(*args, **kwargs)
File "/home/airflow-user/.local/lib/python3.9/site-packages/MySQLdb/connections.py", line 193, in __init__
super().__init__(*args, **kwargs2)
sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (2003, "Can't connect to MySQL server on 'apec-mysql-flexible.mysql.database.azure.com:3306' (111)")
(Background on this error at: https://sqlalche.me/e/14/e3q8)
2024-03-16 17:11:22,868 INFO - Shutting down LocalExecutor; waiting for running tasks to finish. Signal again if you don't want to wait.
2024-03-16 17:11:22,910 INFO - Sending Signals.SIGTERM to group 1732789. PIDs of all processes in the group: []
2024-03-16 17:11:22,910 INFO - Sending the signal Signals.SIGTERM to group 1732789
2024-03-16 17:11:22,910 INFO - Sending the signal Signals.SIGTERM to process 1732789 as process group is missing.
2024-03-16 17:11:22,910 INFO - Exited execute loop
2024-03-16 17:11:23,161 ERROR - Exception when running scheduler job
Traceback (most recent call last):
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3371, in _wrap_pool_connect
return fn()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 327, in connect
return _ConnectionFairy._checkout(self)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 894, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 498, in checkout
rec.checkin_failed(err, fairy_was_created=False)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
dbapi_connection = rec.get_connection()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 630, in get_connection
self.__connect()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 691, in __connect
pool.logger.debug("Error on connect(): %s", e)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 686, in __connect
self.dbapi_connection = connection = pool._invoke_creator(self)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 574, in connect
return dialect.connect(*cargs, **cparams)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 598, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/home/airflow-user/.local/lib/python3.9/site-packages/MySQLdb/__init__.py", line 121, in Connect
return Connection(*args, **kwargs)
File "/home/airflow-user/.local/lib/python3.9/site-packages/MySQLdb/connections.py", line 193, in __init__
super().__init__(*args, **kwargs2)
MySQLdb.OperationalError: (2003, "Can't connect to MySQL server on 'apec-mysql-flexible.mysql.database.azure.com:3306' (111)")

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/jobs/job.py", line 289, in run_job
return execute_job(job, execute_callable=execute_callable)
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/jobs/job.py", line 318, in execute_job
ret = execute_callable()
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job_runner.py", line 845, in _execute
self._run_scheduler_loop()
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job_runner.py", line 979, in _run_scheduler_loop
num_queued_tis = self._do_scheduling(session)
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job_runner.py", line 1053, in _do_scheduling
self._create_dagruns_for_dags(guard, session)
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/utils/retries.py", line 91, in wrapped_function
for attempt in run_with_db_retries(max_retries=retries, logger=logger, **retry_kwargs):
File "/home/airflow-user/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 347, in __iter__
do = self.iter(retry_state=retry_state)
File "/home/airflow-user/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 325, in iter
raise retry_exc.reraise()
File "/home/airflow-user/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 158, in reraise
raise self.last_attempt.result()
File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/utils/retries.py", line 100, in wrapped_function
return func(*args, **kwargs)
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job_runner.py", line 1120, in _create_dagruns_for_dags
query, dataset_triggered_dag_info = DagModel.dags_needing_dagruns(session)
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/models/dag.py", line 3615, in dags_needing_dagruns
for x in session.execute(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1716, in execute
conn = self._connection_for_bind(bind)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1555, in _connection_for_bind
return self._transaction._connection_for_bind(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 750, in _connection_for_bind
conn = bind.connect()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/future/engine.py", line 406, in connect
return super(Engine, self).connect()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3325, in connect
return self._connection_cls(self, close_with_result=close_with_result)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
else engine.raw_connection()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3404, in raw_connection
return self._wrap_pool_connect(self.pool.connect, _connection)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3374, in _wrap_pool_connect
Connection._handle_dbapi_exception_noconnection(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2208, in _handle_dbapi_exception_noconnection
util.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3371, in _wrap_pool_connect
return fn()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 327, in connect
return _ConnectionFairy._checkout(self)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 894, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 498, in checkout
rec.checkin_failed(err, fairy_was_created=False)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
dbapi_connection = rec.get_connection()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 630, in get_connection
self.__connect()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 691, in __connect
pool.logger.debug("Error on connect(): %s", e)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 686, in __connect
self.dbapi_connection = connection = pool._invoke_creator(self)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 574, in connect
return dialect.connect(*cargs, **cparams)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 598, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/home/airflow-user/.local/lib/python3.9/site-packages/MySQLdb/__init__.py", line 121, in Connect
return Connection(*args, **kwargs)
File "/home/airflow-user/.local/lib/python3.9/site-packages/MySQLdb/connections.py", line 193, in __init__
super().__init__(*args, **kwargs2)
sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (2003, "Can't connect to MySQL server on 'apec-mysql-flexible.mysql.database.azure.com:3306' (111)")
(Background on this error at: https://sqlalche.me/e/14/e3q8)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3371, in _wrap_pool_connect
return fn()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 327, in connect
return _ConnectionFairy._checkout(self)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 894, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 498, in checkout
rec.checkin_failed(err, fairy_was_created=False)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
dbapi_connection = rec.get_connection()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 630, in get_connection
self.__connect()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 691, in __connect
pool.logger.debug("Error on connect(): %s", e)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 686, in __connect
self.dbapi_connection = connection = pool._invoke_creator(self)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 574, in connect
return dialect.connect(*cargs, **cparams)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 598, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/home/airflow-user/.local/lib/python3.9/site-packages/MySQLdb/__init__.py", line 121, in Connect
return Connection(*args, **kwargs)
File "/home/airflow-user/.local/lib/python3.9/site-packages/MySQLdb/connections.py", line 193, in __init__
super().__init__(*args, **kwargs2)
MySQLdb.OperationalError: (2003, "Can't connect to MySQL server on 'apec-mysql-flexible.mysql.database.azure.com:3306' (111)")

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/cli/commands/scheduler_command.py", line 47, in _run_scheduler_job
run_job(job=job_runner.job, execute_callable=job_runner._execute)
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/utils/session.py", line 77, in wrapper
return func(*args, session=session, **kwargs)
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/jobs/job.py", line 291, in run_job
job.complete_execution(session=session)
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/utils/session.py", line 74, in wrapper
return func(*args, **kwargs)
File "/home/airflow-user/.local/lib/python3.9/site-packages/airflow/jobs/job.py", line 237, in complete_execution
session.merge(self)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3056, in merge
return self._merge(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3136, in _merge
merged = self.get(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2853, in get
return self._get_impl(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2975, in _get_impl
return db_load_fn(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/loading.py", line 530, in load_on_pk_identity
session.execute(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1716, in execute
conn = self._connection_for_bind(bind)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1555, in _connection_for_bind
return self._transaction._connection_for_bind(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 750, in _connection_for_bind
conn = bind.connect()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/future/engine.py", line 406, in connect
return super(Engine, self).connect()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3325, in connect
return self._connection_cls(self, close_with_result=close_with_result)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
else engine.raw_connection()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3404, in raw_connection
return self._wrap_pool_connect(self.pool.connect, _connection)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3374, in _wrap_pool_connect
Connection._handle_dbapi_exception_noconnection(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2208, in _handle_dbapi_exception_noconnection
util.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 3371, in _wrap_pool_connect
return fn()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 327, in connect
return _ConnectionFairy._checkout(self)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 894, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 498, in checkout
rec.checkin_failed(err, fairy_was_created=False)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
dbapi_connection = rec.get_connection()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 630, in get_connection
self.__connect()
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 691, in __connect
pool.logger.debug("Error on connect(): %s", e)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/pool/base.py", line 686, in __connect
self.dbapi_connection = connection = pool._invoke_creator(self)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/create.py", line 574, in connect
return dialect.connect(*cargs, **cparams)
File "/home/airflow-user/.local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 598, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/home/airflow-user/.local/lib/python3.9/site-packages/MySQLdb/__init__.py", line 121, in Connect
return Connection(*args, **kwargs)
File "/home/airflow-user/.local/lib/python3.9/site-packages/MySQLdb/connections.py", line 193, in __init__
super().__init__(*args, **kwargs2)
sqlalchemy.exc.OperationalError: (MySQLdb.OperationalError) (2003, "Can't connect to MySQL server on 'apec-mysql-flexible.mysql.database.azure.com:3306' (111)")
(Background on this error at: https://sqlalche.me/e/14/e3q8)

This is the message from the Microsoft support team:

It can be seen from the background connection log that your standby server restarts at 17:00 UTC on 3/16 for regular maintenance, and restarts at 17:07 UTC on 3/16 for about 7 minutes. In this case, your standby server fails over to become the new primary server that you connect to. At the same time, your original primary server restarts at 17:11 UTC on 3/16 and completes the restart at 17:19 UTC on 3/16, which takes about 8 minutes. From the connection log, your initial standby server restarts before the initial primary server, so your restarted standby server can accept connections before your initial primary server restarts.

From the figure below about the conversion of the HA server, you can see that at 17:12 UTC on 3/16 your HA server was switched, which means that at this time your original standby server had become your new primary server and was available for connection. As you can see from the following figure, your total failover time is about 2 minutes. If you found that you could not connect for about 20 minutes, that may be related to your client's retry mechanism.

potiuk · 2024-03-19T08:35:26Z

potiuk
Mar 19, 2024
Collaborator

Yes. It's entirely expected for Airflow to go down in such a case. We have no resilience against a missing database. If someone, such as @robintian001, would like to add such resilience, that would be an interesting AIP (Airflow Improvement Proposal) to write.

6 replies

robintian001 Mar 19, 2024
Author

While troubleshooting the problem, we observed the whole Airflow deployment. The web server returned to normal once the database was reachable again, but the scheduler process had crashed.

potiuk Mar 19, 2024
Collaborator

Yes, that's expected. The right way to recover is to restart the Airflow scheduler when it crashes. This is a perfectly valid way of dealing with the problem. But if your team wants to increase robustness, you are most welcome to design, propose, and implement a way to make such failover handling more robust. Airflow is developed by more than 2800 contributors, so if your team is interested, contributing such a feature is a great way to give back to the free software you use. It's a lot of work, but if you are committed to it and would like to dedicate some engineering time, that would be fantastic.

robintian001 Mar 20, 2024
Author

After some thought and reading the Airflow source code, I think we could let the user define max in the wait=tenacity.wait_random_exponential(multiplier=0.5, max=5) statement in the run_with_db_retries function, for example:
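
One way to picture this suggestion is the minimal sketch below. It is not Airflow's code and does not use tenacity: random_exponential_wait only loosely mirrors the idea of a capped random exponential wait, and call_with_db_retries, the EXAMPLE_DB_RETRY_MAX_WAIT environment variable, and the choice of ConnectionError as the retryable error are all hypothetical names introduced for illustration.

```python
import os
import random
import time


def random_exponential_wait(attempt, multiplier=0.5, max_wait=5.0):
    """Return a random delay between 0 and min(multiplier * 2**(attempt - 1), max_wait)."""
    upper = min(multiplier * 2 ** (attempt - 1), max_wait)
    return random.uniform(0, upper)


def call_with_db_retries(func, retries=3, max_wait=None, sleep=time.sleep,
                         retry_on=(ConnectionError,)):
    """Call func(), retrying on the given exceptions with a capped random backoff."""
    if max_wait is None:
        # HYPOTHETICAL setting name; Airflow does not define this variable.
        max_wait = float(os.environ.get("EXAMPLE_DB_RETRY_MAX_WAIT", "5"))
    for attempt in range(1, retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                # Out of attempts: re-raise so the caller (or a supervisor) decides.
                raise
            sleep(random_exponential_wait(attempt, max_wait=max_wait))
```

With a larger cap and more retries, a wrapped call could wait out a failover of a couple of minutes, but only for the calls that are actually wrapped this way.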

potiuk Mar 20, 2024
Collaborator

It will not help in all cases. We only do DB retries in selected cases where we know they will not be problematic, because a retry can conflict with state stored in memory. A large number of the SQL queries run by the scheduler have no retries, precisely because of their potential impact on that in-memory state.

Generally, implementing (and testing) resilience against longer periods of DB unavailability will likely require analysing and handling all the queries the scheduler can run, and implementing a specific solution to make sure that it can survive such cases. It will also require an extensive test harness for such cases if it is to remain maintainable in the future.

The design and implementation of such an approach should be documented in an AIP and very thoroughly examined and tested.

You have to remember that Airflow is an open-source project developed by volunteers, and anything like this needs to be thoroughly and automatically tested by a test suite in CI, because we have no team of testers who would take every future Airflow release through manual testing of these kinds of automatic failover scenarios.

Another option (other than contributing an automatic test harness to cover all the cases) is for a company like yours to commit time and effort to such manual testing, but that would be extremely difficult to coordinate and would hold us back from making releases as frequently as we would like to, so it's practically impossible.

As opposed to that, exiting Airflow when the DB connection is lost and letting the deployment monitor and restart it is a far simpler solution that does not require extensive testing, because the state stored in memory is discarded automatically when the scheduler shuts down.

So you have two choices:

implement handling of Airflow shutting down in such cases in your deployment
or contribute the design, implementation, and testing needed to make sure that Airflow handles prolonged periods of DB unavailability without a restart

That's basically the choice you have.

robintian001 Mar 20, 2024
Author

Thank you very much for your suggestion. We will test our scheme in our own environment.

potiuk · 2024-03-19T08:35:38Z

potiuk
Mar 19, 2024
Collaborator

Converted to discussion.

0 replies

TakawaAkirayo · 2024-08-13T08:47:47Z

TakawaAkirayo
Aug 13, 2024

Hi @robintian001, do you have a solution for this case now?

We are encountering the same issue, where our internal MySQL might fail over and cause the scheduler job to exit. Would it be possible to introduce a DAO layer or something like a state backend to provide a unified interface for DB access? This layer could handle various scenarios, such as switchover, etc.

cc @potiuk

5 replies

robintian001 Oct 15, 2024
Author

Hi @TakawaAkirayo.
We now have a crude solution: a process on the master watches the scheduler process and automatically starts a new scheduler when it finds that the scheduler process no longer exists.
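A watcher of this kind could be sketched roughly as follows. This is an illustration under assumptions, not the script described above: `supervise` is a hypothetical helper, and the restart count is bounded only so that the example terminates (a real watcher would loop indefinitely).

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts, delay=5.0):
    """Run cmd, restart it each time it exits, and return the exit codes seen."""
    exit_codes = []
    for attempt in range(max_restarts + 1):
        if attempt > 0:
            time.sleep(delay)  # back off before starting a new process
        # subprocess.run blocks until the child process has exited.
        exit_codes.append(subprocess.run(cmd).returncode)
    return exit_codes


# Demonstration with a child process that exits immediately with status 1.
codes = supervise(
    [sys.executable, "-c", "raise SystemExit(1)"],
    max_restarts=2,
    delay=0,
)
```

For the scheduler, the command would be something like `["airflow", "scheduler"]`. Each restart begins with a fresh process, so any in-memory state left over from before the failover is discarded.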

potiuk Oct 15, 2024
Collaborator

We are encountering the same issue, where our internal MySQL might fail over and cause the scheduler job to exit. Would it be possible to introduce a DAO layer or something like a state backend to provide a unified interface for DB access? This layer could handle various scenarios, such as switchover, etc.

Surely. Usually systems that do this are an order of magnitude more complex than ones that don't, because they have to handle various cases. So if someone (your company) can sponsor the development and dedicate a team to it for a few months of work, then yes, surely it could be done.

The alternative is to restart the failed process when monitoring detects that it has failed. That has been the standard approach to managing running applications for as long as I have been working in IT, and pretty much all standard ways of running long-running applications do it, starting from systemd and ending with Kubernetes, which even has liveness probes that check not only whether the process is running but also whether it is "alive". You get it working out of the box if you choose to run on one of those 20- and 10-year-old solutions.
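The difference between "the process is running" and "the process is alive" can be shown with a small heartbeat check. This is a sketch under assumptions: the heartbeat file and the `is_alive` helper are hypothetical, and it presumes the monitored process touches the file periodically.

```python
import os
import tempfile
import time


def is_alive(heartbeat_path, max_age_seconds, now=None):
    """Return True if the heartbeat file exists and was modified recently."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(heartbeat_path)
    except OSError:
        return False  # no heartbeat file at all
    return age <= max_age_seconds


with tempfile.NamedTemporaryFile() as hb:
    # The file was just created, so the heartbeat is fresh.
    fresh = is_alive(hb.name, max_age_seconds=30)
    # Pretend an hour has passed without the file being touched.
    stale = is_alive(hb.name, max_age_seconds=30, now=time.time() + 3600)

# The temporary file has been deleted at this point.
missing = is_alive(hb.name, max_age_seconds=30)
```

A probe script built on this would exit with status 0 when `is_alive` returns True and non-zero otherwise. Kubernetes exec probes treat exit status 0 as success, so a stale heartbeat leads to a restart even though the process still exists.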

robintian001 Oct 15, 2024
Author

We are currently deployed on VMs. We have also studied deploying on K8S, but found that one component has a memory leak bug that often triggers our monitoring alarms, so we have not tried this again for the time being. If a probe can be used in the container to perform the liveness check and the restart, that would work too. For now, I think using a process (or probe) to restart the scheduler process meets our current availability requirements.

TakawaAkirayo Oct 15, 2024

@robintian001 Thanks for the response. We did something similar. We now use supervisord to start the Airflow scheduler along with other critical components. It monitors the processes and restarts them if they fail.

TakawaAkirayo Oct 15, 2024

@potiuk Thank you for the response! We are planning to migrate a significant amount of heavy workloads from other scheduler systems to Airflow. There are critical workflows with strict SLA/SLE requirements, so stability is paramount. We've decided to invest resources into this, such as introducing a DAO layer (or the internal API on Airflow's roadmap), and other improvements are underway. We don't want to diverge too much from the community, so it's good to know that some of our ideas align with the community's direction.
