
Deadlock with transaction recovery is possible during Citus upgrades #7875

onurctirtir opened this issue Feb 4, 2025 · 1 comment
onurctirtir commented Feb 4, 2025

This is especially likely on upgrade paths that take strong locks on metadata tables, as in citus--11.2-2--11.3-1.sql.

An example of how this upgrade path can cause a deadlock with transaction recovery:

  1. RecoverTwoPhaseCommits() acquires RowExclusiveLock on pg_dist_transaction via RecoverWorkerTransactions() and doesn't release it until the end of the transaction, because the table is closed with NoLock.
  2. Upgrade script acquires AccessExclusiveLock on pg_dist_authinfo because of the following DDL:
    "ALTER TABLE pg_catalog.pg_dist_authinfo REPLICA IDENTITY USING INDEX pg_dist_authinfo_identification_index;"
  3. RecoverTwoPhaseCommits() moves on to the next worker via RecoverWorkerTransactions(), which implicitly tries to acquire AccessShareLock on pg_dist_authinfo via GetNodeConnection().
  4. Upgrade script tries to acquire AccessExclusiveLock on pg_dist_transaction because of the following DDL:
    "ALTER TABLE pg_catalog.pg_dist_transaction REPLICA IDENTITY USING INDEX pg_dist_transaction_unique_constraint;"

Now the process executing "ALTER EXTENSION citus UPDATE" is waiting to acquire AccessExclusiveLock on pg_dist_transaction while holding AccessExclusiveLock on pg_dist_authinfo, and the maintenance daemon is waiting to acquire AccessShareLock on pg_dist_authinfo while holding RowExclusiveLock on pg_dist_transaction.
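While the two backends are blocked against each other, the cycle can be observed from a third session with a query along these lines (a diagnostic sketch using the standard pg_locks and pg_stat_activity views; nothing here is Citus-specific):

```sql
-- Show who holds or waits for locks on the two metadata tables involved.
-- Rows with granted = false are the waiters forming the deadlock cycle.
SELECT a.pid,
       a.query,
       l.mode,
       l.granted,
       l.relation::regclass AS relation
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation IN ('pg_catalog.pg_dist_transaction'::regclass,
                     'pg_catalog.pg_dist_authinfo'::regclass)
ORDER BY l.granted, a.pid;
```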

As a result, Postgres detects the deadlock and cancels one of the involved processes to resolve it.

Luckily, this doesn't leave the database in a bad state, because Postgres runs the upgrade script within an implicit transaction block, and a retry usually succeeds.

But until we properly fix this issue, disabling 2PC recovery during the Citus upgrade seems like a more reliable workaround than retrying:

```sql
ALTER SYSTEM SET citus.recover_2pc_interval TO -1;
SELECT pg_reload_conf();

ALTER EXTENSION citus UPDATE;

ALTER SYSTEM RESET citus.recover_2pc_interval;
SELECT pg_reload_conf();
```
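Since pg_reload_conf() only signals the server and the reload is processed asynchronously, it may be worth confirming the new value is in effect before starting the upgrade (a small sketch; a value of -1 disables the recovery interval):

```sql
-- Confirm 2PC recovery is disabled before running ALTER EXTENSION.
-- -1 means the maintenance daemon skips transaction recovery rounds.
SHOW citus.recover_2pc_interval;
```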
onurctirtir (Member Author) commented:
Note to future dev:

Once we come up with a fix for this, we should also revert b6b73e2 from the branches where it is present while backporting the fix.

As of today, only main & release-13.0 have b6b73e2.
