
Deadlock with transaction recovery is possible during Citus upgrades #7875

onurctirtir opened this issue Feb 4, 2025 · 1 comment
onurctirtir commented Feb 4, 2025

This is especially likely on upgrade paths that take strong locks on metadata tables, as in citus--11.2-2--11.3-1.sql.

An example of how this upgrade path can cause a deadlock with transaction recovery:

  1. RecoverTwoPhaseCommits() acquires RowExclusiveLock on pg_dist_transaction via RecoverWorkerTransactions() and doesn't release it until the end of the transaction, because the table is closed with NoLock.
  2. Upgrade script acquires AccessExclusiveLock on pg_dist_authinfo because of the following DDL:
    "ALTER TABLE pg_catalog.pg_dist_authinfo REPLICA IDENTITY USING INDEX pg_dist_authinfo_identification_index;"
  3. RecoverTwoPhaseCommits() moves on to the next worker via RecoverWorkerTransactions(), which implicitly tries to acquire AccessShareLock on pg_dist_authinfo via GetNodeConnection().
  4. Upgrade script tries to acquire AccessExclusiveLock on pg_dist_transaction because of the following DDL:
    "ALTER TABLE pg_catalog.pg_dist_transaction REPLICA IDENTITY USING INDEX pg_dist_transaction_unique_constraint;"

Now the process executing "ALTER EXTENSION citus UPDATE" is waiting to acquire AccessExclusiveLock on pg_dist_transaction while holding AccessExclusiveLock on pg_dist_authinfo, and the maintenance daemon is waiting to acquire AccessShareLock on pg_dist_authinfo while holding RowExclusiveLock on pg_dist_transaction.
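While the two backends are blocked against each other, the cycle can be observed from a third session with a query along these lines (a diagnostic sketch using the standard pg_locks and pg_stat_activity views; nothing here is Citus-specific):

```sql
-- Show who holds or waits for locks on the two metadata tables involved.
-- Rows with granted = false are the waiters forming the deadlock cycle.
SELECT a.pid,
       a.query,
       l.mode,
       l.granted,
       l.relation::regclass AS relation
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation IN ('pg_catalog.pg_dist_transaction'::regclass,
                     'pg_catalog.pg_dist_authinfo'::regclass)
ORDER BY l.granted, a.pid;
```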

As a result, Postgres detects the deadlock and cancels one of the involved processes to resolve it.

Luckily, this doesn't leave the database in a bad state, because Postgres runs the upgrade script within an implicit transaction block, and a retry usually succeeds.

But until we properly fix this issue, disabling 2PC recovery during the Citus upgrade seems like a more reliable workaround than retrying:

```sql
ALTER SYSTEM SET citus.recover_2pc_interval TO -1;
SELECT pg_reload_conf();

ALTER EXTENSION citus UPDATE;

ALTER SYSTEM RESET citus.recover_2pc_interval;
SELECT pg_reload_conf();
```
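Since pg_reload_conf() only signals the server and the reload is processed asynchronously, it may be worth confirming the new value is in effect before starting the upgrade (a small sketch; a value of -1 disables the recovery interval):

```sql
-- Confirm 2PC recovery is disabled before running ALTER EXTENSION.
-- -1 means the maintenance daemon skips transaction recovery rounds.
SHOW citus.recover_2pc_interval;
```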
onurctirtir (Member Author) commented:
Note to future dev:

Once we come up with a fix for this, we should also revert b6b73e2 from the branches where it is present while backporting the fix.

As of today, only main & release-13.0 have b6b73e2.
