fix: harden supervisor recovery and stuck scan #207

Open
liobrasil wants to merge 16 commits into main from lionel/fix-supervisor-scan-recovery

@liobrasil commented on Mar 11, 2026

Why This PR Is Needed

This PR fixes two Supervisor liveness issues.

1. Duplicate recovery on healthy recurring vaults

FlowTransactionScheduler can mark a scheduled transaction Executed before the AutoBalancer handler finishes. In that in-flight window, the transaction is no longer Scheduled, the next run may not yet be scheduled, and the Supervisor can falsely classify a healthy recurring vault as stuck.

That causes duplicate recovery churn: the Supervisor tries to recover a vault that is already executing normally.

Checking lastRebalanceTimestamp alone is not enough: if the handler panics after the scheduler marks the transaction Executed, that timestamp may never advance. The vault must then become recoverable again rather than remain "active" forever.
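The classification rule described above can be sketched as a small model. This is a language-agnostic illustration in Python, not the Cadence code from this PR: `TxStatus`, `counts_as_active`, and the 60-second `EXECUTED_GRACE_SECONDS` value are all hypothetical names and numbers chosen for the example.

```python
from enum import Enum, auto
from typing import Optional


class TxStatus(Enum):
    SCHEDULED = auto()
    EXECUTED = auto()
    UNKNOWN = auto()  # e.g. canceled, or no transaction found


# Hypothetical grace period in seconds; the PR does not state the real value.
EXECUTED_GRACE_SECONDS = 60.0


def counts_as_active(
    status: TxStatus,
    executed_at: Optional[float],
    now: float,
    grace: float = EXECUTED_GRACE_SECONDS,
) -> bool:
    """Return True if stuck detection should skip this vault."""
    if status is TxStatus.SCHEDULED:
        return True
    if status is TxStatus.EXECUTED and executed_at is not None:
        # In-flight window: the handler may still be running and may not
        # have scheduled the next run yet, so treat the vault as active,
        # but only until the grace period expires.
        return 0 <= now - executed_at <= grace
    return False
```

Because the grace window is bounded, a vault whose handler panicked after the Executed mark stops counting as active once the window expires and becomes eligible for recovery again.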

2. Starvation in bounded stuck-scan

The Supervisor scans only a bounded tail window (at most MAX_BATCH_SIZE entries) of the stuck-scan ordering on each run.

Admin disable flows can remove recurring config without removing the vault from that ordering. If enough stale non-recurring entries accumulate at the tail, they can consume the scan budget and delay or starve a real stuck recurring vault behind them.
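A toy model makes the starvation concrete. The sketch below is Python rather than Cadence, and `Entry`, `scan_tail`, and the batch size of 3 are assumptions made for illustration only. It models a scan that inspects a fixed number of tail entries and never removes stale ones.

```python
from dataclasses import dataclass
from typing import List

MAX_BATCH_SIZE = 3  # hypothetical value, kept small so the effect is visible


@dataclass
class Entry:
    vault_id: int
    recurring: bool
    stuck: bool = False


def scan_tail(ordering: List[Entry], max_batch: int = MAX_BATCH_SIZE) -> List[int]:
    """Inspect at most max_batch entries from the tail; return stuck recurring IDs."""
    start = max(0, len(ordering) - max_batch)
    return [e.vault_id for e in reversed(ordering[start:]) if e.recurring and e.stuck]


# One genuinely stuck recurring vault, followed by three stale non-recurring entries.
ordering = [Entry(1, recurring=True, stuck=True)]
ordering += [Entry(i, recurring=False) for i in range(2, 5)]

found = scan_tail(ordering)
print(found)  # [] : the stale tail used up the whole scan budget
```

Since nothing removes the stale entries in this model, every later run sees the same tail, and vault 1 is never reached.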

Before / After

Before

  • a vault could lose recurring config without leaving the Supervisor scan ordering
  • stale non-recurring entries could consume bounded tail-scan budget
  • healthy recurring vaults could be falsely classified as stuck during the optimistic Executed window

After

  • new scan participants are recurring-only, and stale non-recurring entries are pruned during bounded tail walks
  • recently Executed internally-managed transactions count as active only for a short bounded grace window
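The pruning behavior in the first "After" bullet can be modeled as follows. This is a hedged Python sketch, not the PR's Cadence code: `Entry` and `scan_tail_with_pruning` are hypothetical names, and the sketch assumes that pruned entries still count against the per-run budget, which the description does not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entry:
    vault_id: int
    recurring: bool
    stuck: bool = False


def scan_tail_with_pruning(
    ordering: List[Entry], max_batch: int
) -> Tuple[List[int], List[Entry]]:
    """Visit at most max_batch tail entries.

    Stale non-recurring entries in the visited window are dropped;
    stuck recurring entries are reported. Returns (stuck_ids, new_ordering).
    """
    start = max(0, len(ordering) - max_batch)
    head, tail = ordering[:start], ordering[start:]
    stuck_ids = [e.vault_id for e in reversed(tail) if e.recurring and e.stuck]
    kept_tail = [e for e in tail if e.recurring]  # prune stale entries
    return stuck_ids, head + kept_tail


ordering = [Entry(1, recurring=True, stuck=True)]
ordering += [Entry(i, recurring=False) for i in range(2, 5)]

first_found, ordering = scan_tail_with_pruning(ordering, max_batch=3)
second_found, ordering = scan_tail_with_pruning(ordering, max_batch=3)
```

Under these assumptions the first run finds nothing but clears the three stale entries, and the second run reaches vault 1. The sketch leaves out what happens to a vault after it is recovered.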

Scope / Semantics

This PR does not make yieldVaultRegistry recurring-only.

After this PR:

  • yieldVaultRegistry still tracks all live vaults known to the scheduler infrastructure
  • the Supervisor’s stuck-scan ordering is the recurring-only subset used for stuck detection
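The relationship between the two collections can be pictured with a minimal model. The Python below is illustrative only: `SchedulerState`, `register`, and the field names are hypothetical and do not mirror the contract's actual storage layout.

```python
from typing import Dict, List


class SchedulerState:
    """Toy model: a global registry plus a recurring-only scan ordering."""

    def __init__(self) -> None:
        self.registry: Dict[int, bool] = {}  # vault_id -> recurring flag
        self.scan_ordering: List[int] = []   # used for stuck detection only

    def register(self, vault_id: int, recurring: bool) -> None:
        # Every live vault is tracked in the registry.
        self.registry[vault_id] = recurring
        # Only recurring vaults join the stuck-scan ordering.
        if recurring and vault_id not in self.scan_ordering:
            self.scan_ordering.append(vault_id)


state = SchedulerState()
state.register(1, recurring=True)
state.register(2, recurring=False)
```

After these two calls the registry holds both vaults, while the scan ordering holds only vault 1.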

That keeps this PR focused on liveness/recovery. Making registry membership itself follow recurring lifecycle would require broader changes across enable / disable / recovery flows.

Out Of Scope

  • re-enabling recurring (off -> on) for a pruned vault; explicit rejoin support will land in a follow-up PR
  • @holyfuchs's review suggestion to make the registry itself contain only currently scheduled / recurring vaults, instead of keeping a broader global registry plus a separate recurring-only stuck-scan state

Verification

  • flow test cadence/tests/scheduler_mixed_population_regression_test.cdc
  • flow test cadence/tests/scheduled_supervisor_test.cdc
  • duplicate recovery churn is rejected by the strengthened supervisor stress test
  • mixed recurring / non-recurring tail populations are verified not to starve real stuck recurring vault recovery

@liobrasil force-pushed the lionel/fix-supervisor-scan-recovery branch from 67c0e5e to eab7ad0 on March 19, 2026 16:28
@liobrasil force-pushed the lionel/fix-supervisor-scan-recovery branch from eab7ad0 to 999cd1d on March 19, 2026 18:22
@liobrasil requested review from a team and nvdtf on March 19, 2026 21:03
Base automatically changed from holyfuchs/supervisor-fix to main on March 20, 2026 07:11
@liobrasil requested a review from holyfuchs on March 24, 2026 05:13

3 participants