fix: harden supervisor recovery and stuck scan #207
Open
nvdtf requested changes Mar 13, 2026
holyfuchs reviewed Mar 18, 2026
67c0e5e to eab7ad0
eab7ad0 to 999cd1d
Why This PR Is Needed
This PR fixes two Supervisor liveness issues.
1. Duplicate recovery on healthy recurring vaults
FlowTransactionScheduler can mark a scheduled transaction Executed before the AutoBalancer handler finishes. In that in-flight window, the transaction is no longer Scheduled, the next run may not yet be scheduled, and the Supervisor can falsely classify a healthy recurring vault as stuck.
That causes duplicate recovery churn: the Supervisor tries to recover a vault that is already executing normally.
A pure lastRebalanceTimestamp check is not enough, because if the handler panics after the scheduler marks the tx Executed, that timestamp may never advance. The vault must become recoverable again instead of remaining "active" forever.
2. Starvation in bounded stuck-scan
The Supervisor scans only a bounded tail window (MAX_BATCH_SIZE) from the stuck-scan ordering.
Admin disable flows can remove recurring config without removing the vault from that ordering. If enough stale non-recurring entries accumulate at the tail, they can consume the scan budget and delay or starve a real stuck recurring vault behind them.
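The two failure modes described above can be illustrated with a small model. This is only a sketch in Python, not the Cadence implementation: `Vault`, `is_active`, `scan_tail`, `GRACE_WINDOW_SECONDS`, and the concrete constant values are all hypothetical names chosen for illustration. Pruning stale entries during the scan is an assumption about how starvation might be avoided, not a description of the actual change.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical values for illustration; the real constants live in the contracts.
GRACE_WINDOW_SECONDS = 60.0
MAX_BATCH_SIZE = 3


@dataclass
class Vault:
    vault_id: int
    recurring: bool
    tx_status: Optional[str]   # "Scheduled", "Executed", or None
    executed_at: float = 0.0   # time the scheduler marked the tx Executed


def is_active(v: Vault, now: float) -> bool:
    """Scheduled counts as active; Executed counts only within a bounded grace window."""
    if v.tx_status == "Scheduled":
        return True
    if v.tx_status == "Executed":
        # The handler may still be running. If it panicked, the window
        # expires and the vault becomes recoverable again.
        return now - v.executed_at <= GRACE_WINDOW_SECONDS
    return False


def scan_tail(ordering: List[Vault], now: float) -> List[int]:
    """Scan at most MAX_BATCH_SIZE tail entries and return stuck recurring vault IDs.

    Assumption: non-recurring entries are pruned from the ordering as they are
    seen, so they cannot consume the scan budget again on the next run.
    """
    stuck: List[int] = []
    for v in ordering[-MAX_BATCH_SIZE:]:  # the slice is a copy, so removal is safe
        if not v.recurring:
            ordering.remove(v)
            continue
        if not is_active(v, now):
            stuck.append(v.vault_id)
    return stuck
```

In this model, a vault whose transaction was marked Executed ten seconds ago is still treated as active, while one marked Executed long ago is reported as stuck. A stuck vault sitting behind `MAX_BATCH_SIZE` stale entries is missed by the first scan but reached by the second, because the first scan pruned the stale tail.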
Before / After
Before
Executed window
After
Executed internally-managed transactions count as active only for a short bounded grace window
Scope / Semantics
This PR does not make yieldVaultRegistry recurring-only.
After this PR:
yieldVaultRegistry still tracks all live vaults known to the scheduler infrastructure
That keeps this PR focused on liveness/recovery. Making registry membership itself follow recurring lifecycle would require broader changes across enable / disable / recovery flows.
Out Of Scope
off -> on re-enable for a pruned vault; explicit rejoin support will land in a follow-up PR
Verification
flow test cadence/tests/scheduler_mixed_population_regression_test.cdc
flow test cadence/tests/scheduled_supervisor_test.cdc