fix: harden supervisor recovery and stuck scan by liobrasil · Pull Request #207 · onflow/FlowYieldVaults

liobrasil · 2026-03-11T04:56:47Z

Why This PR Is Needed

This PR fixes two Supervisor liveness issues.

1. Duplicate recovery on healthy recurring vaults

FlowTransactionScheduler can mark a scheduled transaction Executed before the AutoBalancer handler finishes. In that in-flight window, the transaction is no longer Scheduled, the next run may not yet be scheduled, and the Supervisor can falsely classify a healthy recurring vault as stuck.

That causes duplicate recovery churn: the Supervisor tries to recover a vault that is already executing normally.

A pure lastRebalanceTimestamp check is not enough, because if the handler panics after the scheduler marks the tx Executed, that timestamp may never advance. The vault must become recoverable again instead of remaining "active" forever.
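The classification rule described above can be modeled roughly as follows. This is a minimal Python sketch, not the Cadence contract code: `TxStatus`, `VaultSchedule`, `is_stuck`, and the 30-second `EXECUTED_GRACE_SECONDS` value are all hypothetical names and numbers chosen for illustration. The PR only states that the grace window is short and bounded.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TxStatus(Enum):
    SCHEDULED = "scheduled"
    EXECUTED = "executed"
    UNKNOWN = "unknown"


# Hypothetical grace length; the real value is not given in the PR text.
EXECUTED_GRACE_SECONDS = 30.0


@dataclass
class VaultSchedule:
    status: TxStatus
    executed_at: Optional[float] = None  # when the scheduler marked the tx Executed
    next_run_scheduled: bool = False


def is_stuck(vault: VaultSchedule, now: float) -> bool:
    """Return True if the Supervisor should treat the vault as recoverable."""
    # A pending run, current or next, means the vault is healthy.
    if vault.status is TxStatus.SCHEDULED or vault.next_run_scheduled:
        return False
    # Recently Executed: the handler may still be in flight, so count the
    # vault as active, but only until the bounded grace window expires.
    if vault.status is TxStatus.EXECUTED and vault.executed_at is not None:
        return now - vault.executed_at > EXECUTED_GRACE_SECONDS
    # Nothing scheduled and no recent execution: recoverable.
    return True
```

Because the grace check is anchored to the Executed time rather than to lastRebalanceTimestamp, a handler that panics after the scheduler marks the tx Executed still becomes recoverable once the window expires.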

2. Starvation in bounded stuck-scan

The Supervisor scans only a bounded tail window (MAX_BATCH_SIZE) from the stuck-scan ordering.

Admin disable flows can remove recurring config without removing the vault from that ordering. If enough stale non-recurring entries accumulate at the tail, they can consume the scan budget and delay or starve a real stuck recurring vault behind them.
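The starvation scenario can be reproduced with a small model. This is a hedged Python sketch under stated assumptions: `Entry`, `scan_tail`, and the tiny `MAX_BATCH_SIZE = 3` are illustrative only, and the "tail" is assumed to be the end of a plain list. The actual ordering structure in the contract may differ.

```python
from dataclasses import dataclass
from typing import List

MAX_BATCH_SIZE = 3  # hypothetical small value to keep the example readable


@dataclass
class Entry:
    vault_id: int
    recurring: bool      # False models a vault whose recurring config was removed
    stuck: bool = False


def scan_tail(ordering: List[Entry], budget: int = MAX_BATCH_SIZE) -> List[int]:
    """Inspect only the last `budget` entries; return stuck recurring vault ids."""
    if budget <= 0:
        return []
    window = ordering[-budget:]
    return [e.vault_id for e in window if e.recurring and e.stuck]


# Three stale non-recurring entries sit at the tail, in front of a stuck vault.
ordering = [Entry(1, recurring=True, stuck=True),
            Entry(2, recurring=False),
            Entry(3, recurring=False),
            Entry(4, recurring=False)]
starved = scan_tail(ordering)  # vault 1 is outside the window on every run
```

With no pruning, the stale entries stay in place, so every scan spends its whole budget on them and vault 1 is never inspected.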

Before / After

Before

- a vault could lose recurring config without leaving the Supervisor scan ordering
- stale non-recurring entries could consume bounded tail-scan budget
- healthy recurring vaults could be falsely classified as stuck during the optimistic Executed window

After

- new scan participants are recurring-only, and stale non-recurring entries are pruned during bounded tail walks
- recently Executed internally-managed transactions count as active only for a short bounded grace window
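The pruning behavior in the first "After" item can be sketched as a bounded tail walk that removes stale entries as it goes. This is an illustrative Python model, not the contract implementation: `Entry` and `scan_tail_with_pruning` are hypothetical names, and it assumes that pruned entries count against the same per-run budget, so pruning work stays bounded too.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    vault_id: int
    recurring: bool      # False models a stale entry with no recurring config
    stuck: bool = False


def scan_tail_with_pruning(ordering: List[Entry], budget: int) -> List[int]:
    """Walk at most `budget` entries from the tail, pruning stale ones in place."""
    found: List[int] = []
    inspected = 0
    i = len(ordering) - 1
    while i >= 0 and inspected < budget:
        entry = ordering[i]
        if not entry.recurring:
            # Stale entry: remove it so it cannot consume budget on later runs.
            del ordering[i]
        elif entry.stuck:
            found.append(entry.vault_id)
        inspected += 1
        i -= 1  # deleting index i leaves all lower indices unchanged
    return found
```

In this model, a run whose budget is consumed by stale entries still makes progress: those entries are gone afterwards, so a later run reaches the stuck recurring vault behind them.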

Scope / Semantics

This PR does not make yieldVaultRegistry recurring-only.

After this PR:

- yieldVaultRegistry still tracks all live vaults known to the scheduler infrastructure
- the Supervisor’s stuck-scan ordering is the recurring-only subset used for stuck detection

That keeps this PR focused on liveness/recovery. Making registry membership itself follow recurring lifecycle would require broader changes across enable / disable / recovery flows.

Out Of Scope

- recurring off -> on re-enable for a pruned vault; explicit rejoin support will land in a follow-up PR
- @holyfuchs's review suggestion to make the registry itself contain only currently scheduled / recurring vaults, instead of keeping a broader global registry plus a separate recurring-only stuck-scan state

Verification

- flow test cadence/tests/scheduler_mixed_population_regression_test.cdc
- flow test cadence/tests/scheduled_supervisor_test.cdc
- duplicate recovery churn is rejected by the strengthened supervisor stress test
- mixed recurring / non-recurring tail populations are verified not to starve real stuck recurring vault recovery


liobrasil added 3 commits March 10, 2026 22:19

test: tighten supervisor recovery regressions

8d609a4

fix: avoid duplicate supervisor recoveries

8447480

fix: restrict supervisor stuck scan to recurring vaults

66ccb02

liobrasil requested a review from a team as a code owner March 11, 2026 04:56

liobrasil requested a review from holyfuchs March 11, 2026 05:16

liobrasil mentioned this pull request Mar 12, 2026

Fix supervisor: report vault execution so stuck-scan order isn't fixed #187

Merged

liobrasil requested review from Kay-Zee and jordanschalm March 12, 2026 17:26

nvdtf requested changes Mar 13, 2026

cadence/contracts/FlowYieldVaultsAutoBalancers.cdc (resolved)

Merge origin/holyfuchs/supervisor-fix into lionel/fix-supervisor-scan-recovery

88ae670

holyfuchs reviewed Mar 18, 2026

cadence/contracts/FlowYieldVaultsAutoBalancers.cdc (resolved)

cadence/contracts/FlowYieldVaultsAutoBalancers.cdc (resolved)

docs: clarify scheduler registry semantics

9284c7e

liobrasil force-pushed the lionel/fix-supervisor-scan-recovery branch from 67c0e5e to eab7ad0 Compare March 19, 2026 16:28

fix: bound optimistic execution recovery window

999cd1d

liobrasil force-pushed the lionel/fix-supervisor-scan-recovery branch from eab7ad0 to 999cd1d Compare March 19, 2026 18:22

liobrasil added 6 commits March 19, 2026 14:26

fix: bound supervisor stuck-scan pruning work

5959f87

test: fix mixed-population supervisor regression

d773f55

fix: restore manual deferred redeem claim retry

82cdec2

docs: align scheduler docs with scan semantics

f4c8e16

Merge branch 'holyfuchs/supervisor-fix' into lionel/fix-supervisor-scan-recovery

6122a85

test: align mixed-population regression comments

6ef8cc8

liobrasil requested review from a team and nvdtf March 19, 2026 21:03

liobrasil added 2 commits March 19, 2026 17:23

fix: clarify supervisor stuck-scan mutation semantics

66d5eb8

Tighten scheduler recovery grace and docs

8ffc8c7

Base automatically changed from holyfuchs/supervisor-fix to main March 20, 2026 07:11

liobrasil added 2 commits March 23, 2026 20:50

Merge branch 'main' into lionel/fix-supervisor-scan-recovery

28a29ae

docs: note recurring re-enable follow-up

7626b37

liobrasil requested a review from holyfuchs March 24, 2026 05:13

fix: harden supervisor recovery and stuck scan #207
liobrasil wants to merge 16 commits into main from lionel/fix-supervisor-scan-recovery

3 participants
