Skip to content

Abandon alarms that are deleted due too many retries#6396

Open
jqmmes wants to merge 7 commits intomainfrom
joaquim/abandon-deleted-alarm
Open

Abandon alarms that are deleted due too many retries#6396
jqmmes wants to merge 7 commits intomainfrom
joaquim/abandon-deleted-alarm

Conversation

@jqmmes
Copy link
Contributor

@jqmmes jqmmes commented Mar 24, 2026

When we delete an alarm because it retried too many times, we were not removing it from actor-cache or actor-sqlite.
This PR adds a new rpc call to abandonAlarm. This can be used to clear the actor-cache and actor-sqlite state when we delete an alarm that never successfully ran.

@jqmmes jqmmes force-pushed the joaquim/abandon-deleted-alarm branch from 4fa841e to f7af926 Compare March 24, 2026 11:55
Copy link
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

This PR adds an abandonAlarm RPC call and implementations in ActorCache/ActorSqlite to clear stale alarm state after AlarmManager gives up retrying, gated behind an autogate.

Issues found (ranked by severity):

  1. [HIGH] Race condition: scheduledTime is accepted but never checked -- Both ActorCache::abandonAlarm and ActorSqlite::abandonAlarm mark scheduledTime as [[maybe_unused]]. If the user sets a new alarm between the last failed retry and the abandonAlarm call, and that new alarm has already flushed to CLEAN status, abandonAlarm will incorrectly clear it. The time should be compared before clearing.

  2. [MEDIUM] AlarmScheduler (local workerd serve path) does not call abandonAlarm -- AlarmScheduler::makeAlarmTask in alarm-scheduler.c++ calls deleteAlarm when countedRetry >= RETRY_MAX_TRIES (line 239-240) but never calls abandonAlarm on the actor. The same stale-cache bug this PR fixes for production will still occur in local development. Consider adding an abandonAlarm call before deleteAlarm in that code path.

  3. [LOW] Missing KJ_DEFER for autogate cleanup in "without fix" tests -- The tests that demonstrate the bug call deinitAutogate() at the top but don't restore state on exit with KJ_DEFER, unlike the other tests.


These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.

@ask-bonk
Copy link
Contributor

ask-bonk bot commented Mar 24, 2026

Review posted successfully on PR #6396. Here's a summary of the findings:

  1. [HIGH] Race condition in both ActorCache::abandonAlarm and ActorSqlite::abandonAlarm -- The scheduledTime parameter is marked [[maybe_unused]] and never compared against the current alarm time. If a user sets a new alarm between the last failed retry and the abandonAlarm call, and that alarm has flushed to CLEAN status, it will be incorrectly cleared. Suggested fix: compare the cached time against scheduledTime before clearing.

  2. [MEDIUM] AlarmScheduler not updated -- The local workerd serve path in alarm-scheduler.c++ calls deleteAlarm when retries are exhausted but never calls abandonAlarm on the actor, so the same stale-cache bug will persist in local dev.

  3. [LOW] Missing KJ_DEFER cleanup -- The "without fix" tests call deinitAutogate() without KJ_DEFER, unlike the other tests that properly clean up.

github run

@jqmmes jqmmes force-pushed the joaquim/abandon-deleted-alarm branch 5 times, most recently from fa8875c to 3d378c5 Compare March 25, 2026 13:49
Copy link
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

All three issues from my previous review are addressed: the race condition is fixed with proper time comparison, AlarmScheduler now calls abandonAlarm, and the autogate gating was removed entirely. The code looks correct and the test coverage is thorough (basic abandon, uncounted-failure preservation, and the race-condition scenario).

One minor nit below.


These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.

@ask-bonk
Copy link
Contributor

ask-bonk bot commented Mar 25, 2026

Review posted on PR #6396. Summary:

  • All three previously flagged issues are resolved in this update
  • The race condition fix is solid: both ActorCache and ActorSqlite now compare scheduledTime before clearing
  • The AlarmScheduler now calls abandonAlarm before deleteAlarm
  • The autogate gating was removed entirely (simpler, cleaner)
  • Tests cover the three key scenarios well
  • One minor nit: a stale capnp comment referencing the removed autogate

github run

@jqmmes jqmmes marked this pull request as ready for review March 25, 2026 15:12
@jqmmes jqmmes requested review from a team as code owners March 25, 2026 15:12
@jqmmes jqmmes force-pushed the joaquim/abandon-deleted-alarm branch 2 times, most recently from 2a5aac6 to dc28564 Compare March 25, 2026 16:32
@ask-bonk
Copy link
Contributor

ask-bonk bot commented Mar 25, 2026

There's a double blank line (799, 800) before the trailing comment. That's a minor formatting nit, not actionable.

Let me do a final comprehensive review. The code is clean, the race condition is handled correctly, tests are thorough, and the integration points are all wired up. The only thing I'm checking is whether the previous review's stale comment was actually addressed:

The previous review (second round) mentioned: "One minor nit: a stale capnp comment referencing the removed autogate". Looking at the current capnp, lines 795-798, there's no autogate reference. So either it was fixed, or the previous reviewer was wrong. Either way, it's clean now.

The code looks good. No actionable issues found.

LGTM

github run

@jqmmes jqmmes force-pushed the joaquim/abandon-deleted-alarm branch 2 times, most recently from 1b6a320 to 3a39b40 Compare March 26, 2026 17:12
Copy link
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

The core logic in ActorCache and ActorSqlite looks correct: the race condition is handled, the time comparison is solid, tests are thorough. However, the AlarmScheduler integration has two issues that prevent it from working in the local workerd serve path.

  1. [HIGH] Compilation failure in alarm-scheduler.c++:233namespaces and actorRef.uniqueKey don't exist. See inline comment.
  2. [MEDIUM] PromisedWorkerInterface and LazyWorkerInterface don't forward abandonAlarm — Even after fixing #1, the getActor lambda in server.c++:2254 wraps the returned WorkerInterface in newPromisedWorkerInterface() (defined at worker-interface.c++:19). That class overrides request, connect, prewarm, runScheduled, runAlarm, and customEvent — but not abandonAlarm. Since WorkerInterface::abandonAlarm has a default no-op return, the call silently succeeds without ever reaching WorkerEntrypoint::abandonAlarm. LazyWorkerInterface (worker-interface.h:189) has the same gap. Both need an abandonAlarm override following the same pattern as runAlarm.

These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.

@ask-bonk
Copy link
Contributor

ask-bonk bot commented Mar 26, 2026

Review posted successfully on PR #6396. Here's a summary of the findings:

  1. [HIGH] Compilation failure in alarm-scheduler.c++:233 — The new code references namespaces.find(actorRef.uniqueKey), but AlarmScheduler has no namespaces member and ActorKey has no uniqueKey field. This code will not compile. The fix is to use getActor(kj::str(actorRef.actorId)) directly, matching the existing runAlarm pattern at line 153.

  2. [MEDIUM] PromisedWorkerInterface and LazyWorkerInterface don't forward abandonAlarm — Even after fixing Rename BUILD -> BUILD.bazel #1, the getActor lambda in server.c++ wraps results in newPromisedWorkerInterface(), which doesn't override abandonAlarm. Since WorkerInterface::abandonAlarm has a default no-op, calls through these wrappers silently do nothing, never reaching WorkerEntrypoint::abandonAlarm. Both wrapper classes need an abandonAlarm override.

github run

@github-actions
Copy link

github-actions bot commented Mar 26, 2026

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

@jqmmes jqmmes force-pushed the joaquim/abandon-deleted-alarm branch from 3a39b40 to 289db92 Compare March 26, 2026 17:39
Copy link
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

All issues from prior reviews are resolved: the race condition is handled with time comparison, PromisedWorkerInterface and LazyWorkerInterface forward abandonAlarm, and the AlarmScheduler properly calls abandonAlarm before deleteAlarm. Tests cover the three key scenarios (basic abandon, uncounted-failure preservation, and the race-condition guard). The code is clean.

One low-severity gap:

  1. [LOW] RequestObserverWithTracer does not forward abandonAlarmRequestObserverWithTracer in server.c++:1580 wraps WorkerEntrypoint and overrides request, connect, prewarm, runScheduled, runAlarm, test, and customEvent — but not abandonAlarm. When tracing is active in workerd serve, the wrapper's wrapWorkerInterface returns itself (line 1600), and the AlarmScheduler calls abandonAlarm on this wrapper, which falls through to the base-class no-op instead of reaching WorkerEntrypoint::abandonAlarm. The fix doesn't take effect in this narrow scenario (local dev with tail workers configured). The fallback behavior is the same as before this PR, so impact is minimal.

    Suggested fix — add to RequestObserverWithTracer (after line 1693 in server.c++):

    kj::Promise<void> abandonAlarm(kj::Date scheduledTime) override {
      return KJ_ASSERT_NONNULL(inner).abandonAlarm(scheduledTime);
    }

These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.

@ask-bonk
Copy link
Contributor

ask-bonk bot commented Mar 26, 2026

Review posted successfully on PR #6396. Here's a summary of the findings:

  • All previously flagged issues (race condition, PromisedWorkerInterface/LazyWorkerInterface forwarding, AlarmScheduler integration, compilation error) are resolved in this update
  • The core logic in ActorCache::abandonAlarm and ActorSqlite::abandonAlarm is correct: time comparison guards against the race, status checks protect DIRTY/FLUSHING states, and DeferredAlarmDelete is naturally excluded by tryGet<KnownAlarmTime>()
  • Tests are thorough, covering the three key scenarios
  • One [LOW] issue: RequestObserverWithTracer in server.c++ doesn't forward abandonAlarm, so the fix won't take effect in the narrow case of workerd serve with tracing configured. Suggested a one-line fix.

github run

@jqmmes jqmmes force-pushed the joaquim/abandon-deleted-alarm branch from 289db92 to 318ab83 Compare March 26, 2026 18:21
jqmmes added 5 commits March 26, 2026 18:29
This method should be called when abandoning an alarm, without having it successfuly run
Abandons alarms when the maximum amount of retries is reached
Abandons sqlite alarms that hit the maximum retries
@jqmmes jqmmes force-pushed the joaquim/abandon-deleted-alarm branch from 318ab83 to 91014d2 Compare March 26, 2026 18:29
@ask-bonk
Copy link
Contributor

ask-bonk bot commented Mar 26, 2026

I now have a thorough understanding of the PR. Let me review the key points:

  1. Core logic is sound: Both ActorCache::abandonAlarm and ActorSqlite::abandonAlarm properly guard against the race condition with time comparison.

  2. All wrapper classes forward properly: PromisedWorkerInterface, LazyWorkerInterface, RpcWorkerInterface, and RequestObserverWithTracer all override abandonAlarm.

  3. Cap'n Proto schema: New @11 ordinal is correct (no conflicts with existing ordinals).

  4. Tests are thorough: Three test scenarios each for both ActorCache and ActorSqlite (basic abandon, uncounted failures, race condition).

  5. Build system changes are correct: Autogate dep moved from :actor to :actor-sqlite-test where it's still needed by pre-existing tests.

  6. WorkerEntrypoint::abandonAlarm: Uses a non-destructive read of incomingRequest (KJ_IF_SOME without kj::mv), which is correct since this method is a pure state cleanup that doesn't need to take ownership. Each call from the alarm scheduler creates a fresh WorkerInterface.

The code looks clean and all previously flagged issues from earlier reviews have been resolved. I don't see any actionable issues.

LGTM

github run

kj::Promise<void> ActorSqlite::abandonAlarm(kj::Date scheduledTime) {
// Called when AlarmManager has given up retrying an alarm after too many counted failures.
// Clear the alarm from SQLite so getAlarm() returns null instead of a stale time.
// Only clear if SQLite currently has the exact alarm being abandoned and we're not mid-handler.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much have we thought through how this interacts with the general split-brain nature of alarms stored in sqlite vs in our AlarmManager system?

Currently, the invariant that we attempt to maintain is that the scheduledTime in the sqlite DB is >= the scheduled time in the backend AlarmManager, such that we're always guaranteed to be woken up before the scheduled time in sqlite.

But the fact that the two can get out of sync makes this look very fishy, since what's stopping a scenario where the time being abandoned is earlier than our time in sqlite, so we don't clear the time in sqlite, and then we're left with a time in sqlite but no time in the upstream AlarmManager (and thus we'd still potentially be telling callers of getAlarm() that an alarm is set when an alarm will never actually be invoked).

I think it means this is an incomplete fix that will still allow some DOs to get stuck in the state that we're attempting to fix with this change. But I'll dig in a bit more to try to confirm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, opus 4.6 agrees this is a problem: https://share.opencode.cloudflare.dev/share/ad9z88O5

Its analysis (in the second message) looks correct to me. This is a real problem, at least in the case where sqlite's persisted alarm time is in the past. But it didn't give a great idea for a fix. Its proposal to clear the alarm if sqlite's scheduledTime is less than the current time is pretty good (maybe good enough?), although still not perfect.

Alternatively, we could try returning sqlite's scheduled time back in the response to the abandonAlarm RPC such that AlarmManager can update its stored scheduled time if appropriate (i.e. if it hadn't already been updated via a separate concurrent call from setAlarm). A bit more context about that idea is discussed in the fourth message of that chat session.

What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants