Abandon alarms that are deleted due to too many retries by jqmmes · Pull Request #6396 · cloudflare/workerd

jqmmes · 2026-03-24T11:54:16Z

When we deleted an alarm because it had retried too many times, we were not removing it from actor-cache or actor-sqlite.
This PR adds a new RPC call, abandonAlarm, which can be used to clear the actor-cache and actor-sqlite state when we delete an alarm that never ran successfully.

ask-bonk

I'm Bonk, and I've done a quick review of your PR.

This PR adds an abandonAlarm RPC call and implementations in ActorCache/ActorSqlite to clear stale alarm state after AlarmManager gives up retrying, gated behind an autogate.

Issues found (ranked by severity):

[HIGH] Race condition: scheduledTime is accepted but never checked -- Both ActorCache::abandonAlarm and ActorSqlite::abandonAlarm mark scheduledTime as [[maybe_unused]]. If the user sets a new alarm between the last failed retry and the abandonAlarm call, and that new alarm has already flushed to CLEAN status, abandonAlarm will incorrectly clear it. The time should be compared before clearing.
[MEDIUM] AlarmScheduler (local workerd serve path) does not call abandonAlarm -- AlarmScheduler::makeAlarmTask in alarm-scheduler.c++ calls deleteAlarm when countedRetry >= RETRY_MAX_TRIES (lines 239-240) but never calls abandonAlarm on the actor. The same stale-cache bug this PR fixes for production will still occur in local development. Consider adding an abandonAlarm call before deleteAlarm in that code path.
[LOW] Missing KJ_DEFER for autogate cleanup in "without fix" tests -- The tests that demonstrate the bug call deinitAutogate() at the top but don't restore state on exit with KJ_DEFER, unlike the other tests.
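
The time comparison suggested in the [HIGH] item can be sketched in isolation. This is a minimal, self-contained model and not workerd's code: `AlarmSlot`, the `Status` enum, and the `int64_t` millisecond timestamps are hypothetical stand-ins for the real ActorCache state and `kj::Date`. It only illustrates the idea of clearing the cached alarm when it is CLEAN and equal to the time being abandoned.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the cached alarm state; not workerd's actual types.
enum class Status { CLEAN, DIRTY, FLUSHING };

struct AlarmSlot {
  std::optional<int64_t> time;     // scheduled time (ms since epoch), if any
  Status status = Status::CLEAN;   // whether a write is pending for this slot

  // Clears the alarm only when it is the exact alarm being abandoned and no
  // newer write is pending. Returns true if the slot was cleared.
  bool abandonAlarm(int64_t scheduledTime) {
    if (status != Status::CLEAN) return false;                      // newer write in flight
    if (!time.has_value() || *time != scheduledTime) return false;  // a different alarm
    time.reset();
    return true;
  }
};
```

With this guard, a new alarm set by the user after the last failed retry survives the abandon call, because its time differs from the abandoned `scheduledTime`.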

These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.

src/workerd/io/actor-cache.c++

src/workerd/io/actor-sqlite.c++

src/workerd/io/actor-cache-test.c++

src/workerd/io/actor-sqlite-test.c++

ask-bonk · 2026-03-24T11:59:48Z

Review posted successfully on PR #6396. Here's a summary of the findings:

[HIGH] Race condition in both ActorCache::abandonAlarm and ActorSqlite::abandonAlarm -- The scheduledTime parameter is marked [[maybe_unused]] and never compared against the current alarm time. If a user sets a new alarm between the last failed retry and the abandonAlarm call, and that alarm has flushed to CLEAN status, it will be incorrectly cleared. Suggested fix: compare the cached time against scheduledTime before clearing.
[MEDIUM] AlarmScheduler not updated -- The local workerd serve path in alarm-scheduler.c++ calls deleteAlarm when retries are exhausted but never calls abandonAlarm on the actor, so the same stale-cache bug will persist in local dev.
[LOW] Missing KJ_DEFER cleanup -- The "without fix" tests call deinitAutogate() without KJ_DEFER, unlike the other tests that properly clean up.

ask-bonk

I'm Bonk, and I've done a quick review of your PR.

All three issues from my previous review are addressed: the race condition is fixed with proper time comparison, AlarmScheduler now calls abandonAlarm, and the autogate gating was removed entirely. The code looks correct and the test coverage is thorough (basic abandon, uncounted-failure preservation, and the race-condition scenario).

One minor nit below.

These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.

src/workerd/io/worker-interface.capnp

ask-bonk · 2026-03-25T14:56:56Z

Review posted on PR #6396. Summary:

All three previously flagged issues are resolved in this update
The race condition fix is solid: both ActorCache and ActorSqlite now compare scheduledTime before clearing
The AlarmScheduler now calls abandonAlarm before deleteAlarm
The autogate gating was removed entirely (simpler, cleaner)
Tests cover the three key scenarios well
One minor nit: a stale capnp comment referencing the removed autogate

ask-bonk · 2026-03-25T16:51:15Z

There's a double blank line (799, 800) before the trailing comment. That's a minor formatting nit, not actionable.

Let me do a final comprehensive review. The code is clean, the race condition is handled correctly, tests are thorough, and the integration points are all wired up. The only thing I'm checking is whether the previous review's stale comment was actually addressed:

The previous review (second round) mentioned: "One minor nit: a stale capnp comment referencing the removed autogate". Looking at the current capnp, lines 795-798, there's no autogate reference. So either it was fixed, or the previous reviewer was wrong. Either way, it's clean now.

The code looks good. No actionable issues found.

LGTM

ask-bonk

I'm Bonk, and I've done a quick review of your PR.

The core logic in ActorCache and ActorSqlite looks correct: the race condition is handled, the time comparison is solid, tests are thorough. However, the AlarmScheduler integration has two issues that prevent it from working in the local workerd serve path.

[HIGH] Compilation failure in alarm-scheduler.c++:233 — namespaces and actorRef.uniqueKey don't exist. See inline comment.
[MEDIUM] PromisedWorkerInterface and LazyWorkerInterface don't forward abandonAlarm — Even after fixing #1, the getActor lambda in server.c++:2254 wraps the returned WorkerInterface in newPromisedWorkerInterface() (defined at worker-interface.c++:19). That class overrides request, connect, prewarm, runScheduled, runAlarm, and customEvent — but not abandonAlarm. Since WorkerInterface::abandonAlarm has a default no-op return, the call silently succeeds without ever reaching WorkerEntrypoint::abandonAlarm. LazyWorkerInterface (worker-interface.h:189) has the same gap. Both need an abandonAlarm override following the same pattern as runAlarm.
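
The forwarding gap described in the [MEDIUM] item can be reproduced with a small model. This sketch assumes nothing about workerd beyond what the review states (a base-class default no-op and wrappers that override some methods); the class names `Entrypoint`, `NonForwardingWrapper`, and `ForwardingWrapper` are hypothetical, and the methods are synchronous here rather than returning `kj::Promise<void>`.

```cpp
#include <cstdint>

// Simplified model of an interface whose abandonAlarm has a default no-op.
struct WorkerInterface {
  virtual ~WorkerInterface() = default;
  virtual void runAlarm(int64_t scheduledTime) = 0;
  virtual void abandonAlarm(int64_t /*scheduledTime*/) {}  // default: do nothing
};

// Hypothetical real implementation that counts the calls it receives.
struct Entrypoint final : WorkerInterface {
  int runCalls = 0;
  int abandonCalls = 0;
  void runAlarm(int64_t) override { ++runCalls; }
  void abandonAlarm(int64_t) override { ++abandonCalls; }
};

// Wrapper that forwards runAlarm but forgets abandonAlarm: the call compiles,
// "succeeds", and silently stops at the base-class no-op.
struct NonForwardingWrapper final : WorkerInterface {
  explicit NonForwardingWrapper(WorkerInterface& inner) : inner(inner) {}
  void runAlarm(int64_t t) override { inner.runAlarm(t); }
  WorkerInterface& inner;
};

// Wrapper that forwards both methods, following the same pattern as runAlarm.
struct ForwardingWrapper final : WorkerInterface {
  explicit ForwardingWrapper(WorkerInterface& inner) : inner(inner) {}
  void runAlarm(int64_t t) override { inner.runAlarm(t); }
  void abandonAlarm(int64_t t) override { inner.abandonAlarm(t); }
  WorkerInterface& inner;
};
```

Because the default is a no-op rather than a pure virtual method, the compiler gives no hint that a wrapper is missing the override, which is why each wrapper class has to be checked by hand.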

These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.

src/workerd/server/alarm-scheduler.c++

ask-bonk · 2026-03-26T17:22:43Z

Review posted successfully on PR #6396. Here's a summary of the findings:

[HIGH] Compilation failure in alarm-scheduler.c++:233 — The new code references namespaces.find(actorRef.uniqueKey), but AlarmScheduler has no namespaces member and ActorKey has no uniqueKey field. This code will not compile. The fix is to use getActor(kj::str(actorRef.actorId)) directly, matching the existing runAlarm pattern at line 153.
[MEDIUM] PromisedWorkerInterface and LazyWorkerInterface don't forward abandonAlarm -- Even after fixing #1, the getActor lambda in server.c++ wraps results in newPromisedWorkerInterface(), which doesn't override abandonAlarm. Since WorkerInterface::abandonAlarm has a default no-op, calls through these wrappers silently do nothing, never reaching WorkerEntrypoint::abandonAlarm. Both wrapper classes need an abandonAlarm override.

github-actions · 2026-03-26T17:24:48Z

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

ask-bonk

I'm Bonk, and I've done a quick review of your PR.

All issues from prior reviews are resolved: the race condition is handled with time comparison, PromisedWorkerInterface and LazyWorkerInterface forward abandonAlarm, and the AlarmScheduler properly calls abandonAlarm before deleteAlarm. Tests cover the three key scenarios (basic abandon, uncounted-failure preservation, and the race-condition guard). The code is clean.

One low-severity gap:

[LOW] RequestObserverWithTracer does not forward abandonAlarm — RequestObserverWithTracer in server.c++:1580 wraps WorkerEntrypoint and overrides request, connect, prewarm, runScheduled, runAlarm, test, and customEvent — but not abandonAlarm. When tracing is active in workerd serve, the wrapper's wrapWorkerInterface returns itself (line 1600), and the AlarmScheduler calls abandonAlarm on this wrapper, which falls through to the base-class no-op instead of reaching WorkerEntrypoint::abandonAlarm. The fix doesn't take effect in this narrow scenario (local dev with tail workers configured). The fallback behavior is the same as before this PR, so impact is minimal.

Suggested fix — add to RequestObserverWithTracer (after line 1693 in server.c++):
```
kj::Promise<void> abandonAlarm(kj::Date scheduledTime) override {
  return KJ_ASSERT_NONNULL(inner).abandonAlarm(scheduledTime);
}
```

These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.

ask-bonk · 2026-03-26T17:46:54Z

Review posted successfully on PR #6396. Here's a summary of the findings:

All previously flagged issues (race condition, PromisedWorkerInterface/LazyWorkerInterface forwarding, AlarmScheduler integration, compilation error) are resolved in this update
The core logic in ActorCache::abandonAlarm and ActorSqlite::abandonAlarm is correct: time comparison guards against the race, status checks protect DIRTY/FLUSHING states, and DeferredAlarmDelete is naturally excluded by tryGet<KnownAlarmTime>()
Tests are thorough, covering the three key scenarios
One [LOW] issue: RequestObserverWithTracer in server.c++ doesn't forward abandonAlarm, so the fix won't take effect in the narrow case of workerd serve with tracing configured. Suggested a one-line fix.

ask-bonk · 2026-03-26T18:33:12Z

I now have a thorough understanding of the PR. Let me review the key points:

Core logic is sound: Both ActorCache::abandonAlarm and ActorSqlite::abandonAlarm properly guard against the race condition with time comparison.
All wrapper classes forward properly: PromisedWorkerInterface, LazyWorkerInterface, RpcWorkerInterface, and RequestObserverWithTracer all override abandonAlarm.
Cap'n Proto schema: New @11 ordinal is correct (no conflicts with existing ordinals).
Tests are thorough: Three test scenarios each for both ActorCache and ActorSqlite (basic abandon, uncounted failures, race condition).
Build system changes are correct: Autogate dep moved from :actor to :actor-sqlite-test where it's still needed by pre-existing tests.
WorkerEntrypoint::abandonAlarm: Uses a non-destructive read of incomingRequest (KJ_IF_SOME without kj::mv), which is correct since this method is a pure state cleanup that doesn't need to take ownership. Each call from the alarm scheduler creates a fresh WorkerInterface.

The code looks clean and all previously flagged issues from earlier reviews have been resolved. I don't see any actionable issues.

LGTM

a-robinson · 2026-03-26T20:43:25Z

src/workerd/io/actor-sqlite.c++

+kj::Promise<void> ActorSqlite::abandonAlarm(kj::Date scheduledTime) {
+  // Called when AlarmManager has given up retrying an alarm after too many counted failures.
+  // Clear the alarm from SQLite so getAlarm() returns null instead of a stale time.
+  // Only clear if SQLite currently has the exact alarm being abandoned and we're not mid-handler.


How much have we thought through how this interacts with the general split-brain nature of alarms stored in sqlite vs in our AlarmManager system?

Currently, the invariant that we attempt to maintain is that the scheduledTime in the sqlite DB is >= the scheduled time in the backend AlarmManager, such that we're always guaranteed to be woken up before the scheduled time in sqlite.

But the fact that the two can get out of sync makes this look very fishy. What's stopping a scenario where the time being abandoned is earlier than our time in sqlite, so we don't clear the time in sqlite, and we're left with a time in sqlite but no time in the upstream AlarmManager? In that case we'd still potentially be telling callers of getAlarm() that an alarm is set when it will never actually be invoked.

I think it means this is an incomplete fix that will still allow some DOs to get stuck in the state that we're attempting to fix with this change. But I'll dig in a bit more to try to confirm.

Yeah, opus 4.6 agrees this is a problem: https://share.opencode.cloudflare.dev/share/ad9z88O5

Its analysis (in the second message) looks correct to me. This is a real problem, at least in the case where sqlite's persisted alarm time is in the past. But it didn't give a great idea for a fix. Its proposal to clear the alarm if sqlite's scheduledTime is less than the current time is pretty good (maybe good enough?), although still not perfect.

Alternatively, we could try returning sqlite's scheduled time back in the response to the abandonAlarm RPC such that AlarmManager can update its stored scheduled time if appropriate (i.e. if it hadn't already been updated via a separate concurrent call from setAlarm). A bit more context about that idea is discussed in the fourth message of that chat session.

What do you think?
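
The two ideas in this comment can be compared with a small model. This is only a sketch under stated assumptions: `SqliteAlarmState` and both function names are hypothetical, timestamps are plain `int64_t` milliseconds rather than `kj::Date`, and neither function is the PR's actual implementation.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the alarm time persisted in sqlite.
struct SqliteAlarmState {
  std::optional<int64_t> time;
};

// Variant 1: clear on an exact match, or when the stored time is already in
// the past (so nothing upstream could still fire it). Returns true if cleared.
bool abandonIfMatchOrPast(SqliteAlarmState& db, int64_t abandonedTime, int64_t now) {
  if (db.time.has_value() && (*db.time == abandonedTime || *db.time < now)) {
    db.time.reset();
    return true;
  }
  return false;
}

// Variant 2: clear only on an exact match, then report whatever time remains
// so the caller (the alarm manager in this model) can reschedule if needed.
std::optional<int64_t> abandonAndReport(SqliteAlarmState& db, int64_t abandonedTime) {
  if (db.time.has_value() && *db.time == abandonedTime) {
    db.time.reset();
  }
  return db.time;
}
```

In this model, variant 1 still leaves a future stored time untouched when it differs from the abandoned time, which matches the "still not perfect" caveat above. Variant 2 hands that leftover time back to the caller instead of deciding locally.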

jqmmes force-pushed the joaquim/abandon-deleted-alarm branch from 4fa841e to f7af926 on March 24, 2026 11:55

ask-bonk bot reviewed Mar 24, 2026

jqmmes force-pushed the joaquim/abandon-deleted-alarm branch 5 times, most recently from fa8875c to 3d378c5 on March 25, 2026 13:49

ask-bonk bot reviewed Mar 25, 2026

jqmmes marked this pull request as ready for review March 25, 2026 15:12

jqmmes requested review from a team as code owners March 25, 2026 15:12

jqmmes force-pushed the joaquim/abandon-deleted-alarm branch 2 times, most recently from 2a5aac6 to dc28564 on March 25, 2026 16:32

jqmmes force-pushed the joaquim/abandon-deleted-alarm branch 2 times, most recently from 1b6a320 to 3a39b40 on March 26, 2026 17:12

ask-bonk bot reviewed Mar 26, 2026

jqmmes force-pushed the joaquim/abandon-deleted-alarm branch from 3a39b40 to 289db92 on March 26, 2026 17:39

ask-bonk bot reviewed Mar 26, 2026

jqmmes force-pushed the joaquim/abandon-deleted-alarm branch from 289db92 to 318ab83 on March 26, 2026 18:21

jqmmes added 5 commits March 26, 2026 18:29

Add a new abandonAlarm to WorkerInterface

46ebd01

This method should be called when abandoning an alarm, without having it successfully run

Add abandonAlarm to actor-cache

593c674

Abandons alarms when the maximum amount of retries is reached

Update actor-cache-test with abandonAlarm

6ea8c07

Add abandonAlarm to actor-sqlite

495fce0

Abandons sqlite alarms that hit the maximum retries

Update actor-sqlite-test with abandonAlarm

bd4af29

jqmmes added 2 commits March 26, 2026 18:29

Add abandonAlarm to alarm-scheduler

58a4017

Drive-by remove unused autogate imports

91014d2

jqmmes force-pushed the joaquim/abandon-deleted-alarm branch from 318ab83 to 91014d2 on March 26, 2026 18:29

jasnell approved these changes Mar 26, 2026

a-robinson reviewed Mar 26, 2026
