Abandon alarms that are deleted due too many retries#6396
Abandon alarms that are deleted due too many retries#6396
Conversation
4fa841e to
f7af926
Compare
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR.
This PR adds an abandonAlarm RPC call and implementations in ActorCache/ActorSqlite to clear stale alarm state after AlarmManager gives up retrying, gated behind an autogate.
Issues found (ranked by severity):
-
[HIGH] Race condition:
scheduledTimeis accepted but never checked -- BothActorCache::abandonAlarmandActorSqlite::abandonAlarmmarkscheduledTimeas[[maybe_unused]]. If the user sets a new alarm between the last failed retry and theabandonAlarmcall, and that new alarm has already flushed to CLEAN status,abandonAlarmwill incorrectly clear it. The time should be compared before clearing. -
[MEDIUM]
AlarmScheduler(localworkerd servepath) does not callabandonAlarm--AlarmScheduler::makeAlarmTaskinalarm-scheduler.c++callsdeleteAlarmwhencountedRetry >= RETRY_MAX_TRIES(line 239-240) but never callsabandonAlarmon the actor. The same stale-cache bug this PR fixes for production will still occur in local development. Consider adding anabandonAlarmcall beforedeleteAlarmin that code path. -
[LOW] Missing
KJ_DEFERfor autogate cleanup in "without fix" tests -- The tests that demonstrate the bug calldeinitAutogate()at the top but don't restore state on exit withKJ_DEFER, unlike the other tests.
These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.
|
Review posted successfully on PR #6396. Here's a summary of the findings:
|
fa8875c to
3d378c5
Compare
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR.
All three issues from my previous review are addressed: the race condition is fixed with proper time comparison, AlarmScheduler now calls abandonAlarm, and the autogate gating was removed entirely. The code looks correct and the test coverage is thorough (basic abandon, uncounted-failure preservation, and the race-condition scenario).
One minor nit below.
These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.
|
Review posted on PR #6396. Summary:
|
2a5aac6 to
dc28564
Compare
|
There's a double blank line (799, 800) before the trailing comment. That's a minor formatting nit, not actionable. Let me do a final comprehensive review. The code is clean, the race condition is handled correctly, tests are thorough, and the integration points are all wired up. The only thing I'm checking is whether the previous review's stale comment was actually addressed: The previous review (second round) mentioned: "One minor nit: a stale capnp comment referencing the removed autogate". Looking at the current capnp, lines 795-798, there's no autogate reference. So either it was fixed, or the previous reviewer was wrong. Either way, it's clean now. The code looks good. No actionable issues found. LGTM |
1b6a320 to
3a39b40
Compare
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR.
The core logic in ActorCache and ActorSqlite looks correct: the race condition is handled, the time comparison is solid, tests are thorough. However, the AlarmScheduler integration has two issues that prevent it from working in the local workerd serve path.
- [HIGH] Compilation failure in
alarm-scheduler.c++:233—namespacesandactorRef.uniqueKeydon't exist. See inline comment. - [MEDIUM]
PromisedWorkerInterfaceandLazyWorkerInterfacedon't forwardabandonAlarm— Even after fixing #1, thegetActorlambda inserver.c++:2254wraps the returnedWorkerInterfaceinnewPromisedWorkerInterface()(defined atworker-interface.c++:19). That class overridesrequest,connect,prewarm,runScheduled,runAlarm, andcustomEvent— but notabandonAlarm. SinceWorkerInterface::abandonAlarmhas a default no-op return, the call silently succeeds without ever reachingWorkerEntrypoint::abandonAlarm.LazyWorkerInterface(worker-interface.h:189) has the same gap. Both need anabandonAlarmoverride following the same pattern asrunAlarm.
These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.
|
Review posted successfully on PR #6396. Here's a summary of the findings:
|
|
The generated output of |
3a39b40 to
289db92
Compare
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR.
All issues from prior reviews are resolved: the race condition is handled with time comparison, PromisedWorkerInterface and LazyWorkerInterface forward abandonAlarm, and the AlarmScheduler properly calls abandonAlarm before deleteAlarm. Tests cover the three key scenarios (basic abandon, uncounted-failure preservation, and the race-condition guard). The code is clean.
One low-severity gap:
-
[LOW]
RequestObserverWithTracerdoes not forwardabandonAlarm—RequestObserverWithTracerinserver.c++:1580wrapsWorkerEntrypointand overridesrequest,connect,prewarm,runScheduled,runAlarm,test, andcustomEvent— but notabandonAlarm. When tracing is active inworkerd serve, the wrapper'swrapWorkerInterfacereturns itself (line 1600), and theAlarmSchedulercallsabandonAlarmon this wrapper, which falls through to the base-class no-op instead of reachingWorkerEntrypoint::abandonAlarm. The fix doesn't take effect in this narrow scenario (local dev with tail workers configured). The fallback behavior is the same as before this PR, so impact is minimal.Suggested fix — add to
RequestObserverWithTracer(after line 1693 inserver.c++):kj::Promise<void> abandonAlarm(kj::Date scheduledTime) override { return KJ_ASSERT_NONNULL(inner).abandonAlarm(scheduledTime); }
These comments were generated by an AI code review assistant and may not be perfect. Please verify the suggestions before applying them.
|
Review posted successfully on PR #6396. Here's a summary of the findings:
|
289db92 to
318ab83
Compare
This method should be called when abandoning an alarm, without having it successfuly run
Abandons alarms when the maximum amount of retries is reached
Abandons sqlite alarms that hit the maximum retries
318ab83 to
91014d2
Compare
|
I now have a thorough understanding of the PR. Let me review the key points:
The code looks clean and all previously flagged issues from earlier reviews have been resolved. I don't see any actionable issues. LGTM |
| kj::Promise<void> ActorSqlite::abandonAlarm(kj::Date scheduledTime) { | ||
| // Called when AlarmManager has given up retrying an alarm after too many counted failures. | ||
| // Clear the alarm from SQLite so getAlarm() returns null instead of a stale time. | ||
| // Only clear if SQLite currently has the exact alarm being abandoned and we're not mid-handler. |
There was a problem hiding this comment.
How much have we thought through how this interacts with the general split-brain nature of alarms stored in sqlite vs in our AlarmManager system?
Currently, the invariant that we attempt to maintain is that the scheduledTime in the sqlite DB is >= the scheduled time in the backend AlarmManager, such that we're always guaranteed to be woken up before the scheduled time in sqlite.
But the fact that the two can get out of sync makes this look very fishy, since what's stopping a scenario where the time being abandoned is earlier than our time in sqlite, so we don't clear the time in sqlite, and then we're left with a time in sqlite but no time in the upstream AlarmManager (and thus we'd still potentially be telling callers of getAlarm() that an alarm is set when an alarm will never actually be invoked).
I think it means this is an incomplete fix that will still allow some DOs to get stuck in the state that we're attempting to fix with this change. But I'll dig in a bit more to try to confirm.
There was a problem hiding this comment.
Yeah, opus 4.6 agrees this is a problem: https://share.opencode.cloudflare.dev/share/ad9z88O5
Its analysis (in the second message) looks correct to me. This is a real problem, at least in the case where sqlite's persisted alarm time is in the past. But it didn't give a great idea for a fix. Its proposal to clear the alarm if sqlite's scheduledTime is less than the current time is pretty good (maybe good enough?), although still not perfect.
Alternatively, we could try returning sqlite's scheduled time back in the response to the abandonAlarm RPC such that AlarmManager can update its stored scheduled time if appropriate (i.e. if it hadn't already been updated via a separate concurrent call from setAlarm). A bit more context about that idea is discussed in the fourth message of that chat session.
What do you think?
When we delete an alarm because it retried too many times, we were not removing it from actor-cache or actor-sqlite.
This PR adds a new rpc call to abandonAlarm. This can be used to clear the actor-cache and actor-sqlite state when we delete an alarm that never successfully ran.