Skip to content

adapter: fix statement-logging double-end panics on concurrent DROP CLUSTER#36943

Draft
ggevay wants to merge 2 commits into
MaterializeInc:mainfrom
ggevay:statement-logging-double-end-fix
Draft

adapter: fix statement-logging double-end panics on concurrent DROP CLUSTER#36943
ggevay wants to merge 2 commits into
MaterializeInc:mainfrom
ggevay:statement-logging-double-end-fix

Conversation

@ggevay

@ggevay ggevay commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Motivation

A frontend-sequenced peek whose cluster is dropped concurrently could log its
statement-execution end twice, panicking the coordinator at
end_statement_execution and aborting environmentd. Two distinct races hit the
identical panic — one fast-path, one slow-path — both fixed here.

This seems to have happened in #incident-1070.

Description

Two commits, one per path — reviewable independently, apart from the two tests'
cross-references to each other.

A frontend peek must be the sole end-logger unless the coordinator already has a
genuine end of its own. The fast and slow paths break this differently — in one
the coordinator's extra end is real, in the other it is spurious — so each is
fixed differently:

  • Fast path: DROP CLUSTER in the RegisterFrontendPeekclient.peek()
    window makes the coordinator retire the peek with a real end (Canceled)
    while the frontend also logs Errored. Fix: UnregisterFrontendPeek reports
    whether the peek was still registered; if already retired, the frontend skips
    logging (end_already_logged).
  • Slow path: implement_peek_plan errors before taking its
    ExecuteContextGuard, so the guard's Drop emits a spurious Aborted on
    top of the frontend's Errored. Fix: defuse the guard on error.

Also documents implement_peek_plan's ctx_extra ownership contract.

Verification

Two new test/cluster workflows that fail on any environmentd abort: a
slow-path stress test, and a deterministic fast-path test using a new
peek_after_register_before_issue failpoint. Both pass with the fixes;
reverting the fast-path fix makes its workflow fail as expected.

Nightly: https://buildkite.com/materialize/nightly/builds/16727

🤖 Generated with Claude Code

@ggevay ggevay added the A-ADAPTER Topics related to the ADAPTER layer label Jun 9, 2026
ggevay and others added 2 commits June 9, 2026 19:12
A frontend-sequenced slow-path peek (`ExecuteSlowPathPeek`) that fails
before issuing — e.g. a concurrent `DROP CLUSTER` — returns from
`implement_peek_plan` without taking its `ExecuteContextGuard`. The
guard's `Drop` then emits a spurious `Aborted` end on top of the
`Errored` end the frontend logs, double-ending the statement and
panicking the coordinator in `end_statement_execution`.

Defuse the guard on error so the frontend is the sole end-logger, and
document `implement_peek_plan`'s `ctx_extra` ownership contract.

Add a `test/cluster` regression workflow,
`test-drop-cluster-during-registered-peeks`, that races slow-path peeks
against DROP/CREATE of their cluster.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A frontend-sequenced fast-path peek registers with the coordinator
(`RegisterFrontendPeek`) before issuing `client.peek()`. If a
`DROP CLUSTER` lands in that window, the coordinator retires the pending
peek and logs its end (`Canceled`), while the frontend logs the end again
(`Errored`) once `client.peek()` fails — double-ending the statement and
panicking the coordinator in `end_statement_execution`.

Make `UnregisterFrontendPeek` report whether the peek was still
registered; when a concurrent teardown already retired it, the frontend
skips logging the end (`end_already_logged`).

Add a deterministic `test/cluster` regression workflow,
`test-drop-cluster-during-registered-peeks-fast-path`, using the
`peek_after_register_before_issue` failpoint.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ggevay ggevay force-pushed the statement-logging-double-end-fix branch from f6852e2 to 9425d6a Compare June 9, 2026 17:17
@ggevay ggevay marked this pull request as ready for review June 9, 2026 17:21
@ggevay ggevay requested a review from a team as a code owner June 9, 2026 17:21
@ggevay ggevay requested a review from aljoscha June 9, 2026 17:21
@ggevay ggevay marked this pull request as draft June 10, 2026 14:13
ggevay added a commit that referenced this pull request Jun 11, 2026
Ending the same statement execution twice indicates a bug, but panicking
on it aborts the whole environmentd process, turning a statement-logging
bookkeeping error into an outage. Report it with soft_panic_or_log
instead and keep the first end: still a hard panic in CI and debug
builds, an error log in production.

Mitigates
[#incident-1070](https://materializeinc.slack.com/archives/C0B8QU8FGCR/p1781095240138559?thread_ts=1780989614.408819&cid=C0B8QU8FGCR)
before the proper fixes land
(#36943 and/or
#36956).

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-ADAPTER Topics related to the ADAPTER layer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant