adapter: fix statement-logging double-end panics on concurrent DROP CLUSTER#36943
Draft
ggevay wants to merge 2 commits into
Draft
adapter: fix statement-logging double-end panics on concurrent DROP CLUSTER#36943ggevay wants to merge 2 commits into
ggevay wants to merge 2 commits into
Conversation
A frontend-sequenced slow-path peek (`ExecuteSlowPathPeek`) that fails before issuing — e.g. a concurrent `DROP CLUSTER` — returns from `implement_peek_plan` without taking its `ExecuteContextGuard`. The guard's `Drop` then emits a spurious `Aborted` end on top of the `Errored` end the frontend logs, double-ending the statement and panicking the coordinator in `end_statement_execution`. Defuse the guard on error so the frontend is the sole end-logger, and document `implement_peek_plan`'s `ctx_extra` ownership contract. Add a `test/cluster` regression workflow, `test-drop-cluster-during-registered-peeks`, that races slow-path peeks against DROP/CREATE of their cluster. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A frontend-sequenced fast-path peek registers with the coordinator (`RegisterFrontendPeek`) before issuing `client.peek()`. If a `DROP CLUSTER` lands in that window, the coordinator retires the pending peek and logs its end (`Canceled`), while the frontend logs the end again (`Errored`) once `client.peek()` fails — double-ending the statement and panicking the coordinator in `end_statement_execution`. Make `UnregisterFrontendPeek` report whether the peek was still registered; when a concurrent teardown already retired it, the frontend skips logging the end (`end_already_logged`). Add a deterministic `test/cluster` regression workflow, `test-drop-cluster-during-registered-peeks-fast-path`, using the `peek_after_register_before_issue` failpoint. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
f6852e2 to
9425d6a
Compare
ggevay
added a commit
that referenced
this pull request
Jun 11, 2026
Ending the same statement execution twice indicates a bug, but panicking on it aborts the whole environmentd process, turning a statement-logging bookkeeping error into an outage. Report it with soft_panic_or_log instead and keep the first end: still a hard panic in CI and debug builds, an error log in production. Mitigates [#incident-1070](https://materializeinc.slack.com/archives/C0B8QU8FGCR/p1781095240138559?thread_ts=1780989614.408819&cid=C0B8QU8FGCR) before the proper fixes land (#36943 and/or #36956). Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
A frontend-sequenced peek whose cluster is dropped concurrently could log its
statement-execution end twice, panicking the coordinator at
end_statement_executionand aborting environmentd. Two distinct races hit theidentical panic — one fast-path, one slow-path — both fixed here.
This seems to have happened in #incident-1070.
Description
Two commits, one per path — reviewable independently, apart from the two tests'
cross-references to each other.
A frontend peek must be the sole end-logger unless the coordinator already has a
genuine end of its own. The fast and slow paths break this differently — in one
the coordinator's extra end is real, in the other it is spurious — so each is
fixed differently:
DROP CLUSTERin theRegisterFrontendPeek→client.peek()window makes the coordinator retire the peek with a real end (
Canceled)while the frontend also logs
Errored. Fix:UnregisterFrontendPeekreportswhether the peek was still registered; if already retired, the frontend skips
logging (
end_already_logged).implement_peek_planerrors before taking itsExecuteContextGuard, so the guard'sDropemits a spuriousAbortedontop of the frontend's
Errored. Fix: defuse the guard on error.Also documents
implement_peek_plan'sctx_extraownership contract.Verification
Two new
test/clusterworkflows that fail on any environmentd abort: aslow-path stress test, and a deterministic fast-path test using a new
peek_after_register_before_issuefailpoint. Both pass with the fixes;reverting the fast-path fix makes its workflow fail as expected.
Nightly: https://buildkite.com/materialize/nightly/builds/16727
🤖 Generated with Claude Code