db/state: fix Aggregator & ForkableAgg Close vs background MergeLoop WaitGroup race#21528
Merged
Merged
Conversation
BuildFiles2 and buildFilesInBackground spawn the merge goroutine inside the build goroutine, so MergeLoop's wg.Add could run after the build goroutine's wg.Done — a positive Add from a zero counter, concurrent with Close's wg.Wait (sync.WaitGroup reuse, flagged by -race). Register the merge goroutine on wg before spawning it so that Add happens-before the build goroutine's Done and the counter never reaches zero with a merge pending. MergeLoop is split into a self-accounting public wrapper (kept for the external callers) and a private mergeLoop body used by the in-file background sites. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a sync.WaitGroup reuse race between Aggregator.Close and the background merge goroutines spawned by BuildFiles2 / buildFilesInBackground. Previously, wg.Add(1) was performed inside MergeLoop itself (i.e., on the merge goroutine), which could be scheduled after the parent build goroutine's wg.Done() had already dropped the counter to zero, racing Close's wg.Wait().
Changes:
- Move
a.wg.Add(1)to before the merge goroutine is spawned in bothBuildFiles2andbuildFilesInBackground, and call a new privatemergeLoopbody. - Split
MergeLoopinto a public wrapper (preserving the contract for external callers likesnapshots_cmd,kv_temporal,backend) that does its ownwg.Add(1)/Done, and a privatemergeLoopbody used by the in-file background spawn sites. - Add
db/state/aggregator_close_test.goreproducing the race against an empty aggregator over 64 iterations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| db/state/aggregator.go | Pre-register merge goroutines on wg before spawning; split MergeLoop into self-accounting public wrapper + private mergeLoop body. |
| db/state/aggregator_close_test.go | New regression test driving BuildFiles2(..., doMerge=true) and asserting wg.Wait/Close ordering is race-free. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Twin of the Aggregator fix on this branch. ForkableAgg.BuildFilesInBackground spawned its merge goroutine without registering on wg first: the wg.Add(1) lived inside MergeLoop, on the merge goroutine, which can be scheduled after the build goroutine's wg.Done has dropped the counter to zero — an Add-from-zero concurrent with Close's wg.Wait, i.e. sync.WaitGroup reuse. ForkableAgg.Close goes straight to ctxCancel + wg.Wait with no WaitForFiles pre-drain, so it relies on wg.Wait alone to drain in-flight build/merge. Same fix as Aggregator: register the merge goroutine on wg before spawning it, and split MergeLoop into a self-accounting public wrapper plus a private mergeLoop body used by the in-file background spawn site. TDD: TestForkableAggCloseWaitsForBackgroundMerge drives BuildFilesInBackground then wg.Wait over an empty aggregator (one no-op forkable). Red reproduced the exact "WaitGroup is reused before previous Wait has returned" panic plus the -race report; Green is deterministic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AskAlexSharov
approved these changes
May 31, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #21521 — the data race between
Aggregator.Close(wg.Wait) and the backgroundMergeLoopgoroutine spawned byBuildFiles2/buildFilesInBackground. The siblingForkableAggcarried the identical latent race and gets the same fix here.Root cause
Both background-build sites spawn the merge goroutine inside the build goroutine:
MergeLoop'swg.Add(1)runs in the merge goroutine, which can be scheduled after the build goroutine'swg.Done()has already dropped the counter to zero. A positiveAddfrom a zero counter, concurrent withClose'swg.Wait(), issync.WaitGroupreuse — undefined behavior, flagged by-raceand by the runtime's ownpanic: sync: WaitGroup is reused before previous Wait has returned.It is pre-existing (the structure predates #21499, which only touched
commitment_context.go); CI load just made the merge goroutine starve long enough to surface it.Fix
Register the merge goroutine on
wgbefore spawning it, so theAddhappens-before the build goroutine'sDoneand the counter never reaches zero while a merge is pending.MergeLoopis split into a self-accounting public wrapper (unchanged contract for the external callers —snapshots_cmd,kv_temporal,backend, tests) and a privatemergeLoopbody used by the two in-file background sites.Test (TDD)
db/state/aggregator_close_test.godrives the realBuildFiles2(…, doMerge=true)thenwg.Wait()(whatClosedoes ataggregator.go:541). The empty-aggregator merge is a clean no-op that touches onlyagg.wg, so the only possible race is the reported one.MergeLoopwg.Addataggregator.go:1113viaBuildFiles2.func1.1at:1057racing thewg.Wait, plus the runtimeWaitGroup is reusedpanic.1→2→1→0, no reuse.Note on
buildFilesInBackgroundbuildFilesInBackground(aggregator.go:2155) has the identical bug and gets the identical fix, but isn't covered by an isolated test: its early-return guards (hasData, committed-txNum) mean an empty aggregator never reaches the merge spawn, so a focused test would need heavy real-data setup. The fulldb/state-racesuite, which exercises that path, stays green.ForkableAgg twin (same defect)
ForkableAgg.BuildFilesInBackground(forkable_agg.go) mirrors the same structure and carried the same race: the merge goroutine was spawned with no precedingwg.Add, andForkableAgg.MergeLoopdid its ownwg.Add(1)on the merge goroutine — racingForkableAgg.Close'swg.Wait().ForkableAgg.Closegoes straight toctxCancel+wg.Waitwith noWaitForFiles()pre-drain (unlikeAggregator.Close), so it relies onwg.Waitalone to drain — if anything, more exposed.Same fix (pre-register on
wgbefore the spawn;MergeLoop→ self-accounting public wrapper + privatemergeLoop). UnlikebuildFilesInBackground, this path is covered by an isolated test:TestForkableAggCloseWaitsForBackgroundMergedrivesBuildFilesInBackgroundthenwg.Waitover an empty aggregator (one no-op forkable). Red reproduced the exactWaitGroup is reusedpanic plus the-racereport on the first run; Green is deterministic.Verification
go test -race -count=3 -run TestAggregatorCloseWaitsForBackgroundMerge ./db/state/— passgo test -race -count=3 -run TestForkableAggCloseWaitsForBackgroundMerge ./db/state/— passgo test -race ./db/state/— pass (no race)go test -race -run 'Merge' ./db/test/(directMergeLoopcallers) — passmake lint— 0 issuesmake erigon integration— both binaries build🤖 Generated with Claude Code