Skip to content

db/state: fix Aggregator & ForkableAgg Close vs background MergeLoop WaitGroup race#21528

Merged
AskAlexSharov merged 2 commits into
mainfrom
yperbasis/aggregator-close-merge-race
May 31, 2026
Merged

db/state: fix Aggregator & ForkableAgg Close vs background MergeLoop WaitGroup race#21528
AskAlexSharov merged 2 commits into
mainfrom
yperbasis/aggregator-close-merge-race

Conversation

@yperbasis
Copy link
Copy Markdown
Member

@yperbasis yperbasis commented May 30, 2026

Summary

Fixes #21521 — the data race between Aggregator.Close (wg.Wait) and the background MergeLoop goroutine spawned by BuildFiles2 / buildFilesInBackground. The sibling ForkableAgg carried the identical latent race and gets the same fix here.

Root cause

Both background-build sites spawn the merge goroutine inside the build goroutine:

a.wg.Add(1)
go func() {            // build goroutine
    defer a.wg.Done()
    ... build ...
    if doMerge {
        go func() { a.MergeLoop(ctx) }()   // merge goroutine — MergeLoop does its own wg.Add
    }
}()                    // build goroutine returns → wg.Done

MergeLoop's wg.Add(1) runs in the merge goroutine, which can be scheduled after the build goroutine's wg.Done() has already dropped the counter to zero. A positive Add from a zero counter, concurrent with Close's wg.Wait(), is sync.WaitGroup reuse — undefined behavior, flagged by -race and by the runtime's own panic: sync: WaitGroup is reused before previous Wait has returned.

It is pre-existing (the structure predates #21499, which only touched commitment_context.go); CI load just made the merge goroutine starve long enough to surface it.

Fix

Register the merge goroutine on wg before spawning it, so the Add happens-before the build goroutine's Done and the counter never reaches zero while a merge is pending. MergeLoop is split into a self-accounting public wrapper (unchanged contract for the external callers — snapshots_cmd, kv_temporal, backend, tests) and a private mergeLoop body used by the two in-file background sites.

Test (TDD)

db/state/aggregator_close_test.go drives the real BuildFiles2(…, doMerge=true) then wg.Wait() (what Close does at aggregator.go:541). The empty-aggregator merge is a clean no-op that touches only agg.wg, so the only possible race is the reported one.

  • Red (before the fix): reproduces the exact trace — MergeLoop wg.Add at aggregator.go:1113 via BuildFiles2.func1.1 at :1057 racing the wg.Wait, plus the runtime WaitGroup is reused panic.
  • Green (after the fix): deterministic — counter path 1→2→1→0, no reuse.

Note on buildFilesInBackground

buildFilesInBackground (aggregator.go:2155) has the identical bug and gets the identical fix, but isn't covered by an isolated test: its early-return guards (hasData, committed-txNum) mean an empty aggregator never reaches the merge spawn, so a focused test would need heavy real-data setup. The full db/state -race suite, which exercises that path, stays green.

ForkableAgg twin (same defect)

ForkableAgg.BuildFilesInBackground (forkable_agg.go) mirrors the same structure and carried the same race: the merge goroutine was spawned with no preceding wg.Add, and ForkableAgg.MergeLoop did its own wg.Add(1) on the merge goroutine — racing ForkableAgg.Close's wg.Wait(). ForkableAgg.Close goes straight to ctxCancel + wg.Wait with no WaitForFiles() pre-drain (unlike Aggregator.Close), so it relies on wg.Wait alone to drain — if anything, more exposed.

Same fix (pre-register on wg before the spawn; MergeLoop → self-accounting public wrapper + private mergeLoop). Unlike buildFilesInBackground, this path is covered by an isolated test: TestForkableAggCloseWaitsForBackgroundMerge drives BuildFilesInBackground then wg.Wait over an empty aggregator (one no-op forkable). Red reproduced the exact WaitGroup is reused panic plus the -race report on the first run; Green is deterministic.

Verification

  • go test -race -count=3 -run TestAggregatorCloseWaitsForBackgroundMerge ./db/state/ — pass
  • go test -race -count=3 -run TestForkableAggCloseWaitsForBackgroundMerge ./db/state/ — pass
  • go test -race ./db/state/ — pass (no race)
  • go test -race -run 'Merge' ./db/test/ (direct MergeLoop callers) — pass
  • make lint — 0 issues
  • make erigon integration — both binaries build

🤖 Generated with Claude Code

BuildFiles2 and buildFilesInBackground spawn the merge goroutine inside the
build goroutine, so MergeLoop's wg.Add could run after the build goroutine's
wg.Done — a positive Add from a zero counter, concurrent with Close's wg.Wait
(sync.WaitGroup reuse, flagged by -race).

Register the merge goroutine on wg before spawning it so that Add happens-before
the build goroutine's Done and the counter never reaches zero with a merge
pending. MergeLoop is split into a self-accounting public wrapper (kept for the
external callers) and a private mergeLoop body used by the in-file background
sites.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a sync.WaitGroup reuse race between Aggregator.Close and the background merge goroutines spawned by BuildFiles2 / buildFilesInBackground. Previously, wg.Add(1) was performed inside MergeLoop itself (i.e., on the merge goroutine), which could be scheduled after the parent build goroutine's wg.Done() had already dropped the counter to zero, racing Close's wg.Wait().

Changes:

  • Move a.wg.Add(1) to before the merge goroutine is spawned in both BuildFiles2 and buildFilesInBackground, and call a new private mergeLoop body.
  • Split MergeLoop into a public wrapper (preserving the contract for external callers like snapshots_cmd, kv_temporal, backend) that does its own wg.Add(1)/Done, and a private mergeLoop body used by the in-file background spawn sites.
  • Add db/state/aggregator_close_test.go reproducing the race against an empty aggregator over 64 iterations.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
db/state/aggregator.go Pre-register merge goroutines on wg before spawning; split MergeLoop into self-accounting public wrapper + private mergeLoop body.
db/state/aggregator_close_test.go New regression test driving BuildFiles2(..., doMerge=true) and asserting wg.Wait/Close ordering is race-free.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Twin of the Aggregator fix on this branch. ForkableAgg.BuildFilesInBackground
spawned its merge goroutine without registering on wg first: the wg.Add(1) lived
inside MergeLoop, on the merge goroutine, which can be scheduled after the build
goroutine's wg.Done has dropped the counter to zero — an Add-from-zero concurrent
with Close's wg.Wait, i.e. sync.WaitGroup reuse. ForkableAgg.Close goes straight
to ctxCancel + wg.Wait with no WaitForFiles pre-drain, so it relies on wg.Wait
alone to drain in-flight build/merge.

Same fix as Aggregator: register the merge goroutine on wg before spawning it,
and split MergeLoop into a self-accounting public wrapper plus a private mergeLoop
body used by the in-file background spawn site.

TDD: TestForkableAggCloseWaitsForBackgroundMerge drives BuildFilesInBackground
then wg.Wait over an empty aggregator (one no-op forkable). Red reproduced the
exact "WaitGroup is reused before previous Wait has returned" panic plus the
-race report; Green is deterministic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis changed the title db/state: fix Aggregator.Close vs background MergeLoop WaitGroup race db/state: fix Aggregator & ForkableAgg Close vs background MergeLoop WaitGroup race May 31, 2026
@yperbasis yperbasis marked this pull request as ready for review May 31, 2026 09:20
@AskAlexSharov AskAlexSharov added this pull request to the merge queue May 31, 2026
Merged via the queue into main with commit 2b65c6e May 31, 2026
105 of 165 checks passed
@AskAlexSharov AskAlexSharov deleted the yperbasis/aggregator-close-merge-race branch May 31, 2026 13:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race: Aggregator.Close vs MergeLoop background worker on test teardown

3 participants