Skip to content

[CI] refactor TestFollowerHappyPath to use synctest#8474

Open
peterargue wants to merge 4 commits intomasterfrom
peterargue/fix-follower-happy-path-pebble-panic
Open

[CI] refactor TestFollowerHappyPath to use synctest#8474
peterargue wants to merge 4 commits intomasterfrom
peterargue/fix-follower-happy-path-pebble-panic

Conversation

@peterargue
Copy link
Contributor

@peterargue peterargue commented Feb 27, 2026

Problem

TestFollowerHappyPath in engine/common/follower flakily panics with panic: pebble: closed when run under load alongside other engine tests. The panic manifests as a process crash rather than a clean test failure, obscuring what actually went wrong.

Reproduced by running go test ./engine/....

Root Cause

This test failed with several different failure cases (pebble panic, timeouts, etc). The root cause is the test takes a long time to run in the happy path (~30+ seconds). So under load from other concurrently running tests, it can take much longer. the test was regularly timing out after 90 second. This timeout would cause the test teardown logic to skip waiting for the components to finish shutting down, resulting in the pebble panics.

Fix

Switch the test to using synctest to reduce blocking caused by synchronization. Maintain the original test's randomness to ensure it continues to perform the same test.

Summary by CodeRabbit

  • Tests
    • Improved integration test reliability with coordinated startup/teardown orchestration and deterministic waits so goroutines settle before assertions.
    • Replaced polling-based waits with event-driven finalization signaling to wait for target conditions instead of busy-wait loops.
    • Simplified shutdown and cleanup with deterministic cancellation and coordinated waits for components to finish, reducing flakiness.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Contributor

github-actions bot commented Feb 27, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 27, 2026

📝 Walkthrough

Walkthrough

Reworks the follower integration test to use synctest orchestration: wraps setup in a synctest bubble, uses an OnBlockFinalized consumer to signal completion via a finalized channel (replacing polling), switches readiness to non-blocking AllReady + synctest.Wait, and simplifies shutdown with channel-driven synchronization.

Changes

Cohort / File(s) Summary
Follower integration test
engine/common/follower/integration_test.go
Wrapped test startup in a synctest bubble and added runTestFollowerHappyPath; introduced a finalized channel and OnBlockFinalized consumer to signal finalization (replaces polling); replaced require-style readiness with non-blocking AllReady + synctest.Wait; simplified shutdown to stop producers, cancel context, and wait for loops using synctest.Wait; added synctest and HotStuff imports.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested reviewers

  • zhangchiqing
  • fxamacker

Poem

🐰
I wrapped the test in a cozy synctest shell,
a finalization bell rang, no more polling spell,
goroutines settled, shutdown did glide,
the follower hummed, content inside,
carrots for all — hop, test passed well. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: refactoring TestFollowerHappyPath to use synctest for improved test synchronization.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch peterargue/fix-follower-happy-path-pebble-panic

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@engine/common/follower/integration_test.go`:
- Line 220: Replace the fatal call to unittest.RequireReturnsBefore(wg.Wait,
...) inside the defer with a non-fatal timeout pattern so the defer can still
reach e.g. AllDone; specifically, remove the unittest.RequireReturnsBefore
invocation and instead spawn a goroutine that does wg.Wait() and signals a done
channel, then in the defer use select { case <-done: /* ok */ case
<-time.After(time.Second): t.Errorf("workers did not stop producing in time") }
so the defer continues and still calls AllDone (the defer's wait/cleanup logic
that uses AllDone must remain reachable).

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 65d2049 and e9ea929.

📒 Files selected for processing (1)
  • engine/common/follower/integration_test.go

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
engine/common/follower/integration_test.go (1)

247-257: ⚠️ Potential issue | 🟠 Major

Shutdown says workers are awaited, but wg.Wait() is never reached.

The comment says producers are waited out, but the code never waits on wg. This can leave producer goroutines still calling engine.OnSyncedBlocks while shutdown is in progress.

🔧 Proposed fix
 		// stop producers and wait for them to exit, then wait for engine shutdown.
 		submittingBlocks.Store(false)
+		workersDone := make(chan struct{})
+		go func() {
+			defer close(workersDone)
+			wg.Wait()
+		}()
 		cancel()

 		allDone := moduleutil.AllDone(engine, followerLoop)
 		synctest.Wait()
+		select {
+		case <-workersDone:
+		default:
+			t.Errorf("workers did not stop producing")
+		}
 		select {
 		case <-allDone:
 		default:
 			t.Fatal("engine failed to stop")
 		}

As per coding guidelines: "Use proper synchronization primitives to maintain state consistency, especially in concurrent or distributed contexts."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@engine/common/follower/integration_test.go` around lines 247 - 257, The
shutdown sequence stops producers by setting submittingBlocks and calling
cancel(), but it never waits for the producer WaitGroup (wg) to finish, allowing
producers to still call engine.OnSyncedBlocks during shutdown; update the
sequence in the test to call wg.Wait() (the WaitGroup that tracks producer
goroutines) after submittingBlocks.Store(false) and cancel() and before calling
synctest.Wait() and checking moduleutil.AllDone(engine, followerLoop), so
producers are fully stopped prior to asserting engine shutdown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@engine/common/follower/integration_test.go`:
- Around line 163-167: The select that checks <-allReady currently calls
t.Fatal("engine failed to start") which aborts the test immediately and prevents
the later cancel() from running; change that to t.Errorf("engine failed to
start") followed by a controlled early return (or t.FailNow replacement with a
defer-aware exit) so the test function can reach the cancel() cleanup. Locate
the select using the allReady channel and replace the t.Fatal call with t.Errorf
plus an immediate return to preserve graceful shutdown of the mock context and
ensure cancel() is invoked.

---

Duplicate comments:
In `@engine/common/follower/integration_test.go`:
- Around line 247-257: The shutdown sequence stops producers by setting
submittingBlocks and calling cancel(), but it never waits for the producer
WaitGroup (wg) to finish, allowing producers to still call engine.OnSyncedBlocks
during shutdown; update the sequence in the test to call wg.Wait() (the
WaitGroup that tracks producer goroutines) after submittingBlocks.Store(false)
and cancel() and before calling synctest.Wait() and checking
moduleutil.AllDone(engine, followerLoop), so producers are fully stopped prior
to asserting engine shutdown.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e9ea929 and 2939273.

📒 Files selected for processing (1)
  • engine/common/follower/integration_test.go

Comment on lines +163 to +167
select {
case <-allReady:
default:
t.Fatal("engine failed to start")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Reading full context of integration_test.go around lines 163-167 ==="
wc -l engine/common/follower/integration_test.go

echo
echo "=== Context around lines 155-175 (first fatal check) ==="
sed -n '155,175p' engine/common/follower/integration_test.go

echo
echo "=== Context around lines 245-260 (second fatal check) ==="
sed -n '245,260p' engine/common/follower/integration_test.go

echo
echo "=== Check for cleanup patterns in the test function ==="
# Find the test function containing these lines
sed -n '100,300p' engine/common/follower/integration_test.go | grep -n "func Test\|defer\|cancel\|wg\\.Wait\|t\\.Cleanup"

Repository: onflow/flow-go

Length of output: 1796


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full test function structure (broader view) ==="
# Find where the test function starts
sed -n '1,170p' engine/common/follower/integration_test.go | head -100

Repository: onflow/flow-go

Length of output: 3742


Ensure engine lifecycle checks don't skip cancellation in failure paths.

Using t.Fatal() for start/stop checks will prevent the cancel() call at line 250, leaving the mock context improperly shut down. Replace with t.Errorf() and controlled early returns to preserve graceful shutdown semantics, especially important given this test validates the lifecycle management guarantees.

♻️ Proposed refactor
 		select {
 		case <-allReady:
 		default:
-			t.Fatal("engine failed to start")
+			t.Errorf("engine failed to start")
+			cancel()
+			return
 		}
@@
 		select {
 		case <-allDone:
 		default:
-			t.Fatal("engine failed to stop")
+			t.Errorf("engine failed to stop")
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
select {
case <-allReady:
default:
t.Fatal("engine failed to start")
}
select {
case <-allReady:
default:
t.Errorf("engine failed to start")
cancel()
return
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@engine/common/follower/integration_test.go` around lines 163 - 167, The
select that checks <-allReady currently calls t.Fatal("engine failed to start")
which aborts the test immediately and prevents the later cancel() from running;
change that to t.Errorf("engine failed to start") followed by a controlled early
return (or t.FailNow replacement with a defer-aware exit) so the test function
can reach the cancel() cleanup. Locate the select using the allReady channel and
replace the t.Fatal call with t.Errorf plus an immediate return to preserve
graceful shutdown of the mock context and ensure cancel() is invoked.

@peterargue peterargue changed the title fix: prevent pebble:closed panic in TestFollowerHappyPath shutdown [CI] refactor TestFollowerHappyPath to use synctest Mar 2, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@engine/common/follower/integration_test.go`:
- Around line 256-262: The test stops producers by setting
submittingBlocks.Store(false) but never waits for the worker goroutines to
finish, causing races when workers still call engine.OnSyncedBlocks() while the
defer cancels mockCtx/engine; after submittingBlocks.Store(false) add a
wg.Wait() to block until all workers have called wg.Done(), ensuring all
goroutines (that reference wg and call engine.OnSyncedBlocks()) complete before
the deferred shutdown of mockCtx/engine proceeds.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fb5fa213-d798-416f-9047-df487e189103

📥 Commits

Reviewing files that changed from the base of the PR and between 2939273 and b2d0887.

📒 Files selected for processing (1)
  • engine/common/follower/integration_test.go

Comment on lines +256 to +262
<-finalized

// stop engines and wait for graceful shutdown
cancel()
unittest.RequireCloseBefore(t, moduleutil.AllDone(engine, followerLoop), 10*time.Second, "engine failed to stop")
// Note: in case any error occur, the `mockCtx` will fail the test, due to the unexpected call of `Throw` on the mock.
}()

// wait for target block to become finalized, this might take a while.
require.Eventually(t, func() bool {
final, err := followerState.Final().Head()
require.NoError(t, err)
success := final.Height == targetBlockHeight
if !success {
t.Logf("finalized height %d, waiting for %d", final.Height, targetBlockHeight)
} else {
t.Logf("successfully finalized target height %d\n", targetBlockHeight)
}
return success
}, 90*time.Second, time.Second, "expect to process all blocks before timeout")
// stop producers and wait for them to exit, then wait for engine shutdown.
submittingBlocks.Store(false)

// Note: in case any error occur, the `mockCtx` will fail the test, due to the unexpected call of `Throw` on the mock.
// shutdown is in defer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Missing synchronization: worker goroutines are never waited on before shutdown.

The wg WaitGroup is created and used with Add/Done, but wg.Wait() is never called. After submittingBlocks.Store(false), the test immediately exits, triggering the defer which cancels the context and shuts down the engine. Workers may still be in the middle of calling engine.OnSyncedBlocks() on a shutting-down engine, creating a race condition.

🔧 Proposed fix: wait for workers before proceeding to shutdown
 		// Block until the target block is finalized. The callback above fires from within the
 		// HotStuff finalization goroutine (inside the bubble) and sends on this channel, which
 		// unblocks the main goroutine.
 		<-finalized

 		// stop producers and wait for them to exit, then wait for engine shutdown.
 		submittingBlocks.Store(false)
+		wg.Wait()

 		// Note: in case any error occur, the `mockCtx` will fail the test, due to the unexpected call of `Throw` on the mock.
 		// shutdown is in defer

Based on learnings: "Use proper synchronization primitives to maintain state consistency, especially in concurrent or distributed contexts."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
<-finalized
// stop engines and wait for graceful shutdown
cancel()
unittest.RequireCloseBefore(t, moduleutil.AllDone(engine, followerLoop), 10*time.Second, "engine failed to stop")
// Note: in case any error occur, the `mockCtx` will fail the test, due to the unexpected call of `Throw` on the mock.
}()
// wait for target block to become finalized, this might take a while.
require.Eventually(t, func() bool {
final, err := followerState.Final().Head()
require.NoError(t, err)
success := final.Height == targetBlockHeight
if !success {
t.Logf("finalized height %d, waiting for %d", final.Height, targetBlockHeight)
} else {
t.Logf("successfully finalized target height %d\n", targetBlockHeight)
}
return success
}, 90*time.Second, time.Second, "expect to process all blocks before timeout")
// stop producers and wait for them to exit, then wait for engine shutdown.
submittingBlocks.Store(false)
// Note: in case any error occur, the `mockCtx` will fail the test, due to the unexpected call of `Throw` on the mock.
// shutdown is in defer
<-finalized
// stop producers and wait for them to exit, then wait for engine shutdown.
submittingBlocks.Store(false)
wg.Wait()
// Note: in case any error occur, the `mockCtx` will fail the test, due to the unexpected call of `Throw` on the mock.
// shutdown is in defer
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@engine/common/follower/integration_test.go` around lines 256 - 262, The test
stops producers by setting submittingBlocks.Store(false) but never waits for the
worker goroutines to finish, causing races when workers still call
engine.OnSyncedBlocks() while the defer cancels mockCtx/engine; after
submittingBlocks.Store(false) add a wg.Wait() to block until all workers have
called wg.Done(), ensuring all goroutines (that reference wg and call
engine.OnSyncedBlocks()) complete before the deferred shutdown of mockCtx/engine
proceeds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants