[CI] refactor TestFollowerHappyPath to use synctest#8474
[CI] refactor TestFollowerHappyPath to use synctest#8474peterargue wants to merge 4 commits intomasterfrom
Conversation
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
📝 WalkthroughWalkthroughReworks the follower integration test to use synctest orchestration: wraps setup in a synctest bubble, uses an OnBlockFinalized consumer to signal completion via a Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…ower-happy-path-pebble-panic
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@engine/common/follower/integration_test.go`:
- Line 220: Replace the fatal call to unittest.RequireReturnsBefore(wg.Wait,
...) inside the defer with a non-fatal timeout pattern so the defer can still
reach e.g. AllDone; specifically, remove the unittest.RequireReturnsBefore
invocation and instead spawn a goroutine that does wg.Wait() and signals a done
channel, then in the defer use select { case <-done: /* ok */ case
<-time.After(time.Second): t.Errorf("workers did not stop producing in time") }
so the defer continues and still calls AllDone (the defer's wait/cleanup logic
that uses AllDone must remain reachable).
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
engine/common/follower/integration_test.go (1)
247-257:⚠️ Potential issue | 🟠 MajorShutdown says workers are awaited, but
wg.Wait()is never reached.The comment says producers are waited out, but the code never waits on
wg. This can leave producer goroutines still callingengine.OnSyncedBlockswhile shutdown is in progress.🔧 Proposed fix
// stop producers and wait for them to exit, then wait for engine shutdown. submittingBlocks.Store(false) + workersDone := make(chan struct{}) + go func() { + defer close(workersDone) + wg.Wait() + }() cancel() allDone := moduleutil.AllDone(engine, followerLoop) synctest.Wait() + select { + case <-workersDone: + default: + t.Errorf("workers did not stop producing") + } select { case <-allDone: default: t.Fatal("engine failed to stop") }As per coding guidelines: "Use proper synchronization primitives to maintain state consistency, especially in concurrent or distributed contexts."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@engine/common/follower/integration_test.go` around lines 247 - 257, The shutdown sequence stops producers by setting submittingBlocks and calling cancel(), but it never waits for the producer WaitGroup (wg) to finish, allowing producers to still call engine.OnSyncedBlocks during shutdown; update the sequence in the test to call wg.Wait() (the WaitGroup that tracks producer goroutines) after submittingBlocks.Store(false) and cancel() and before calling synctest.Wait() and checking moduleutil.AllDone(engine, followerLoop), so producers are fully stopped prior to asserting engine shutdown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@engine/common/follower/integration_test.go`:
- Around line 163-167: The select that checks <-allReady currently calls
t.Fatal("engine failed to start") which aborts the test immediately and prevents
the later cancel() from running; change that to t.Errorf("engine failed to
start") followed by a controlled early return (or t.FailNow replacement with a
defer-aware exit) so the test function can reach the cancel() cleanup. Locate
the select using the allReady channel and replace the t.Fatal call with t.Errorf
plus an immediate return to preserve graceful shutdown of the mock context and
ensure cancel() is invoked.
---
Duplicate comments:
In `@engine/common/follower/integration_test.go`:
- Around line 247-257: The shutdown sequence stops producers by setting
submittingBlocks and calling cancel(), but it never waits for the producer
WaitGroup (wg) to finish, allowing producers to still call engine.OnSyncedBlocks
during shutdown; update the sequence in the test to call wg.Wait() (the
WaitGroup that tracks producer goroutines) after submittingBlocks.Store(false)
and cancel() and before calling synctest.Wait() and checking
moduleutil.AllDone(engine, followerLoop), so producers are fully stopped prior
to asserting engine shutdown.
| select { | ||
| case <-allReady: | ||
| default: | ||
| t.Fatal("engine failed to start") | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Reading full context of integration_test.go around lines 163-167 ==="
wc -l engine/common/follower/integration_test.go
echo
echo "=== Context around lines 155-175 (first fatal check) ==="
sed -n '155,175p' engine/common/follower/integration_test.go
echo
echo "=== Context around lines 245-260 (second fatal check) ==="
sed -n '245,260p' engine/common/follower/integration_test.go
echo
echo "=== Check for cleanup patterns in the test function ==="
# Find the test function containing these lines
sed -n '100,300p' engine/common/follower/integration_test.go | grep -n "func Test\|defer\|cancel\|wg\\.Wait\|t\\.Cleanup"Repository: onflow/flow-go
Length of output: 1796
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Full test function structure (broader view) ==="
# Find where the test function starts
sed -n '1,170p' engine/common/follower/integration_test.go | head -100Repository: onflow/flow-go
Length of output: 3742
Ensure engine lifecycle checks don't skip cancellation in failure paths.
Using t.Fatal() for start/stop checks will prevent the cancel() call at line 250, leaving the mock context improperly shut down. Replace with t.Errorf() and controlled early returns to preserve graceful shutdown semantics, especially important given this test validates the lifecycle management guarantees.
♻️ Proposed refactor
select {
case <-allReady:
default:
- t.Fatal("engine failed to start")
+ t.Errorf("engine failed to start")
+ cancel()
+ return
}
@@
select {
case <-allDone:
default:
- t.Fatal("engine failed to stop")
+ t.Errorf("engine failed to stop")
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| select { | |
| case <-allReady: | |
| default: | |
| t.Fatal("engine failed to start") | |
| } | |
| select { | |
| case <-allReady: | |
| default: | |
| t.Errorf("engine failed to start") | |
| cancel() | |
| return | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@engine/common/follower/integration_test.go` around lines 163 - 167, The
select that checks <-allReady currently calls t.Fatal("engine failed to start")
which aborts the test immediately and prevents the later cancel() from running;
change that to t.Errorf("engine failed to start") followed by a controlled early
return (or t.FailNow replacement with a defer-aware exit) so the test function
can reach the cancel() cleanup. Locate the select using the allReady channel and
replace the t.Fatal call with t.Errorf plus an immediate return to preserve
graceful shutdown of the mock context and ensure cancel() is invoked.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@engine/common/follower/integration_test.go`:
- Around line 256-262: The test stops producers by setting
submittingBlocks.Store(false) but never waits for the worker goroutines to
finish, causing races when workers still call engine.OnSyncedBlocks() while the
defer cancels mockCtx/engine; after submittingBlocks.Store(false) add a
wg.Wait() to block until all workers have called wg.Done(), ensuring all
goroutines (that reference wg and call engine.OnSyncedBlocks()) complete before
the deferred shutdown of mockCtx/engine proceeds.
| <-finalized | ||
|
|
||
| // stop engines and wait for graceful shutdown | ||
| cancel() | ||
| unittest.RequireCloseBefore(t, moduleutil.AllDone(engine, followerLoop), 10*time.Second, "engine failed to stop") | ||
| // Note: in case any error occur, the `mockCtx` will fail the test, due to the unexpected call of `Throw` on the mock. | ||
| }() | ||
|
|
||
| // wait for target block to become finalized, this might take a while. | ||
| require.Eventually(t, func() bool { | ||
| final, err := followerState.Final().Head() | ||
| require.NoError(t, err) | ||
| success := final.Height == targetBlockHeight | ||
| if !success { | ||
| t.Logf("finalized height %d, waiting for %d", final.Height, targetBlockHeight) | ||
| } else { | ||
| t.Logf("successfully finalized target height %d\n", targetBlockHeight) | ||
| } | ||
| return success | ||
| }, 90*time.Second, time.Second, "expect to process all blocks before timeout") | ||
| // stop producers and wait for them to exit, then wait for engine shutdown. | ||
| submittingBlocks.Store(false) | ||
|
|
||
| // Note: in case any error occur, the `mockCtx` will fail the test, due to the unexpected call of `Throw` on the mock. | ||
| // shutdown is in defer |
There was a problem hiding this comment.
Missing synchronization: worker goroutines are never waited on before shutdown.
The wg WaitGroup is created and used with Add/Done, but wg.Wait() is never called. After submittingBlocks.Store(false), the test immediately exits, triggering the defer which cancels the context and shuts down the engine. Workers may still be in the middle of calling engine.OnSyncedBlocks() on a shutting-down engine, creating a race condition.
🔧 Proposed fix: wait for workers before proceeding to shutdown
// Block until the target block is finalized. The callback above fires from within the
// HotStuff finalization goroutine (inside the bubble) and sends on this channel, which
// unblocks the main goroutine.
<-finalized
// stop producers and wait for them to exit, then wait for engine shutdown.
submittingBlocks.Store(false)
+ wg.Wait()
// Note: in case any error occur, the `mockCtx` will fail the test, due to the unexpected call of `Throw` on the mock.
// shutdown is in deferBased on learnings: "Use proper synchronization primitives to maintain state consistency, especially in concurrent or distributed contexts."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| <-finalized | |
| // stop engines and wait for graceful shutdown | |
| cancel() | |
| unittest.RequireCloseBefore(t, moduleutil.AllDone(engine, followerLoop), 10*time.Second, "engine failed to stop") | |
| // Note: in case any error occur, the `mockCtx` will fail the test, due to the unexpected call of `Throw` on the mock. | |
| }() | |
| // wait for target block to become finalized, this might take a while. | |
| require.Eventually(t, func() bool { | |
| final, err := followerState.Final().Head() | |
| require.NoError(t, err) | |
| success := final.Height == targetBlockHeight | |
| if !success { | |
| t.Logf("finalized height %d, waiting for %d", final.Height, targetBlockHeight) | |
| } else { | |
| t.Logf("successfully finalized target height %d\n", targetBlockHeight) | |
| } | |
| return success | |
| }, 90*time.Second, time.Second, "expect to process all blocks before timeout") | |
| // stop producers and wait for them to exit, then wait for engine shutdown. | |
| submittingBlocks.Store(false) | |
| // Note: in case any error occur, the `mockCtx` will fail the test, due to the unexpected call of `Throw` on the mock. | |
| // shutdown is in defer | |
| <-finalized | |
| // stop producers and wait for them to exit, then wait for engine shutdown. | |
| submittingBlocks.Store(false) | |
| wg.Wait() | |
| // Note: in case any error occur, the `mockCtx` will fail the test, due to the unexpected call of `Throw` on the mock. | |
| // shutdown is in defer |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@engine/common/follower/integration_test.go` around lines 256 - 262, The test
stops producers by setting submittingBlocks.Store(false) but never waits for the
worker goroutines to finish, causing races when workers still call
engine.OnSyncedBlocks() while the defer cancels mockCtx/engine; after
submittingBlocks.Store(false) add a wg.Wait() to block until all workers have
called wg.Done(), ensuring all goroutines (that reference wg and call
engine.OnSyncedBlocks()) complete before the deferred shutdown of mockCtx/engine
proceeds.
Problem
TestFollowerHappyPathinengine/common/followerflakily panics withpanic: pebble: closedwhen run under load alongside other engine tests. The panic manifests as a process crash rather than a clean test failure, obscuring what actually went wrong.Reproduced by running
go test ./engine/....Root Cause
This test failed with several different failure cases (pebble panic, timeouts, etc). The root cause is the test takes a long time to run in the happy path (~30+ seconds). So under load from other concurrently running tests, it can take much longer. the test was regularly timing out after 90 second. This timeout would cause the test teardown logic to skip waiting for the components to finish shutting down, resulting in the pebble panics.
Fix
Switch the test to using synctest to reduce blocking caused by synchronization. Maintain the original test's randomness to ensure it continues to perform the same test.
Summary by CodeRabbit