-
Notifications
You must be signed in to change notification settings - Fork 236
fix: sync service for non zero height starts with empty store #2834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
|
|
Claude finished @tac0turtle's task —— View job Comprehensive Code ReviewI've completed a thorough review of this PR which fixes the sync service initialization for non-zero height starts with empty stores. Below is my detailed feedback organized by focus area. 1. Code Quality & Architecturesync_service.go:58 - Atomic Boolean for Store InitializationPositive: The introduction of sync_service.go:128-181 - WriteToStoreAndBroadcast LogicIssue - Race Condition Risk: Lines 140-147 have a check-then-act pattern that could theoretically cause issues: if !syncService.storeInitialized.Load() {
var err error
storeInitialized, err = syncService.initStore(ctx, headerOrData)
if err != nil {
return fmt.Errorf("failed to initialize the store: %w", err)
}
syncService.storeInitialized.Store(true)
}Recommendation: Consider using sync_service.go:317-385 - initFromP2PWithRetry SimplificationPositive: The removal of the ability to start from arbitrary heights simplifies the logic significantly. The decision to either:
...is a sound design choice that reduces complexity and potential edge cases. Question for Author: In the review comment at line 317, julienrbrt asks "Why is that actually? You should be able to ignore the previous state checks for that block and just start from an arbitrary block?" - This deserves a clear response in the code comments explaining the state reconstruction requirement. Recommendation: Add a doc comment explaining: // initFromP2PWithRetry initializes the syncer from P2P with a retry mechanism.
// It inspects the local store to determine the first height to request:
// - when the store is empty AND we are not at genesis, we must start from genesis
// because we need to reconstruct state from the beginning
// - when the store already contains items, it reuses the latest height as the starting point
//
// Note: We cannot start from arbitrary heights because state reconstruction requires
// sequential application of all blocks from genesis.2. Error Handling & Resiliencesync_service.go:131-137 - Validation GuardsPositive: Good defensive programming with early validation:
sync_service.go:157-178 - Broadcast Error HandlingIssue - Unclear Error Suppression: The broadcast error suppression logic is complex and the conditions are hard to verify: if (firstStart && errors.Is(err, pubsub.ValidationError{Reason: pubsub.RejectValidationIgnored})) ||
((storeInitialized) && errors.Is(err, pubsub.ValidationError{Reason: pubsub.RejectValidationFailed})) {
return nil
}Concern: Using Recommendation:
sync_service.go:343-354 - Retry LogicPositive: Good exponential backoff with max timeout. However:
Question: Is 10 minutes the right timeout? For a node starting up, this could delay startup significantly if peers are unavailable. 3. SecurityNo Critical Security Issues FoundThe changes don't introduce obvious security vulnerabilities. The atomic boolean prevents race conditions that could lead to inconsistent state. Minor Observation: The 4. Performance & Resource Efficiencysync_service.go:140-147 - Double Store CheckInefficiency: The code checks store state twice:
Recommendation: Consider whether the atomic boolean is sufficient, or if the double-check is necessary for correctness. If both are needed, document why. sync_service.go:317-385 - Blocking InitializationDesign Question:
5. Testing & Reliabilitysync_service_test.go:117-168 - TestHeaderSyncServiceInitFromHigherHeightPositive: Good test coverage for the new scenario - starting from a height higher than genesis. Issue - Incomplete Test: The test at line 167 only calls Recommendation: Add test cases for:
Missing: Integration tests showing the fix works when:
6. Documentation & API ChangesProto Changes (state_rpc.proto:71-86)New RPC Method: API Design Question: Lines 82-85 use boolean flags bool head_present = 3;
P2PStoreEntry head = 4;
bool tail_present = 5;
P2PStoreEntry tail = 6;Observation (echoing julienrbrt's comment): The optional P2PStoreEntry head = 3;
optional P2PStoreEntry tail = 4;Response to julienrbrt's comment: While it may seem repetitive, the explicit flags make the API clearer for clients and avoid ambiguity between "not present" vs "present but default values". However, following protobuf conventions would be more idiomatic. Missing DocumentationIssue: The PR description and commit messages don't clearly explain:
Recommendation: Add to CHANGELOG.md (responding to julienrbrt's comment at config.go:196): ### Bug Fixes
- Fixed sync service initialization when starting from non-zero height with empty P2P store
- Previously, nodes would fail to initialize if they had no P2P store entries but were not at genesis
- Now correctly fetches genesis block from peers and rebuilds state sequentially7. Specific Code Issuespkg/config/config.go - Removed ConfigurationQuestion: The PR removes some configuration fields. This could be a breaking change for existing deployments. Was this intentional? If so, needs migration guide. block/internal/common/replay.go:70-78 - Execution Layer Ahead Checkif execHeight > targetHeight {
s.logger.Warn().
Uint64("target_height", targetHeight).
Uint64("exec_layer_height", execHeight).
Msg("execution layer is ahead of target height - skipping replay")
return nil
}Good: This change allows execution to be ahead, which is more lenient and matches the PR description "allow execution to be ahead as we will check apphash if anything is truly wrong". Concern: This could mask issues. If execution is ahead, how did that happen? Should this be an error in some cases? 8. Additional ObservationsTODO CommentsLine 36 in sync_service.go has a TODO about pruning. Consider opening an issue to track this if one doesn't exist. Code CoverageCodecov reports 30.76% patch coverage with 9 lines missing coverage. Priority areas for test coverage:
Summary & RecommendationsStrengths: Critical Issues: Recommended Changes Before Merge:
Questions for Author:
Overall, this is a solid fix that addresses a real issue. With the recommended improvements to error handling, testing, and documentation, it will be ready to merge. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2834 +/- ##
==========================================
- Coverage 64.75% 64.72% -0.04%
==========================================
Files 81 81
Lines 7170 7172 +2
==========================================
- Hits 4643 4642 -1
- Misses 1987 1990 +3
Partials 540 540
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
alpe
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utAck 👍
|
testing a few other fixes so will wait off on merging this |
<!-- Please read and fill out this form before submitting your PR. Please make sure you have reviewed our contributors guide before submitting your first PR. NOTE: PR titles should follow semantic commits: https://www.conventionalcommits.org/en/v1.0.0/ --> ## Overview <!-- Please provide an explanation of the PR, including the appropriate context, background, goal, and rationale. If there is an issue with this information, please provide a tl;dr and link the issue. Ex: Closes #<issue number> --> add store-info command to inspect p2p store to see what is present
<!-- Please read and fill out this form before submitting your PR. Please make sure you have reviewed our contributors guide before submitting your first PR. NOTE: PR titles should follow semantic commits: https://www.conventionalcommits.org/en/v1.0.0/ --> ## Overview <!-- Please provide an explanation of the PR, including the appropriate context, background, goal, and rationale. If there is an issue with this information, please provide a tl;dr and link the issue. Ex: Closes #<issue number> --> This pr removes the trsuted hash approach to sync. this works for celestia node since they do not reconstruct state so they can jump to a height that is closer to the head. With Evolve this assumption is incorrect, we ned to reconstruct state, meaning we either need to sync from genesis or download a db snapshot then start the node from there.
|
|
||
| if err := headerOrData.Validate(); err != nil { | ||
| syncService.logger.Error().Err(err).Msg("failed to validate header") | ||
| panic(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: remove once we are caught up in order to verify if anything is going on
| heightToQuery uint64 | ||
| ) | ||
|
|
||
| if syncService.conf.Node.TrustedHash != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this as we can not start syncing from a random height as we are in the process of reconstructing state. While this is something that is useful for light mode, we should aim to simplify the system before working on light mode as today its not a priority.
Instead we check the p2p store to see if we have a height, if not then we go to genesis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is that actually? You should be able to ignore the previous state checks for that block and just start from an abitrary block?
julienrbrt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first walkthrough
| LazyBlockInterval DurationWrapper `mapstructure:"lazy_block_interval" yaml:"lazy_block_interval" comment:"Maximum interval between blocks in lazy aggregation mode (LazyAggregator). Ensures blocks are produced periodically even without transactions to keep the chain active. Generally larger than BlockTime."` | ||
|
|
||
| // Header configuration | ||
| TrustedHash string `mapstructure:"trusted_hash" yaml:"trusted_hash" comment:"Initial trusted hash used to bootstrap the header exchange service. Allows nodes to start synchronizing from a specific trusted point in the chain instead of genesis. When provided, the node will fetch the corresponding header/block from peers using this hash and use it as a starting point for synchronization. If not provided, the node will attempt to fetch the genesis block instead."` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get a changelog for this?
| } | ||
|
|
||
| syncService.logger.Error().Err(err).Msg("failed to broadcast") | ||
| panic(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to remove as well
| heightToQuery uint64 | ||
| ) | ||
|
|
||
| if syncService.conf.Node.TrustedHash != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is that actually? You should be able to ignore the previous state checks for that block and just start from an abitrary block?
| message P2PStoreSnapshot { | ||
| string label = 1; | ||
| uint64 height = 2; | ||
| bool head_present = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems repetitive, if the entry is empty it most likely means not present no?
Overview
sync service fix for when we are not on genesis but have an empty store
in 1.0.0-beta.9 we fixed writing to p2p store but the issue is when the store is empty AND we are not in genesis then the store is not initialised, this pr aims to fix this