Skip to content

Conversation

@tac0turtle
Copy link
Contributor

Overview

sync service fix for when we are not on genesis but have an empty store

in 1.0.0-beta.9 we fixed writing to p2p store but the issue is when the store is empty AND we are not in genesis then the store is not initialised, this pr aims to fix this

@github-actions
Copy link
Contributor

github-actions bot commented Nov 12, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedNov 17, 2025, 10:05 AM

@claude
Copy link
Contributor

claude bot commented Nov 12, 2025

Claude finished @tac0turtle's task —— View job


Comprehensive Code Review

I've completed a thorough review of this PR which fixes the sync service initialization for non-zero height starts with empty stores. Below is my detailed feedback organized by focus area.


1. Code Quality & Architecture

sync_service.go:58 - Atomic Boolean for Store Initialization

Positive: The introduction of storeInitialized atomic.Bool is the right approach for thread-safe tracking of initialization state. This prevents race conditions between concurrent calls to WriteToStoreAndBroadcast.

sync_service.go:128-181 - WriteToStoreAndBroadcast Logic

Issue - Race Condition Risk: Lines 140-147 have a check-then-act pattern that could theoretically cause issues:

if !syncService.storeInitialized.Load() {
    var err error
    storeInitialized, err = syncService.initStore(ctx, headerOrData)
    if err != nil {
        return fmt.Errorf("failed to initialize the store: %w", err)
    }
    syncService.storeInitialized.Store(true)
}

Recommendation: Consider using atomic.CompareAndSwap for truly atomic initialization, or document why the current approach is safe (likely because this is only called from aggregator sequentially).

sync_service.go:317-385 - initFromP2PWithRetry Simplification

Positive: The removal of the ability to start from arbitrary heights simplifies the logic significantly. The decision to either:

  1. Start from genesis if store is empty
  2. Re-sync from existing head if store has data

...is a sound design choice that reduces complexity and potential edge cases.

Question for Author: In the review comment at line 317, julienrbrt asks "Why is that actually? You should be able to ignore the previous state checks for that block and just start from an arbitrary block?" - This deserves a clear response in the code comments explaining the state reconstruction requirement.

Recommendation: Add a doc comment explaining:

// initFromP2PWithRetry initializes the syncer from P2P with a retry mechanism.
// It inspects the local store to determine the first height to request:
//   - when the store is empty AND we are not at genesis, we must start from genesis
//     because we need to reconstruct state from the beginning
//   - when the store already contains items, it reuses the latest height as the starting point
//
// Note: We cannot start from arbitrary heights because state reconstruction requires
// sequential application of all blocks from genesis.

2. Error Handling & Resilience

sync_service.go:131-137 - Validation Guards

Positive: Good defensive programming with early validation:

  • Checking for zero initial height
  • Checking for zero header/data

sync_service.go:157-178 - Broadcast Error Handling

Issue - Unclear Error Suppression: The broadcast error suppression logic is complex and the conditions are hard to verify:

if (firstStart && errors.Is(err, pubsub.ValidationError{Reason: pubsub.RejectValidationIgnored})) ||
    ((storeInitialized) && errors.Is(err, pubsub.ValidationError{Reason: pubsub.RejectValidationFailed})) {
    return nil
}

Concern: Using panic() on line 177 for broadcast failures is aggressive. This will crash the entire node process.

Recommendation:

  1. Add comprehensive comments explaining why these specific errors are safe to ignore
  2. Consider whether panic is appropriate here or if we should return an error and let the caller decide
  3. Add tests that verify these error cases are handled correctly

sync_service.go:343-354 - Retry Logic

Positive: Good exponential backoff with max timeout. However:

  • Max backoff is 10 seconds
  • Overall timeout is 10 minutes
  • This means up to 60 retry attempts in worst case

Question: Is 10 minutes the right timeout? For a node starting up, this could delay startup significantly if peers are unavailable.


3. Security

No Critical Security Issues Found

The changes don't introduce obvious security vulnerabilities. The atomic boolean prevents race conditions that could lead to inconsistent state.

Minor Observation: The panic() calls could be considered a DoS vector if an attacker can trigger broadcast validation errors, but this appears to be an internal invariant violation rather than external input.


4. Performance & Resource Efficiency

sync_service.go:140-147 - Double Store Check

Inefficiency: The code checks store state twice:

  1. Line 140: !syncService.storeInitialized.Load()
  2. Inside initStore() at line 240: syncService.store.Head(ctx) which checks if store is empty

Recommendation: Consider whether the atomic boolean is sufficient, or if the double-check is necessary for correctness. If both are needed, document why.

sync_service.go:317-385 - Blocking Initialization

Design Question: initFromP2PWithRetry blocks for up to 10 minutes. For nodes with many peers, this could delay startup. Consider:

  • Shorter initial timeout with user feedback
  • Progress logging at regular intervals
  • Configurable timeout

5. Testing & Reliability

sync_service_test.go:117-168 - TestHeaderSyncServiceInitFromHigherHeight

Positive: Good test coverage for the new scenario - starting from a height higher than genesis.

Issue - Incomplete Test: The test at line 167 only calls WriteToStoreAndBroadcast once with height genesisDoc.InitialHeight + 5. This doesn't fully test the sync-from-genesis behavior.

Recommendation: Add test cases for:

  1. Empty store + non-genesis initial height → should fetch from genesis
  2. Store with existing data + new higher height → should sync from last stored height
  3. Concurrent calls to WriteToStoreAndBroadcast → verify thread safety
  4. Error cases in broadcast validation

Missing: Integration tests showing the fix works when:

  • Node crashes after producing blocks
  • Node restarts with empty P2P store but non-empty chain state
  • Node bootstraps from peers starting at non-genesis height

6. Documentation & API Changes

Proto Changes (state_rpc.proto:71-86)

New RPC Method: GetP2PStoreInfo is added to expose P2P store metadata.

API Design Question: Lines 82-85 use boolean flags head_present and tail_present:

bool head_present = 3;
P2PStoreEntry head = 4;
bool tail_present = 5;
P2PStoreEntry tail = 6;

Observation (echoing julienrbrt's comment): The *_present booleans seem redundant. In protobuf, optional fields can be checked for presence. Consider using:

optional P2PStoreEntry head = 3;
optional P2PStoreEntry tail = 4;

Response to julienrbrt's comment: While it may seem repetitive, the explicit flags make the API clearer for clients and avoid ambiguity between "not present" vs "present but default values". However, following protobuf conventions would be more idiomatic.

Missing Documentation

Issue: The PR description and commit messages don't clearly explain:

  1. What specific scenario this fixes
  2. How to reproduce the bug
  3. What the expected behavior is now

Recommendation: Add to CHANGELOG.md (responding to julienrbrt's comment at config.go:196):

### Bug Fixes
- Fixed sync service initialization when starting from non-zero height with empty P2P store
  - Previously, nodes would fail to initialize if they had no P2P store entries but were not at genesis
  - Now correctly fetches genesis block from peers and rebuilds state sequentially

7. Specific Code Issues

pkg/config/config.go - Removed Configuration

Question: The PR removes some configuration fields. This could be a breaking change for existing deployments. Was this intentional? If so, needs migration guide.

block/internal/common/replay.go:70-78 - Execution Layer Ahead Check

if execHeight > targetHeight {
    s.logger.Warn().
        Uint64("target_height", targetHeight).
        Uint64("exec_layer_height", execHeight).
        Msg("execution layer is ahead of target height - skipping replay")
    return nil
}

Good: This change allows execution to be ahead, which is more lenient and matches the PR description "allow execution to be ahead as we will check apphash if anything is truly wrong".

Concern: This could mask issues. If execution is ahead, how did that happen? Should this be an error in some cases?


8. Additional Observations

TODO Comments

Line 36 in sync_service.go has a TODO about pruning. Consider opening an issue to track this if one doesn't exist.

Code Coverage

Codecov reports 30.76% patch coverage with 9 lines missing coverage. Priority areas for test coverage:

  • Error paths in initFromP2PWithRetry
  • Broadcast error suppression logic
  • Concurrent initialization scenarios

Summary & Recommendations

Strengths:
✅ Correctly addresses the empty store + non-genesis height bug
✅ Simplifies sync logic by removing arbitrary height starts
✅ Adds useful observability via P2P store info RPC
✅ Uses atomic operations for thread safety

Critical Issues:
⚠️ Broadcast error handling uses panic() which crashes the node - needs review
⚠️ Race condition risk in store initialization check-then-act pattern
⚠️ Missing changelog entry (requested by reviewer)

Recommended Changes Before Merge:

  1. Add changelog entry explaining the fix
  2. Address the redundant *_present boolean fields in proto or document rationale
  3. Add doc comments explaining why arbitrary height sync was removed
  4. Consider replacing panic() with error returns
  5. Add integration test for the specific bug scenario
  6. Improve test coverage for error paths

Questions for Author:

  1. Why was the ability to start from arbitrary heights removed? (needs code comment)
  2. Is panic() on broadcast failure the intended behavior?
  3. Should the execution-ahead scenario be logged as warning or error?
  4. Are the removed config fields a breaking change?

Overall, this is a solid fix that addresses a real issue. With the recommended improvements to error handling, testing, and documentation, it will be ready to merge.


@codecov
Copy link

codecov bot commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 30.76923% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.72%. Comparing base (bdef117) to head (c378824).

Files with missing lines Patch % Lines
pkg/sync/sync_service.go 30.76% 8 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2834      +/-   ##
==========================================
- Coverage   64.75%   64.72%   -0.04%     
==========================================
  Files          81       81              
  Lines        7170     7172       +2     
==========================================
- Hits         4643     4642       -1     
- Misses       1987     1990       +3     
  Partials      540      540              
Flag Coverage Δ
combined 64.72% <30.76%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tac0turtle tac0turtle marked this pull request as ready for review November 12, 2025 14:14
@tac0turtle tac0turtle changed the title fix: sync service fix: sync service for non zero height starts Nov 12, 2025
@tac0turtle tac0turtle changed the title fix: sync service for non zero height starts fix: sync service for non zero height starts with empty store Nov 12, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Nov 12, 2025

PR Preview Action v1.6.2

🚀 View preview at
https://evstack.github.io/docs-preview/pr-2834/

Built to branch main at 2025-11-17 10:06 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

alpe
alpe previously approved these changes Nov 12, 2025
Copy link
Contributor

@alpe alpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utAck 👍

@tac0turtle
Copy link
Contributor Author

testing a few other fixes so will wait off on merging this

<!--
Please read and fill out this form before submitting your PR.

Please make sure you have reviewed our contributors guide before
submitting your
first PR.

NOTE: PR titles should follow semantic commits:
https://www.conventionalcommits.org/en/v1.0.0/
-->

## Overview

<!-- 
Please provide an explanation of the PR, including the appropriate
context,
background, goal, and rationale. If there is an issue with this
information,
please provide a tl;dr and link the issue. 

Ex: Closes #<issue number>
-->


add store-info command to inspect p2p store to see what is present
<!--
Please read and fill out this form before submitting your PR.

Please make sure you have reviewed our contributors guide before
submitting your
first PR.

NOTE: PR titles should follow semantic commits:
https://www.conventionalcommits.org/en/v1.0.0/
-->

## Overview

<!-- 
Please provide an explanation of the PR, including the appropriate
context,
background, goal, and rationale. If there is an issue with this
information,
please provide a tl;dr and link the issue. 

Ex: Closes #<issue number>
-->

This pr removes the trsuted hash approach to sync. this works for
celestia node since they do not reconstruct state so they can jump to a
height that is closer to the head. With Evolve this assumption is
incorrect, we ned to reconstruct state, meaning we either need to sync
from genesis or download a db snapshot then start the node from there.

if err := headerOrData.Validate(); err != nil {
syncService.logger.Error().Err(err).Msg("failed to validate header")
panic(err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: remove once we are caught up in order to verify if anything is going on

heightToQuery uint64
)

if syncService.conf.Node.TrustedHash != "" {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this as we can not start syncing from a random height as we are in the process of reconstructing state. While this is something that is useful for light mode, we should aim to simplify the system before working on light mode as today its not a priority.

Instead we check the p2p store to see if we have a height, if not then we go to genesis.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that actually? You should be able to ignore the previous state checks for that block and just start from an abitrary block?

Copy link
Member

@julienrbrt julienrbrt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first walkthrough

LazyBlockInterval DurationWrapper `mapstructure:"lazy_block_interval" yaml:"lazy_block_interval" comment:"Maximum interval between blocks in lazy aggregation mode (LazyAggregator). Ensures blocks are produced periodically even without transactions to keep the chain active. Generally larger than BlockTime."`

// Header configuration
TrustedHash string `mapstructure:"trusted_hash" yaml:"trusted_hash" comment:"Initial trusted hash used to bootstrap the header exchange service. Allows nodes to start synchronizing from a specific trusted point in the chain instead of genesis. When provided, the node will fetch the corresponding header/block from peers using this hash and use it as a starting point for synchronization. If not provided, the node will attempt to fetch the genesis block instead."`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get a changelog for this?

}

syncService.logger.Error().Err(err).Msg("failed to broadcast")
panic(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to remove as well

heightToQuery uint64
)

if syncService.conf.Node.TrustedHash != "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is that actually? You should be able to ignore the previous state checks for that block and just start from an abitrary block?

message P2PStoreSnapshot {
string label = 1;
uint64 height = 2;
bool head_present = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems repetitive, if the entry is empty it most likely means not present no?

@tac0turtle tac0turtle marked this pull request as draft November 15, 2025 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants