Skip to content

Conversation

@tac0turtle
Copy link
Contributor

Overview

This pr removes the trsuted hash approach to sync. this works for celestia node since they do not reconstruct state so they can jump to a height that is closer to the head. With Evolve this assumption is incorrect, we ned to reconstruct state, meaning we either need to sync from genesis or download a db snapshot then start the node from there.

@github-actions
Copy link
Contributor

github-actions bot commented Nov 12, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedNov 14, 2025, 11:11 AM

@claude
Copy link
Contributor

claude bot commented Nov 12, 2025

Claude finished @tac0turtle's task —— View job


Comprehensive Code Review - Trusted Hash Removal

I've completed a thorough review of PR #2838 which removes the trusted hash sync approach. Here are my findings across all focus areas:


1. Code Quality & Architecture ⚠️

Positive Changes:

  • Simplified sync logic: Removing the trusted hash path reduces complexity in initFromP2PWithRetry (pkg/sync/sync_service.go:306-368)
  • Better separation of concerns: The new approach inspects the local store to determine whether to fetch genesis or resume from existing head
  • Clearer intent: Code now explicitly handles two cases: empty store (fetch genesis) vs populated store (resume from head)

Concerns:

  1. Critical Logic Change (pkg/sync/sync_service.go:317-343):

    // OLD: Could start from trusted hash OR genesis
    if syncService.conf.Node.TrustedHash != "" {
        trustedHashBytes, err := hex.DecodeString(syncService.conf.Node.TrustedHash)
        // ... fetch by hash
    } else {
        // ... fetch by height (genesis)
    }
    
    // NEW: Always queries by height, either head or genesis
    head, headErr := syncService.store.Head(ctx)
    switch {
    case errors.Is(headErr, header.ErrNotFound), errors.Is(headErr, header.ErrEmptyStore):
        heightToQuery = syncService.genesis.InitialHeight
    case headErr != nil:
        return false, fmt.Errorf("failed to inspect local store head: %w", headErr)
    default:
        heightToQuery = head.Height()  // Resume from existing head
    }

    Issue: This fundamentally changes the sync semantics. If a node has partial state up to height H but the network is now at height H+1000, the node will request height H from peers. This assumes:

    • Peers retain historical data at height H
    • No pruning has occurred
    • State reconstruction from height H is always possible
  2. Incomplete Refactor (pkg/sync/sync_service.go:137-143 in PR version):

    storeInitialized, err := syncService.initStore(ctx, headerOrData)
    if err != nil {
        return fmt.Errorf("failed to initialize the store: %w", err)
    }

    The initStore return signature changed from error to (bool, error) to track whether initialization occurred. However, the current version in main still uses the old signature without the boolean return. This suggests the PR changes may not be fully reflected in the branch I'm reviewing.

  3. Configuration Kept But Unused (pkg/config/config.go:196):

    TrustedHash string `mapstructure:"trusted_hash" yaml:"trusted_hash" comment:"Initial trusted hash..."`

    The TrustedHash field remains in the configuration struct and is documented, but is completely ignored in the new implementation. This creates user confusion - why document a field that does nothing?


2. Error Handling & Resilience ⚠️

Positive:

  • Exponential backoff retry logic preserved (10-second max, 10-minute total timeout)
  • Proper error wrapping with context

Critical Concerns:

  1. No Fallback for Unavailable Heights (pkg/sync/sync_service.go:333-335 in PR):

    if trusted, err = syncService.ex.GetByHeight(ctx, heightToQuery); err != nil {
        return false, fmt.Errorf("failed to fetch height %d from peers: %w", heightToQuery, err)
    }

    If peers have pruned the requested height, the node is stuck in infinite retry (up to 10 minutes timeout). No mechanism to:

    • Detect pruning and request a later height
    • Download a state snapshot
    • Gracefully fail with actionable guidance
  2. State Reconstruction Assumption: The PR description states nodes must either:

    • Sync from genesis, OR
    • Download a DB snapshot

    But the code provides no mechanism for the second option. If a node starts midchain without genesis sync capability, it will fail.


3. Security 🔒

Positive:

  • Removes potential attack vector of malicious trusted hash injection
  • Forces validation from genesis or last known good state

Concerns:

  1. Trust Model Change: Previously, operators could bootstrap from a hash they trusted (e.g., from block explorer). Now they must either:

    • Trust genesis (if syncing from scratch)
    • Trust their existing local state (if resuming)

    This is more restrictive but not inherently less secure.

  2. No Validation of Fetched Height: When resuming from head.Height(), there's no verification that the fetched header matches the local head hash. A malicious peer could return a fork at the same height.


4. Performance & Resource Efficiency ⚡

Major Concern - Sync Performance Degradation:

The PR description acknowledges this isn't like Celestia node which can "jump to a height closer to the head." The new approach means:

  1. Full nodes joining midchain must sync from genesis (potentially millions of blocks)
  2. No fast-sync capability - every block must be processed for state reconstruction
  3. Network resource implications - peers must serve historical headers/data indefinitely

Impact on Node Operators:

  • Significantly longer sync times for new nodes
  • Higher bandwidth requirements
  • Potential need for centralized snapshot distribution (not implemented)

5. Testing & Reliability 🧪

Critical Gap - Test Coverage at 7.14%:

Codecov reports only 7.14% patch coverage with 13 lines missing coverage in sync_service.go. This is unacceptable for a core sync refactor.

Test File Analysis (node/full_node_integration_test.go):

The test file changes reveal:

  • Line 276-278: testSingleSequencerSingleFullNodeTrustedHash still exists but tests are removed (lines 72 deleted)
  • The test function signatures remain but implementation is gutted

Missing Test Cases:

  1. Node resuming from existing head (store has partial chain)
  2. Peers don't have the requested height (pruning scenario)
  3. Mismatch between local head height and network state
  4. Large gap between local head and network head
  5. Genesis fetch failure vs mid-chain fetch failure

6. Documentation & Developer Experience 📚

Positive:

  • Documentation updated to reflect new behavior (docs/learn/specs/header-sync.md:74)
  • Comment in code explains the new logic

Issues:

  1. Stale Configuration Docs (docs/learn/config.md):
    19 lines removed but TrustedHash documentation needs to explicitly state it's deprecated/ignored

  2. Migration Path Missing: No guide for operators currently using trusted hash. What should they do?

  3. Operational Impact Not Documented:

    • How long will genesis sync take?
    • When should operators use snapshots instead?
    • What are the peer retention requirements?

7. Specific Code Issues

Issue 1: Panic on Broadcast Errors (pkg/sync/sync_service.go:167-169 in PR)

syncService.logger.Error().Err(err).Msg("failed to broadcast")
panic(err)  // ⚠️ PANIC in production code

Severity: High
Rationale: Panicking on broadcast failure can crash the entire node. Errors should be returned or handled gracefully.

Issue 2: Panic on Validation Errors (pkg/sync/sync_service.go:154-156 in PR)

if err := headerOrData.Validate(); err != nil {
    syncService.logger.Error().Err(err).Msg("failed to validate header")
    panic(err)  // ⚠️ PANIC on validation failure
}

Severity: High
Rationale: This can be triggered by malicious/corrupted input. Should return error instead.

Issue 3: Magic Number - Genesis Check Removed (pkg/sync/sync_service.go:138 in current)

isGenesis := headerOrData.Height() == syncService.genesis.InitialHeight

This check is in the current code but the PR version relies solely on storeInitialized boolean. The genesis-specific error handling may be lost.


Recommendations

Critical (Must Address Before Merge):

  1. This PR is already merged - but these issues need follow-up PRs:

  2. Add Test Coverage: Bring patch coverage to >80%

    • Test store resume logic
    • Test pruning/unavailable height scenarios
    • Test error paths
  3. Remove Panics: Replace with proper error returns

    • Validation failures should not crash nodes
    • Broadcast failures should be retried or logged
  4. Document Migration: Add operator guide

    • What happens to existing trusted_hash configs?
    • How to perform genesis sync vs snapshot-based sync?
    • Expected sync times

High Priority:

  1. Remove or Deprecate TrustedHash Config:

    • Either remove entirely, or
    • Add deprecation warning when set, or
    • Clearly document it's ignored
  2. Add Snapshot Support: Implement the "download DB snapshot" path mentioned in PR description

  3. Improve Error Messages: When height fetch fails, provide actionable guidance

    return fmt.Errorf("failed to fetch height %d from peers (may be pruned): %w\n"+
        "Consider: (1) syncing from genesis, or (2) loading a state snapshot", heightToQuery, err)

Medium Priority:

  1. Add Metrics: Track:

    • Heights requested from peers
    • Fetch failures by reason
    • Time to complete initial sync
  2. Validate Resumed Head: When resuming from local head, verify hash matches

  3. Document Peer Requirements: Operators need to know peers must retain full history


Summary

Strengths:

  • Simplifies sync logic by removing the trusted hash path
  • Aligns with Evolve's state reconstruction requirements
  • Reduces configuration complexity

Weaknesses:

  • Very low test coverage (7.14%) for critical sync changes
  • Panics in production code that can crash nodes
  • No fallback for pruned/unavailable heights
  • Breaking change not clearly communicated to operators
  • Configuration inconsistency (TrustedHash still documented but ignored)

Overall Assessment: The refactor is architecturally sound for Evolve's state reconstruction model, but the implementation has significant gaps in error handling, testing, and operator experience. The PR should have been blocked on test coverage alone.

Recommendation for Follow-up: Create immediate follow-up PRs to address panics and test coverage before this reaches production.


@github-actions
Copy link
Contributor

github-actions bot commented Nov 12, 2025

PR Preview Action v1.6.2
Preview removed because the pull request was closed.
2025-11-14 11:11 UTC

@codecov
Copy link

codecov bot commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 7.14286% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.78%. Comparing base (90f206e) to head (dbf0749).

Files with missing lines Patch % Lines
pkg/sync/sync_service.go 0.00% 13 Missing ⚠️
Additional details and impacted files
@@                   Coverage Diff                   @@
##           marko/inspect_store    #2838      +/-   ##
=======================================================
+ Coverage                64.72%   64.78%   +0.05%     
=======================================================
  Files                       81       81              
  Lines                     7272     7272              
=======================================================
+ Hits                      4707     4711       +4     
+ Misses                    2013     2010       -3     
+ Partials                   552      551       -1     
Flag Coverage Δ
combined 64.78% <7.14%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Base automatically changed from marko/inspect_store to marko/sync_service_fix November 13, 2025 21:02
@tac0turtle tac0turtle marked this pull request as ready for review November 14, 2025 11:10
@tac0turtle tac0turtle merged commit f728e4d into marko/sync_service_fix Nov 14, 2025
12 of 17 checks passed
@tac0turtle tac0turtle deleted the marko/remove_trusted_hash branch November 14, 2025 11:10
@github-project-automation github-project-automation bot moved this to Done in Evolve Nov 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants