
Conversation

@asoliman92
Contributor

@asoliman92 asoliman92 commented Oct 20, 2025

Add Reorg Detection Infrastructure for CCIP v1.7 Verifiers

Summary

This PR introduces blockchain reorganization detection capabilities for CCIP v1.7 verifiers to ensure safe message processing in the presence of chain reorgs and finality violations.

The LCA (lowest common ancestor) algorithm for reorg detection will come in a separate PR.

Changes

Core Infrastructure

Chain Status Tracking (protocol/chain_status.go):

  • ChainTail: Data structure for tracking contiguous block headers with validation
    • Stores blocks from stable tip (oldest) to latest tip (newest)
    • Validates parent hash chain continuity and detects duplicate block numbers
    • Methods: StableTip(), Tip(), Contains(), BlockByNumber() (see the sketch after this list)
  • ChainStatusReorg: Event type for regular reorgs with common ancestor information
  • ChainStatusFinalityViolated: Critical event when finalized blocks are reorged
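
For orientation, a minimal sketch of the ChainTail shape described above; the BlockHeader fields and the exact validation are assumptions, not the code in protocol/chain_status.go:

package protocol

import "fmt"

// BlockHeader is a minimal header view used by the tail (fields are assumptions).
type BlockHeader struct {
    Number     uint64
    Hash       string
    ParentHash string
}

// ChainTail stores headers ordered from stable tip (oldest) to latest tip (newest).
type ChainTail struct {
    blocks []BlockHeader
}

// NewChainTail validates parent-hash continuity and rejects duplicate block numbers.
func NewChainTail(blocks []BlockHeader) (ChainTail, error) {
    seen := make(map[uint64]struct{}, len(blocks))
    for i, b := range blocks {
        if _, dup := seen[b.Number]; dup {
            return ChainTail{}, fmt.Errorf("duplicate block number %d", b.Number)
        }
        seen[b.Number] = struct{}{}
        if i > 0 && b.ParentHash != blocks[i-1].Hash {
            return ChainTail{}, fmt.Errorf("broken parent hash chain at block %d", b.Number)
        }
    }
    return ChainTail{blocks: blocks}, nil
}

// StableTip returns the oldest tracked header; Tip returns the newest.
func (t ChainTail) StableTip() BlockHeader { return t.blocks[0] }
func (t ChainTail) Tip() BlockHeader       { return t.blocks[len(t.blocks)-1] }

// BlockByNumber returns the header with the given number, if tracked.
func (t ChainTail) BlockByNumber(n uint64) (BlockHeader, bool) {
    for _, b := range t.blocks {
        if b.Number == n {
            return b, true
        }
    }
    return BlockHeader{}, false
}

// Contains reports whether a header with the same number and hash is tracked.
func (t ChainTail) Contains(h BlockHeader) bool {
    b, ok := t.BlockByNumber(h.Number)
    return ok && b.Hash == h.Hash
}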

Reorg Detector (protocol/reorg_detector.go, verifier/reorg_detector_service.go):

  • ReorgDetector interface for chain-agnostic reorg monitoring (interface sketch after this list)
  • ReorgDetectorService implementation that:
    • Subscribes to block headers via SourceReader.SubscribeNewHeads()
    • Maintains chain tail of 2×finality depth blocks (e.g., 128 blocks for 64 finality depth)
    • Detects hash mismatches indicating reorgs
    • Emits status events only when problems occur (reorg or finality violation)
  • Configurable finality depth per chain (default: 64 blocks)
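
A rough sketch of how the chain-agnostic interface and its configuration could be shaped; apart from Start() and FinalityDepth, which this PR mentions, the method names, signatures, and defaults below are assumptions:

package protocol

import "context"

// ChainStatus is a marker for status events; the concrete types carry the
// details (ChainStatusReorg, ChainStatusFinalityViolated).
type ChainStatus interface{ isChainStatus() }

// ReorgDetector monitors one source chain and emits status events only when a
// reorg or finality violation is detected.
type ReorgDetector interface {
    // Start begins header monitoring and returns a channel of status events.
    Start(ctx context.Context) (<-chan ChainStatus, error)
    // Close stops monitoring and releases the subscription.
    Close() error
}

// ReorgDetectorConfig holds per-chain settings.
type ReorgDetectorConfig struct {
    // FinalityDepth is the number of blocks after which a block is considered
    // final on this chain (default: 64); the in-memory tail is sized from it.
    FinalityDepth uint64
}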

Per-Chain State Management (verifier/verification_coordinator.go):

  • New sourceState struct: Encapsulates all per-chain state (see the sketch after this list), including:
    • SourceReaderService instance
    • ReorgDetector instance
    • Per-chain pending task queue (pendingTasks []VerificationTask)
    • Per-chain mutex for queue operations
    • reorgInProgress atomic flag
    • Chain status tracking
  • Isolation benefit: a reorg on one chain affects only that chain's pending tasks; other chains continue uninterrupted
  • Replaces previous global queue architecture with per-chain queues
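
Roughly, the per-chain bundle could look like the sketch below; pendingTasks and reorgInProgress come from this PR, while the other field names, the stand-in types, and the enqueue helper are assumptions:

package verifier

import (
    "sync"
    "sync/atomic"
)

// Stand-ins for types defined elsewhere in this PR (shapes are assumptions).
type (
    SourceReaderService struct{}
    ReorgDetector       interface{}
    VerificationTask    struct{ BlockNumber uint64 }
)

// sourceState bundles everything the coordinator tracks for one source chain,
// so a reorg on chain A only touches chain A's queue and reader.
type sourceState struct {
    reader        *SourceReaderService // per-chain reader
    reorgDetector ReorgDetector        // per-chain reorg monitor

    pendingMu    sync.Mutex         // guards pendingTasks
    pendingTasks []VerificationTask // per-chain pending task queue

    reorgInProgress atomic.Bool // blocks new task additions during reorg recovery
}

// enqueue adds a task unless the chain is currently recovering from a reorg.
func (s *sourceState) enqueue(task VerificationTask) bool {
    if s.reorgInProgress.Load() {
        return false
    }
    s.pendingMu.Lock()
    defer s.pendingMu.Unlock()
    s.pendingTasks = append(s.pendingTasks, task)
    return true
}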

Coordinator Integration (verifier/verification_coordinator.go):

  • handleReorg(): Responds to regular reorgs by:
    • Setting reorgInProgress flag immediately (blocks new task additions)
    • Flushing pending tasks from affected chain's queue only
    • Synchronously resetting SourceReaderService to common ancestor block with 30s timeout
    • Waiting for reader reset to complete before proceeding
    • Updating checkpoint to safe block number
    • Clearing reorgInProgress flag only after reset completes
  • handleFinalityViolation(): Responds to finality violations (sketched after this list) by:
    • Flushing all pending tasks from affected chain's queue
    • Resetting checkpoint to safe restart block
    • Stopping the source reader completely (requires manual intervention)
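
A compressed sketch of the finality-violation path, continuing the hypothetical sourceState sketch above; the checkpoints field, WriteCheckpoint, and Stop signatures are assumptions:

package verifier

import (
    "context"
    "fmt"
)

// Hypothetical stand-ins so this sketch is self-contained; the real types live
// in this PR and the checkpoint manager.
type (
    checkpointWriter interface {
        WriteCheckpoint(ctx context.Context, block uint64) error
    }
    VerificationCoordinator     struct{ checkpoints checkpointWriter }
    ChainStatusFinalityViolated struct{ SafeRestartBlock uint64 }
)

func (r *SourceReaderService) Stop() error { return nil } // stub

// handleFinalityViolation: flush the affected chain's queue, rewind the
// checkpoint to the safe restart block, then stop the reader (no automatic
// recovery until the StatePollerService follow-up lands).
func (vc *VerificationCoordinator) handleFinalityViolation(
    ctx context.Context, state *sourceState, ev ChainStatusFinalityViolated,
) error {
    state.pendingMu.Lock()
    state.pendingTasks = nil // drop everything queued for this chain
    state.pendingMu.Unlock()

    if err := vc.checkpoints.WriteCheckpoint(ctx, ev.SafeRestartBlock); err != nil {
        return fmt.Errorf("reset checkpoint: %w", err)
    }
    return state.reader.Stop() // manual intervention required to restart
}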

Architecture

sequenceDiagram
    participant RDS as ReorgDetectorService<br/>(Chain A)
    participant VC as VerificationCoordinator
    participant SS as sourceState<br/>(Chain A)
    participant SRS as SourceReaderService<br/>(Chain A)
    participant CM as CheckpointManager
    
    RDS->>VC: ChainStatus event<br/>(reorg detected)
    VC->>SS: Set reorgInProgress = true
    VC->>SS: Lock pendingMu
    VC->>SS: Flush reorged tasks<br/>(block > common ancestor)
    VC->>SS: Unlock pendingMu
    VC->>SRS: ResetToBlock(commonAncestor)<br/>[BLOCKING]
    SRS->>CM: WriteCheckpoint(commonAncestor) (only if finality violated)
    Note over VC,SRS: Coordinator waits here until<br/>reader confirms reset
    SRS-->>VC: Reset complete
    VC->>SS: Set reorgInProgress = false
    Note over VC: Chain A ready for new tasks
    Note over VC: Chains B, C, D unaffected

Flow:

  1. ReorgDetectorService subscribes to block headers and maintains chain tail
  2. On detecting reorg/finality violation → emits ChainStatus event
  3. VerificationCoordinator receives event and invokes appropriate handler:
    • Regular reorg:
      • Sets reorgInProgress=true flag (prevents new tasks)
      • Flushes reorged tasks from pending queue
      • Blocks waiting for SourceReaderService.ResetToBlock() to complete (30s timeout)
      • Updates checkpoint after successful reset
      • Clears reorgInProgress=false flag
    • Finality violation:
      • Flushes all pending tasks
      • Stops reader completely
      • Resets checkpoint to safe restart block

Key Design Improvements

Per-Chain Queue Isolation:

  • Previous architecture: Single global pending task queue for all chains
  • New architecture: Each sourceState maintains its own pendingTasks queue
  • Benefits:
    • Reorg on Chain A only flushes Chain A's pending tasks
    • Chains B, C, D continue verification without interruption
    • Independent reorgInProgress flags prevent race conditions per chain
    • Cleaner separation of concerns and easier debugging

Synchronous Reset Behavior:

  • handleReorg() uses a deferred unlock pattern to ensure atomicity
  • The reorgInProgress flag prevents concurrent task additions during the entire reorg recovery
  • Reader reset is synchronous with a 30-second timeout context (see the sketch after this list)
  • No new tasks can be queued until the reader has confirmed reset to the common ancestor block
  • This prevents race conditions between reader state and pending task queue
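
Putting the pieces above together, the synchronous reset path might look like the sketch below, continuing the earlier hypothetical verifier sketches; VerificationTask.BlockNumber, the checkpoints field, and ResetToBlock's exact signature are assumptions:

package verifier

import (
    "context"
    "fmt"
    "time"
)

// More hypothetical stand-ins, continuing the earlier sketches.
type ChainStatusReorg struct{ CommonAncestorBlock uint64 }

func (r *SourceReaderService) ResetToBlock(ctx context.Context, block uint64) error { return nil } // stub

// handleReorg recovers one chain from a regular reorg. No new tasks are queued
// for this chain until the reader confirms the reset (bounded by 30 seconds).
func (vc *VerificationCoordinator) handleReorg(
    ctx context.Context, state *sourceState, ev ChainStatusReorg,
) error {
    state.reorgInProgress.Store(true)        // block new task additions first
    defer state.reorgInProgress.Store(false) // re-open the chain when the handler returns

    // Flush only this chain's tasks above the common ancestor.
    state.pendingMu.Lock()
    kept := state.pendingTasks[:0]
    for _, t := range state.pendingTasks {
        if t.BlockNumber <= ev.CommonAncestorBlock {
            kept = append(kept, t)
        }
    }
    state.pendingTasks = kept
    state.pendingMu.Unlock()

    // Synchronous reset: wait for the reader, but never longer than 30 seconds.
    resetCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
    defer cancel()
    if err := state.reader.ResetToBlock(resetCtx, ev.CommonAncestorBlock); err != nil {
        return fmt.Errorf("reset reader to block %d: %w", ev.CommonAncestorBlock, err)
    }

    // Only after a successful reset do we advance the checkpoint to the safe block.
    return vc.checkpoints.WriteCheckpoint(ctx, ev.CommonAncestorBlock)
}

Clearing the flag via defer keeps the chain closed for exactly as long as recovery runs, which is what prevents the race between reader state and the pending task queue described above.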

Implementation Status

Completed:

  • ✅ Core data structures (ChainTail, ChainStatus types)
  • ✅ Interface definitions (ReorgDetector, updated SourceReader)
  • ✅ Per-chain state management with isolated pending queues
  • ✅ Coordinator reorg/finality violation handlers with synchronous reset
  • ✅ Design documentation for full recovery flow

Deferred to Follow-up PRs:

  • ⏳ ReorgDetectorService.Start() full implementation
  • ⏳ monitorSubscription() algorithm (gap detection, tail updates, ancestor finding)
  • ⏳ StatePollerService for automatic reader restart after finality violations
  • ⏳ Aggregator gRPC endpoints for verifier state coordination
  • ⏳ Comprehensive unit and integration tests

Notes

  • Finality violations trigger complete reader stop (no automatic recovery yet)
  • Per-chain isolation: reorg on one chain doesn't affect others

@@ -0,0 +1,204 @@
package verifier
Contributor Author

This is just a stub service with comments on what the algorithm will look like. Actual implementation will be in a subsequent PR

reorgInProgress atomic.Bool // Set during reorg handling to prevent new tasks from being added

// Per-chain pending task queue
pendingTasks []VerificationTask
Contributor Author

Moved it to be per chain instead of a global queue

@asoliman92 asoliman92 changed the title Add Reorg Detection Infrastructure for CCIP v1.7 Verifiers Add Reorg Detection Infrastructure for Verifiers Oct 20, 2025
continue
}

sourceCfg, ok := vc.config.SourceConfigs[chainSelector]
Contributor Author

Will probably extract this into smaller functions.

// The chain tail is automatically sized to 2 * FinalityDepth to provide
// sufficient buffer for reorg detection before finality violations.
// Default: 64 blocks
FinalityDepth uint64
Contributor

Do we want to support only finality depth? Most chains support finality tags to tell which blocks are final; using them would save us from having to work out the depth each time we onboard a new chain.

Contributor Author

What do we do for chains that don't have a finality tag, then?

CC @AndresJulia

Collaborator

This is where the configuration lives for 1.6: if FinalityTagEnabled is false we specify FinalityDepth (though please confirm this with @KodeyThomas or @simsonraj).
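
For reference, a per-chain finality config along those lines might look like the following sketch (not the actual 1.6 struct; only FinalityTagEnabled and FinalityDepth are taken from the discussion above):

// Sketch of a per-chain finality config in the spirit of the 1.6 setup.
type FinalityConfig struct {
    // When true, ask the chain which blocks are final via its native finality
    // tag instead of counting confirmations.
    FinalityTagEnabled bool
    // FinalityDepth is consulted only when FinalityTagEnabled is false.
    FinalityDepth uint64
}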

//
// The status channel will receive:
// - ChainStatusReorg: When a reorg is detected (includes reorg depth and common ancestor)
// - ChainStatusFinalityViolated: When a block deeper than FinalityDepth is reorged
Contributor

With RMN we had multiple cases where there was a finality violation but no messages in either the new or the old tail. I'm wondering if it's worth checking for this here.

For instance, we can detect a finality violation that isn't a big deal because no message was re-orged.

That would play well with inconsistent chains that have low to non-existent traffic.

Contributor Author

I do love the idea. However, I think it would complicate the logic and make the reorg detector, which should be chain-agnostic and even CCIP-agnostic, tightly coupled to CCIP. I don't know whether we'll need to use it with the executor as well, but if we do, it would complicate things even further.

@0xAustinWang @winder Do you think we'll need the reorg component with executors at all?

Comment on lines 39 to 44
// SubscribeNewHeads subscribes to new block headers.
// Returns a channel that receives new headers as they arrive.
// Implementation may poll internally and push to channel for chains without native subscriptions.
// The returned channel is closed when subscription ends or context is cancelled.
// Returns error if subscription cannot be established.
SubscribeNewHeads(ctx context.Context) (<-chan *protocol.BlockHeader, error)
Collaborator

I don't think this function is available in other clients, and it leads to boilerplate code for calling LatestBlock functions, e.g. Solana's SubscribeToHeads.

Contributor Author

I know. It still abstracts this away from us: we want a single API for subscribing instead of rolling our own latestBlock polling. Also, when the underlying chain supports native subscriptions, it gives us more up-to-date heads.

Contributor Author

// Implementation may poll internally and push to channel for chains without native subscriptions.

// - Sends notifications via channel only when reorgs or finality violations are detected
//
// Tail Sizing:
// - Tail length = 2 * FinalityDepth (automatic, not configurable)
Contributor

Not sure if this would happen in practice, but is it possible that some RPCs have pruning that prevents us from rebuilding the tail? Especially if we support 2x finality depth?

Contributor Author

Updated to keep only finalityDepth blocks.


// ChainTail stores an ordered slice of block headers from stable tip to latest tip.
type ChainTail struct {
blocks []BlockHeader // ordered from oldest (stable) to newest (tip)
Collaborator

How many blocks will you hold in memory?

Contributor Author

The plan was to keep finalityDepth*2 blocks. Given @carte7000's comment, maybe I'll use only finalityDepth.

Collaborator

You'd be able to detect a finality violation because the previous hashes would change, but you might not be able to find the common ancestor.

Contributor Author

I'd like to know the common ancestor so we can update the latest safe block to restart from via the checkpoint manager before stopping this reader. We could probably do that retroactively by fetching blocks on demand instead of holding them all in memory, though.

Collaborator

If there's a finality violation beyond what you have in memory, how do you get the now-pruned blocks?

Comment on lines 41 to 42
// ChainStatus is a marker interface for different chain status types.
// Implementations: ChainStatusReorg, ChainStatusFinalityViolated
Collaborator

Why do this?

Collaborator

If you want guarantees on the enum type, use something like go-enum.

For this, something simple like type ChainStatus int seems totally sufficient.

Contributor Author

It's because each status carries a different struct, like the ones below. I thought that would be more intuitive than a single struct with a Type ChainStatus field plus both CommonAncestorBlock and ViolatedBlock, which don't make sense together. WDYT?

// ChainStatusReorg indicates a regular reorg was detected.
type ChainStatusReorg struct {
	NewTail             ChainTail
	CommonAncestorBlock uint64 // Block number of common ancestor for recovery
}

// ChainStatusFinalityViolated indicates a finality violation was detected (critical error).
type ChainStatusFinalityViolated struct {
	ViolatedBlock    BlockHeader // The finalized block that was reorged
	NewTail          ChainTail   // The new chain tail showing correct state
	SafeRestartBlock uint64      // Last known good block to restart from
}

Collaborator

If I understand correctly, the coordinator needs a type switch on the status value in order to use the response:

select {
case <-newMessage: /* normal processing things here */
case <-ctx.Done(): /* normal cancel stuff here */
case status := <-reorgStatus:
	switch status.(type) {
	case ChainStatusReorg:
		/* use ChainStatusReorg struct */
	case ChainStatusFinalityViolated:
		/* use ChainStatusFinalityViolated struct */
	default:
		panic("unknown status")
	}
}

ChainStatusReorg and ChainStatusFinalityViolated are actually the same error; the only difference is whether finality has been violated. CommonAncestorBlock and SafeRestartBlock are the same thing. What about a single status struct and no casting:

type ReorgStatus struct {
	NewTail             ChainTail
	CommonAncestorBlock uint64 // Block number of common ancestor for recovery
	FinalityViolated bool
}

@asoliman92 asoliman92 force-pushed the asoliman/reorg-detector branch from 2d5b87c to 34b1b56 October 20, 2025 16:50
@makramkd
Collaborator

Per-Chain Queue Isolation:

In the interest of making this PR just about re-org detection, does it make sense to do this isolation in another PR and then implement this re-org logic?

@asoliman92
Contributor Author

In the interest of making this PR just about re-org detection, does it make sense to do this isolation in another PR and then implement this re-org logic?

Good point. I initially wanted to do it the way you're suggesting but got carried away while implementing 😁. Basically, the re-org logic was blocking all chains whenever one chain was being reorged, and I couldn't leave it like that 😅

@asoliman92 asoliman92 marked this pull request as ready for review October 21, 2025 12:51
@asoliman92 asoliman92 requested review from a team and skudasov as code owners October 21, 2025 12:51
@asoliman92 asoliman92 enabled auto-merge (squash) October 21, 2025 14:21
@github-actions

E2E Smoke Test Results

Test Case Status Duration
TestE2ESmoke/test_extra_args_v2_messages/src->dst_msg_execution_eoa_receiver pass 9.02s
TestE2ESmoke/test_extra_args_v2_messages/dst->src_msg_execution_eoa_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v2_messages/1337->3337_msg_execution_mock_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v2_messages pass 29.04s
TestE2ESmoke/test_extra_args_v3_messages/src_dst_msg_execution_with_EOA_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v3_messages/dst_src_msg_execution_with_EOA_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v3_messages/1337->3337_msg_execution_with_EOA_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v3_messages/src_dst_msg_execution_with_mock_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v3_messages/dst_src_msg_execution_with_mock_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v3_messages/src_dst_msg_execution_with_EOA_receiver_and_token_transfer pass 10.02s
TestE2ESmoke/test_extra_args_v3_messages pass 60.09s
TestE2ESmoke pass 89.3s

Full logs are available in the workflow artifacts.

@github-actions

Code coverage report:

Package main asoliman/reorg-detector
aggregator 50.23% 50.27%
cciptestinterfaces 0.00% 0.00%
ccv-evm 0.00% 0.00%
cmd 0.00% 0.00%
executor 34.46% 34.46%
indexer 25.39% 33.56%
integration 4.80% 4.63%
protocol 42.27% 45.50%
verifier 47.54% 42.15%

@asoliman92 asoliman92 merged commit 670710a into main Oct 22, 2025
9 checks passed
@asoliman92 asoliman92 deleted the asoliman/reorg-detector branch October 22, 2025 06:45