
Conversation

@asoliman92
Contributor

@asoliman92 asoliman92 commented Oct 20, 2025

Add Reorg Detection Infrastructure for CCIP v1.7 Verifiers

Summary

This PR introduces blockchain reorganization detection capabilities for CCIP v1.7 verifiers to ensure safe message processing in the presence of chain reorgs and finality violations.

The LCA (lowest common ancestor) algorithm for reorg detection will come in a separate PR.

Changes

Core Infrastructure

Chain Status Tracking (protocol/chain_status.go):

  • ChainTail: Data structure for tracking contiguous block headers with validation
    • Stores blocks from stable tip (oldest) to latest tip (newest)
    • Validates parent hash chain continuity and detects duplicate block numbers
    • Methods: StableTip(), Tip(), Contains(), BlockByNumber() (see the sketch after this list)
  • ChainStatusReorg: Event type for regular reorgs with common ancestor information
  • ChainStatusFinalityViolated: Critical event when finalized blocks are reorged
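
For orientation, a minimal sketch of the ChainTail shape described above; the BlockHeader fields and the exact validation are assumptions, not the code in protocol/chain_status.go:

package protocol

import "fmt"

// BlockHeader is a minimal header view used by the tail (fields are assumptions).
type BlockHeader struct {
    Number     uint64
    Hash       string
    ParentHash string
}

// ChainTail stores headers ordered from stable tip (oldest) to latest tip (newest).
type ChainTail struct {
    blocks []BlockHeader
}

// NewChainTail validates parent-hash continuity and rejects duplicate block numbers.
func NewChainTail(blocks []BlockHeader) (ChainTail, error) {
    seen := make(map[uint64]struct{}, len(blocks))
    for i, b := range blocks {
        if _, dup := seen[b.Number]; dup {
            return ChainTail{}, fmt.Errorf("duplicate block number %d", b.Number)
        }
        seen[b.Number] = struct{}{}
        if i > 0 && b.ParentHash != blocks[i-1].Hash {
            return ChainTail{}, fmt.Errorf("broken parent hash chain at block %d", b.Number)
        }
    }
    return ChainTail{blocks: blocks}, nil
}

// StableTip returns the oldest tracked header; Tip returns the newest.
func (t ChainTail) StableTip() BlockHeader { return t.blocks[0] }
func (t ChainTail) Tip() BlockHeader       { return t.blocks[len(t.blocks)-1] }

// BlockByNumber returns the header with the given number, if tracked.
func (t ChainTail) BlockByNumber(n uint64) (BlockHeader, bool) {
    for _, b := range t.blocks {
        if b.Number == n {
            return b, true
        }
    }
    return BlockHeader{}, false
}

// Contains reports whether a header with the same number and hash is tracked.
func (t ChainTail) Contains(h BlockHeader) bool {
    b, ok := t.BlockByNumber(h.Number)
    return ok && b.Hash == h.Hash
}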

Reorg Detector (protocol/reorg_detector.go, verifier/reorg_detector_service.go):

  • ReorgDetector interface for chain-agnostic reorg monitoring (interface sketch after this list)
  • ReorgDetectorService implementation that:
    • Subscribes to block headers via SourceReader.SubscribeNewHeads()
    • Maintains chain tail of 2×finality depth blocks (e.g., 128 blocks for 64 finality depth)
    • Detects hash mismatches indicating reorgs
    • Emits status events only when problems occur (reorg or finality violation)
  • Configurable finality depth per chain (default: 64 blocks)
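
A rough sketch of how the chain-agnostic interface and its configuration could be shaped; apart from Start() and FinalityDepth, which this PR mentions, the method names, signatures, and defaults below are assumptions:

package protocol

import "context"

// ChainStatus is a marker for status events; the concrete types carry the
// details (ChainStatusReorg, ChainStatusFinalityViolated).
type ChainStatus interface{ isChainStatus() }

// ReorgDetector monitors one source chain and emits status events only when a
// reorg or finality violation is detected.
type ReorgDetector interface {
    // Start begins header monitoring and returns a channel of status events.
    Start(ctx context.Context) (<-chan ChainStatus, error)
    // Close stops monitoring and releases the subscription.
    Close() error
}

// ReorgDetectorConfig holds per-chain settings.
type ReorgDetectorConfig struct {
    // FinalityDepth is the number of blocks after which a block is considered
    // final on this chain (default: 64); the in-memory tail is sized from it.
    FinalityDepth uint64
}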

Per-Chain State Management (verifier/verification_coordinator.go):

  • New sourceState struct: Encapsulates all per-chain state (see the sketch after this list), including:
    • SourceReaderService instance
    • ReorgDetector instance
    • Per-chain pending task queue (pendingTasks []VerificationTask)
    • Per-chain mutex for queue operations
    • reorgInProgress atomic flag
    • Chain status tracking
  • Isolation benefit: a reorg on one chain affects only that chain's pending tasks; other chains continue uninterrupted
  • Replaces previous global queue architecture with per-chain queues
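
Roughly, the per-chain bundle could look like the sketch below; pendingTasks and reorgInProgress come from this PR, while the other field names, the stand-in types, and the enqueue helper are assumptions:

package verifier

import (
    "sync"
    "sync/atomic"
)

// Stand-ins for types defined elsewhere in this PR (shapes are assumptions).
type (
    SourceReaderService struct{}
    ReorgDetector       interface{}
    VerificationTask    struct{ BlockNumber uint64 }
)

// sourceState bundles everything the coordinator tracks for one source chain,
// so a reorg on chain A only touches chain A's queue and reader.
type sourceState struct {
    reader        *SourceReaderService // per-chain reader
    reorgDetector ReorgDetector        // per-chain reorg monitor

    pendingMu    sync.Mutex         // guards pendingTasks
    pendingTasks []VerificationTask // per-chain pending task queue

    reorgInProgress atomic.Bool // blocks new task additions during reorg recovery
}

// enqueue adds a task unless the chain is currently recovering from a reorg.
func (s *sourceState) enqueue(task VerificationTask) bool {
    if s.reorgInProgress.Load() {
        return false
    }
    s.pendingMu.Lock()
    defer s.pendingMu.Unlock()
    s.pendingTasks = append(s.pendingTasks, task)
    return true
}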

Coordinator Integration (verifier/verification_coordinator.go):

  • handleReorg(): Responds to regular reorgs by:
    • Setting reorgInProgress flag immediately (blocks new task additions)
    • Flushing pending tasks from affected chain's queue only
    • Synchronously resetting SourceReaderService to common ancestor block with 30s timeout
    • Waiting for reader reset to complete before proceeding
    • Updating checkpoint to safe block number
    • Clearing reorgInProgress flag only after reset completes
  • handleFinalityViolation(): Responds to finality violations (sketched after this list) by:
    • Flushing all pending tasks from affected chain's queue
    • Resetting checkpoint to safe restart block
    • Stopping the source reader completely (requires manual intervention)
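
A compressed sketch of the finality-violation path, continuing the hypothetical sourceState sketch above; the checkpoints field, WriteCheckpoint, and Stop signatures are assumptions:

package verifier

import (
    "context"
    "fmt"
)

// Hypothetical stand-ins so this sketch is self-contained; the real types live
// in this PR and the checkpoint manager.
type (
    checkpointWriter interface {
        WriteCheckpoint(ctx context.Context, block uint64) error
    }
    VerificationCoordinator     struct{ checkpoints checkpointWriter }
    ChainStatusFinalityViolated struct{ SafeRestartBlock uint64 }
)

func (r *SourceReaderService) Stop() error { return nil } // stub

// handleFinalityViolation: flush the affected chain's queue, rewind the
// checkpoint to the safe restart block, then stop the reader (no automatic
// recovery until the StatePollerService follow-up lands).
func (vc *VerificationCoordinator) handleFinalityViolation(
    ctx context.Context, state *sourceState, ev ChainStatusFinalityViolated,
) error {
    state.pendingMu.Lock()
    state.pendingTasks = nil // drop everything queued for this chain
    state.pendingMu.Unlock()

    if err := vc.checkpoints.WriteCheckpoint(ctx, ev.SafeRestartBlock); err != nil {
        return fmt.Errorf("reset checkpoint: %w", err)
    }
    return state.reader.Stop() // manual intervention required to restart
}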

Architecture

sequenceDiagram
    participant RDS as ReorgDetectorService<br/>(Chain A)
    participant VC as VerificationCoordinator
    participant SS as sourceState<br/>(Chain A)
    participant SRS as SourceReaderService<br/>(Chain A)
    participant CM as CheckpointManager
    
    RDS->>VC: ChainStatus event<br/>(reorg detected)
    VC->>SS: Set reorgInProgress = true
    VC->>SS: Lock pendingMu
    VC->>SS: Flush reorged tasks<br/>(block > common ancestor)
    VC->>SS: Unlock pendingMu
    VC->>SRS: ResetToBlock(commonAncestor)<br/>[BLOCKING]
    SRS->>CM: WriteCheckpoint(commonAncestor) (only if finality violated)
    Note over VC,SRS: Coordinator waits here until<br/>reader confirms reset
    SRS-->>VC: Reset complete
    VC->>SS: Set reorgInProgress = false
    Note over VC: Chain A ready for new tasks
    Note over VC: Chains B, C, D unaffected

Flow:

  1. ReorgDetectorService subscribes to block headers and maintains chain tail
  2. On detecting reorg/finality violation → emits ChainStatus event
  3. VerificationCoordinator receives event and invokes appropriate handler:
    • Regular reorg:
      • Sets reorgInProgress=true flag (prevents new tasks)
      • Flushes reorged tasks from pending queue
      • Blocks waiting for SourceReaderService.ResetToBlock() to complete (30s timeout)
      • Updates checkpoint after successful reset
      • Clears reorgInProgress=false flag
    • Finality violation:
      • Flushes all pending tasks
      • Stops reader completely
      • Resets checkpoint to safe restart block

Key Design Improvements

Per-Chain Queue Isolation:

  • Previous architecture: Single global pending task queue for all chains
  • New architecture: Each sourceState maintains its own pendingTasks queue
  • Benefits:
    • Reorg on Chain A only flushes Chain A's pending tasks
    • Chains B, C, D continue verification without interruption
    • Independent reorgInProgress flags prevent race conditions per chain
    • Cleaner separation of concerns and easier debugging

Synchronous Reset Behavior:

  • handleReorg() uses a deferred unlock pattern to ensure atomicity
  • The reorgInProgress flag prevents concurrent task additions during the entire reorg recovery
  • Reader reset is synchronous with a 30-second timeout context (see the sketch after this list)
  • No new tasks can be queued until the reader has confirmed reset to the common ancestor block
  • This prevents race conditions between reader state and pending task queue
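
Putting the pieces above together, the synchronous reset path might look like the sketch below, continuing the earlier hypothetical verifier sketches; VerificationTask.BlockNumber, the checkpoints field, and ResetToBlock's exact signature are assumptions:

package verifier

import (
    "context"
    "fmt"
    "time"
)

// More hypothetical stand-ins, continuing the earlier sketches.
type ChainStatusReorg struct{ CommonAncestorBlock uint64 }

func (r *SourceReaderService) ResetToBlock(ctx context.Context, block uint64) error { return nil } // stub

// handleReorg recovers one chain from a regular reorg. No new tasks are queued
// for this chain until the reader confirms the reset (bounded by 30 seconds).
func (vc *VerificationCoordinator) handleReorg(
    ctx context.Context, state *sourceState, ev ChainStatusReorg,
) error {
    state.reorgInProgress.Store(true)        // block new task additions first
    defer state.reorgInProgress.Store(false) // re-open the chain when the handler returns

    // Flush only this chain's tasks above the common ancestor.
    state.pendingMu.Lock()
    kept := state.pendingTasks[:0]
    for _, t := range state.pendingTasks {
        if t.BlockNumber <= ev.CommonAncestorBlock {
            kept = append(kept, t)
        }
    }
    state.pendingTasks = kept
    state.pendingMu.Unlock()

    // Synchronous reset: wait for the reader, but never longer than 30 seconds.
    resetCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
    defer cancel()
    if err := state.reader.ResetToBlock(resetCtx, ev.CommonAncestorBlock); err != nil {
        return fmt.Errorf("reset reader to block %d: %w", ev.CommonAncestorBlock, err)
    }

    // Only after a successful reset do we advance the checkpoint to the safe block.
    return vc.checkpoints.WriteCheckpoint(ctx, ev.CommonAncestorBlock)
}

Clearing the flag via defer keeps the chain closed for exactly as long as recovery runs, which is what prevents the race between reader state and the pending task queue described above.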

Implementation Status

Completed:

  • ✅ Core data structures (ChainTail, ChainStatus types)
  • ✅ Interface definitions (ReorgDetector, updated SourceReader)
  • ✅ Per-chain state management with isolated pending queues
  • ✅ Coordinator reorg/finality violation handlers with synchronous reset
  • ✅ Design documentation for full recovery flow

Deferred to Follow-up PRs:

  • ⏳ ReorgDetectorService.Start() full implementation
  • ⏳ monitorSubscription() algorithm (gap detection, tail updates, ancestor finding)
  • ⏳ StatePollerService for automatic reader restart after finality violations
  • ⏳ Aggregator gRPC endpoints for verifier state coordination
  • ⏳ Comprehensive unit and integration tests

Notes

  • Finality violations trigger complete reader stop (no automatic recovery yet)
  • Per-chain isolation: reorg on one chain doesn't affect others

@@ -0,0 +1,204 @@
package verifier
Contributor Author

This is just a stub service with comments on what the algorithm will look like. Actual implementation will be in a subsequent PR

reorgInProgress atomic.Bool // Set during reorg handling to prevent new tasks from being added

// Per-chain pending task queue
pendingTasks []VerificationTask
Contributor Author

Moved it to be per chain instead of a global queue

@asoliman92 asoliman92 changed the title Add Reorg Detection Infrastructure for CCIP v1.7 Verifiers Add Reorg Detection Infrastructure for Verifiers Oct 20, 2025
continue
}

sourceCfg, ok := vc.config.SourceConfigs[chainSelector]
Contributor Author

Will probably extract this into smaller functions.

// The chain tail is automatically sized to 2 * FinalityDepth to provide
// sufficient buffer for reorg detection before finality violations.
// Default: 64 blocks
FinalityDepth uint64
Contributor

Do we want to support only finality depth? Most chains support finality tags to tell which blocks are final; using them would save us from having to work out the depth each time we onboard a new chain.

Contributor Author

What do we do for chains that don't have a finality tag, then?

CC @AndresJulia

Collaborator

This is where the configuration lives for 1.6: if FinalityTagEnabled is false we specify FinalityDepth (though please confirm this with @KodeyThomas or @simsonraj).
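
For reference, a per-chain finality config along those lines might look like the following sketch (not the actual 1.6 struct; only FinalityTagEnabled and FinalityDepth are taken from the discussion above):

// Sketch of a per-chain finality config in the spirit of the 1.6 setup.
type FinalityConfig struct {
    // When true, ask the chain which blocks are final via its native finality
    // tag instead of counting confirmations.
    FinalityTagEnabled bool
    // FinalityDepth is consulted only when FinalityTagEnabled is false.
    FinalityDepth uint64
}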

//
// The status channel will receive:
// - ChainStatusReorg: When a reorg is detected (includes reorg depth and common ancestor)
// - ChainStatusFinalityViolated: When a block deeper than FinalityDepth is reorged
Contributor

With RMN we had multiple cases where there was a finality violation but no messages in either the new or the old tail. I'm wondering if it's worth checking for this here.

For instance, we can detect a finality violation that isn't a big deal because no message was re-orged.

That would play well with inconsistent chains that have low to non-existent traffic.

Contributor Author

I do love the idea. However, I think it would complicate the logic and make the reorg detector, which should be chain-agnostic and even CCIP-agnostic, tightly coupled to CCIP. I don't know whether we'll need to use it with the executor as well, but if we do, it would complicate things even further.

@0xAustinWang @winder Do you think we'll need the reorg component with executors at all?

Comment on lines 39 to 44
// SubscribeNewHeads subscribes to new block headers.
// Returns a channel that receives new headers as they arrive.
// Implementation may poll internally and push to channel for chains without native subscriptions.
// The returned channel is closed when subscription ends or context is cancelled.
// Returns error if subscription cannot be established.
SubscribeNewHeads(ctx context.Context) (<-chan *protocol.BlockHeader, error)
Collaborator

I don't think this function is available in other clients, and it leads to boilerplate code for calling LatestBlock functions, e.g. Solana's SubscribeToHeads.

Contributor Author

I know. It still abstracts this away from us: we want a single API for subscribing instead of rolling our own latestBlock polling. Also, when the underlying chain supports native subscriptions, it gives us more up-to-date heads.

Contributor Author

// Implementation may poll internally and push to channel for chains without native subscriptions.

// - Sends notifications via channel only when reorgs or finality violations are detected
//
// Tail Sizing:
// - Tail length = 2 * FinalityDepth (automatic, not configurable)
Contributor

Not sure if this would happen in practice, but is it possible that some RPCs have pruning that prevents us from rebuilding the tail? Especially if we support 2x finality depth?

Contributor Author

Updated to keep only finalityDepth blocks.


// ChainTail stores an ordered slice of block headers from stable tip to latest tip.
type ChainTail struct {
blocks []BlockHeader // ordered from oldest (stable) to newest (tip)
Collaborator

How many blocks will you hold in memory?

Contributor Author

The plan was to keep finalityDepth*2 blocks. Given @carte7000's comment, maybe I'll use only finalityDepth.

Collaborator

You'd be able to detect a finality violation because the previous hashes would change, but you might not be able to find the common ancestor.

Contributor Author

I'd like to know the common ancestor so we can update the latest safe block to restart from via the checkpoint manager before stopping this reader. We could probably do that retroactively by fetching blocks on demand instead of holding them all in memory, though.

Collaborator

If there's a finality violation beyond what you have in memory, how do you get the now-pruned blocks?

Comment on lines 41 to 42
// ChainStatus is a marker interface for different chain status types.
// Implementations: ChainStatusReorg, ChainStatusFinalityViolated
Collaborator

Why do this?

Collaborator

If you want guarantees on the enum type, use something like go-enum.

For this, something simple like type ChainStatus int seems totally sufficient.

Contributor Author

It's because each status carries a different struct, like the ones below. I thought that would be more intuitive than a single struct with a Type ChainStatus field plus both CommonAncestorBlock and ViolatedBlock, which don't make sense together. WDYT?

// ChainStatusReorg indicates a regular reorg was detected.
type ChainStatusReorg struct {
	NewTail             ChainTail
	CommonAncestorBlock uint64 // Block number of common ancestor for recovery
}

// ChainStatusFinalityViolated indicates a finality violation was detected (critical error).
type ChainStatusFinalityViolated struct {
	ViolatedBlock    BlockHeader // The finalized block that was reorged
	NewTail          ChainTail   // The new chain tail showing correct state
	SafeRestartBlock uint64      // Last known good block to restart from
}

Collaborator

If I understand correctly, the coordinator needs a type switch on the status value in order to use the response:

select {
case <-newMessage: /* normal processing things here */
case <-ctx.Done(): /* normal cancel stuff here */
case status := <-reorgStatus:
	switch status.(type) {
	case ChainStatusReorg:
		/* use ChainStatusReorg struct */
	case ChainStatusFinalityViolated:
		/* use ChainStatusFinalityViolated struct */
	default:
		panic("unknown status")
	}
}

ChainStatusReorg and ChainStatusFinalityViolated are actually the same error; the only difference is whether finality has been violated. CommonAncestorBlock and SafeRestartBlock are the same thing. What about a single status struct and no casting:

type ReorgStatus struct {
	NewTail             ChainTail
	CommonAncestorBlock uint64 // Block number of common ancestor for recovery
	FinalityViolated bool
}

@asoliman92 asoliman92 force-pushed the asoliman/reorg-detector branch from 2d5b87c to 34b1b56 October 20, 2025 16:50
@makramkd
Collaborator

Per-Chain Queue Isolation:

In the interest of making this PR just about re-org detection, does it make sense to do this isolation in another PR and then implement this re-org logic?

@asoliman92
Contributor Author

In the interest of making this PR just about re-org detection, does it make sense to do this isolation in another PR and then implement this re-org logic?

Good point. I initially wanted to do it the way you're suggesting but got carried away while implementing 😁. Basically, the re-org logic was blocking all chains whenever one chain was being reorged, and I couldn't leave it like that 😅

@asoliman92 asoliman92 marked this pull request as ready for review October 21, 2025 12:51
@asoliman92 asoliman92 requested review from a team and skudasov as code owners October 21, 2025 12:51
@asoliman92 asoliman92 enabled auto-merge (squash) October 21, 2025 14:21
@github-actions

E2E Smoke Test Results

Test Case Status Duration
TestE2ESmoke/test_extra_args_v2_messages/src->dst_msg_execution_eoa_receiver pass 9.02s
TestE2ESmoke/test_extra_args_v2_messages/dst->src_msg_execution_eoa_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v2_messages/1337->3337_msg_execution_mock_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v2_messages pass 29.04s
TestE2ESmoke/test_extra_args_v3_messages/src_dst_msg_execution_with_EOA_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v3_messages/dst_src_msg_execution_with_EOA_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v3_messages/1337->3337_msg_execution_with_EOA_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v3_messages/src_dst_msg_execution_with_mock_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v3_messages/dst_src_msg_execution_with_mock_receiver pass 10.01s
TestE2ESmoke/test_extra_args_v3_messages/src_dst_msg_execution_with_EOA_receiver_and_token_transfer pass 10.02s
TestE2ESmoke/test_extra_args_v3_messages pass 60.09s
TestE2ESmoke pass 89.3s

Full logs are available in the workflow artifacts.

@github-actions

Code coverage report:

Package main asoliman/reorg-detector
aggregator 50.23% 50.27%
cciptestinterfaces 0.00% 0.00%
ccv-evm 0.00% 0.00%
cmd 0.00% 0.00%
executor 34.46% 34.46%
indexer 25.39% 33.56%
integration 4.80% 4.63%
protocol 42.27% 45.50%
verifier 47.54% 42.15%

@asoliman92 asoliman92 merged commit 670710a into main Oct 22, 2025
9 checks passed
@asoliman92 asoliman92 deleted the asoliman/reorg-detector branch October 22, 2025 06:45