refactor(transport): replace handshake pipeline #2065

sanity · 2025-11-07T17:50:34Z

Summary

replace the legacy connect/handshake path with the trimmed two-message flow
consolidate ring + transport plumbing (connection manager counters, packet handling, tracing)
add GitHub workflow that runs the river six-peer regression on the freenet-128gb runner

Testing

cargo test -p freenet connect::tests
cargo test -p freenet node::network_bridge::priority_select::tests
cargo test connectivity::test_gateway_reconnection
FREENET_CORE_PATH=/home/ian/code/freenet/freenet-core/courtesy-connection-budget cargo test --test message_flow river_message_flow_over_freenet_six_peers_five_rounds -- --ignored --exact

sanity · 2025-11-07T22:51:21Z

@claude Please review PR

claude · 2025-11-07T22:51:40Z

Claude finished @sanity's task —— View job

PR Review: Transport/Handshake Pipeline Refactor

I've completed a comprehensive review of this PR. This is a significant refactoring that simplifies the connect/handshake flow and consolidates connection management logic. Overall, the changes are well-structured and improve code maintainability.

Summary of Changes

This PR accomplishes:

Replaces legacy multi-stage connect handshake with a streamlined two-message flow
Consolidates ring + transport connection management
Removes ~1500 lines of complex handshake logic
Adds CI workflow for six-peer regression testing
Net reduction of ~1300 lines of code

Key Findings

1. Handshake Simplification (✅ Major Improvement)

File: crates/core/src/node/network_bridge/handshake.rs

The new handshake handler is dramatically simpler:

Before: 1732 lines of complex state machine logic
After: 225 lines of lightweight connection adapter
Removes 651 lines of handshake tests (deleted file: handshake/tests.rs)

Strengths:

Much cleaner separation of concerns
Handshake driver now just wires transport events to/from event loop
Higher-level routing decisions moved into ConnectOp where they belong

Concerns:

⚠️ Loss of test coverage: The deleted handshake/tests.rs file contained 651 lines of tests. While the old code was complex, these tests validated important handshake scenarios. The PR description mentions some replacement tests (cargo test -p freenet connect::tests), but it's unclear if all scenarios are covered.

Recommendation: Consider adding comments or documentation explaining which test scenarios from the old handshake tests are now covered by the new connect::tests and which (if any) are intentionally not covered because they're obsolete.

2. Connect Operation Refactor (✅ Good Design)

File: crates/core/src/operations/connect.rs

The new two-message connect flow is elegant:

ConnectMsg::Request: Joiner sends request toward target with TTL and visited list
ConnectMsg::Response: Acceptors respond back along discovered path
ConnectMsg::ObservedAddress: Peers notify joiners of their observed public address

Strengths:

Clear state separation: JoinerState vs RelayState
TTL-based forwarding prevents infinite loops
Visited list prevents trivial routing cycles
Observable address discovery for NAT traversal

Code Quality:

Lines 189-254: RelayState::handle_request() is well-structured with clear action outputs
Lines 317-334: JoinerState::register_acceptance() has good progress tracking
Lines 766-834: join_ring_request() has proper error handling and backoff logic

Minor Issues:

Line 299: courtesy_hint() always returns self.op_manager.ring.open_connections() == 0 - this seems simplistic. Is there documentation explaining the courtesy connection strategy?
Lines 836-944: initial_join_procedure() spawns an infinite loop that checks connection health. Consider adding a way to gracefully shut down this task.

3. Connection Manager Changes (✅ Solid)

File: crates/core/src/ring/connection_manager.rs

Improvements:

Lines 122-266: should_accept() has excellent logging and atomic counter management
Lines 203-219: Gateway-specific direct-accept limit (GATEWAY_DIRECT_ACCEPT_LIMIT = 2) is a good safeguard
Lines 268-293: New record_pending_location() method properly tracks peers during handshake
Lines 386-428: update_peer_identity() handles provisional → final peer identity transitions

Potential Issues:

Lines 151-178: The reserved_connections CAS loop could theoretically retry forever if under extreme contention. Consider adding a retry limit.
Lines 345-351: prune_in_transit_connection() and prune_alive_connection() both call the same prune_connection() with a boolean flag - this is fine, but the naming could be clearer (e.g., was_alive instead of is_alive).

4. Network Bridge Changes (⚠️ Complex but Necessary)

File: crates/core/src/node/network_bridge/p2p_protoc.rs

This file grew from ~900 to ~2000 lines, absorbing handshake coordination logic.

Concerns:

Lines 347-360: Bug check for self-targeting messages logs as "BUG" and converts to inbound processing. This is good defensive programming, but the error message says "This indicates a routing logic error". If this shouldn't happen, should it panic in debug builds?
Lines 367-388: TOCTOU race condition mitigation is well-documented (line 367 comment), but the fallback logic at lines 374-388 (resolving by public key when address is unspecified) feels fragile. What happens if multiple connections exist with different addresses but same key?
Lines 449-497: Connection establishment is spawned into a background task with 20-second timeout. If this fails, the original operation may hang waiting for a connection that never completes. Is there a cleanup mechanism?

Strengths:

Lines 1319-1454: handle_handshake_action() consolidates handshake event processing cleanly
Lines 1498-1593: handle_successful_connection() properly coordinates connection manager updates
Lines 1605-1716: handle_peer_connection_msg() handles provisional peer identity updates well (lines 1608-1646)

5. Transport Layer (✅ Minor Cleanup)

File: crates/core/src/transport/peer_connection.rs

Removed 78 lines of unused code
File crates/core/src/transport/mod.rs removed 7 lines of dead code

This is good cleanup.

6. CI/Testing (✅ Good Addition)

File: .github/workflows/six-peer-regression.yml

New workflow runs River six-peer regression test
120-minute timeout is generous
Tests on both stack/connect-transport-rewrite and main branches

Suggestion: Consider adding a fast-fail basic connectivity test that runs on every PR before the expensive 2-hour test.

7. Test Updates (✅ Updated Appropriately)

Multiple test files updated to match new connection patterns:

crates/core/tests/connectivity.rs: Updated gateway reconnection test
crates/core/tests/operations.rs: Updated for new connect message structure
crates/core/tests/error_notification.rs: Updated error scenarios

Issues Found

🔴 Critical

None found - the code appears functional.

🟡 Medium Priority

Test Coverage Gap (handshake/tests.rs deleted):
- The PR removes 651 lines of handshake tests
- Need to verify all important scenarios are covered by new tests
- Location: crates/core/src/node/network_bridge/handshake/tests.rs (deleted)
Infinite Loop Task Management (connect.rs:836-944):
- initial_join_procedure() spawns infinite loop with no shutdown mechanism
- Could leak tasks on node restart
- Location: crates/core/src/operations/connect.rs:836-944
Potential Connection Leak (p2p_protoc.rs:449-497):
- Background task spawned to wait for connection with 20s timeout
- No obvious cleanup if original requester goes away
- Location: crates/core/src/node/network_bridge/p2p_protoc.rs:449-497

🟢 Low Priority (Style/Documentation)

Courtesy Connection Strategy (connect.rs:299):
- Logic is simple (open_connections() == 0) but not documented
- Location: crates/core/src/operations/connect.rs:299
Reserved Connections CAS Loop (connection_manager.rs:151-178):
- Could add retry limit for extreme contention scenarios
- Location: crates/core/src/ring/connection_manager.rs:151-178
Self-Routing Bug Detection (p2p_protoc.rs:347-360):
- Logs as "BUG" but continues execution - consider panic in debug mode?
- Location: crates/core/src/node/network_bridge/p2p_protoc.rs:347-360

Recommendations

Before merge:
- Add comments mapping old handshake test scenarios to new test coverage
- Document the courtesy connection strategy (why open_connections() == 0?)
- Review connection cleanup in background tasks (line 449 in p2p_protoc.rs)
Follow-up PRs:
- Add graceful shutdown for initial_join_procedure() infinite loop
- Consider retry limits on CAS loops in connection_manager
- Add metrics/observability for connection state transitions

Conclusion

This is a solid refactoring that significantly simplifies the codebase while preserving functionality. The new two-message connect flow is more maintainable than the legacy multi-stage approach. The main concern is ensuring test coverage remains comprehensive after removing the old handshake tests.

Recommendation: ✅ Approve with minor follow-up items

The code quality is good, and the concerns raised are mostly about documentation and future robustness rather than correctness issues. The reduction of ~1300 lines while maintaining functionality is impressive.

Review completed by Claude

sanity · 2025-11-08T01:43:33Z

Line 299: courtesy_hint() always returns self.op_manager.ring.open_connections() == 0 - this seems simplistic. Is there documentation explaining the courtesy connection strategy?

Courtesy connections still need work, I have a local branch that removes courtesy connections - currently they'll stick around like normal connections which isn't ideal.

sanity · 2025-11-08T02:06:41Z

[Codex] Addressed the outstanding feedback:

documented why the current courtesy hint is intentionally simple and tied it back to the follow-up branch that will finish the feature.
taught initial_join_procedure to return a JoinHandle and made the node runtime/testing harness keep and abort it, so the watchdog loop can be shut down cleanly when the node exits.

Tests: cargo test -p freenet connect::tests:: -- --nocapture

sanity · 2025-11-08T02:09:26Z

[Codex] Rebased onto the latest stack/subscription-routing to clear the GitHub conflicts (the only manual resolution was in crates/core/src/operations/put.rs, making sure the new connect-origin fields coexist with the TTL/skip-list tweaks). Re-ran cargo test -p freenet connect::tests:: -- --nocapture after the rebase and pushed with --force-with-lease.

sanity · 2025-11-09T18:29:14Z

[Codex] Rebased onto the latest stack/subscription-routing so the PR is conflict-free again. Only automated picks this time—no manual edits were required beyond what was already on the branch. Re-ran cargo test -p freenet connect::tests:: -- --nocapture after the rebase and force-pushed with lease.

sanity · 2025-11-09T21:34:15Z

[Codex] Rebased onto the latest stack/subscription-routing (which force-pushed earlier today). The only conflicts were the two session-actor commits that upstream already absorbed, so I kept the newer upstream implementation while replaying the stack. Targeted tests (cargo test -p freenet connect::tests:: -- --nocapture) pass, and the branch has been force-pushed with lease.

.github/workflows/six-peer-regression.yml

iduartgomez · 2025-11-13T15:56:02Z

crates/core/src/node/network_bridge/handshake/tests.rs

All this tests were testing different edge cases which we are not now covering anywhere. This edge cases all can and will happen.

One of the nice things of having this logic separated is that it was actually testable in isolation, now by merging it all with the connect operation we cannot do that. This means we probably have some regressions waiting around the corner.

The idea of the handshake handler was to segregate transient connections which hadn't been logically accepted from those which were not so they could be dropped easily without affecting on-going operations. I am reviewing the PR so I am not sure if that has been maintained with the changes done, but would be good having an overall answer to the above.

@claude review the comment https://github.com/freenet/freenet-core/pull/2065/files#r2524017704 and analyze the PR with that in mind.

Claude finished @iduartgomez's task —— View job

Analysis of Test Coverage and Transient Connection Handling

I've completed a thorough review of the deleted handshake tests and the new architecture. Here's my assessment of the concerns raised:

Test Coverage Analysis

What Was Deleted (651 lines from handshake/tests.rs)

The old test file contained 5 stream-based integration tests:

test_stream_gateway_inbound_conn_success - Gateway accepts unsolicited inbound connection

test_stream_gateway_inbound_conn_rejected - Gateway rejects inbound connection with AcceptedBy{accepted: false}

test_stream_peer_to_gw_outbound_conn - Peer connects to gateway and receives acceptance

test_stream_peer_to_peer_outbound_conn_succeeded - Peer-to-peer outbound connection success

test_stream_peer_to_gw_outbound_conn_rejected - Complex multi-hop forwarding scenario with 6 forward attempts and eventual acceptance

These tests were integration-style tests that:

Used mock transport layer (TransportMock)

Tested the full handshake state machine with real network message exchanges

Validated multi-stage connect flows (StartJoinReq, ForwardRequest, AcceptedBy responses)

Exercised retry/forwarding logic when initial connections were rejected

What Exists Now

The new connect.rs test suite contains 3 unit tests (lines 950-1132):

relay_accepts_when_policy_allows - Tests RelayState::handle_request acceptance logic

relay_forwards_when_not_accepting - Tests forwarding when relay declines to accept

joiner_tracks_acceptance - Tests JoinerState::register_acceptance

init_join_request_initializes_state - Tests ConnectOp::initiate_join_request

These are pure unit tests that:

Mock the context layer (TestRelayContext)

Test individual functions in isolation

Do NOT involve actual network message passing

Do NOT test the full end-to-end flow

The integration tests in crates/core/tests/connectivity.rs:

test_gateway_reconnection - Tests reconnection after disconnect (high-level)

test_basic_gateway_connectivity - Tests WebSocket connection only

test_three_node_network_connectivity - Tests full mesh with PUT/GET operations

Critical Gap: Loss of Integration-Level Testing

The reviewer's concern is valid. Here's what's missing:

Edge Cases No Longer Covered

Inbound connection rejection handling (test_stream_gateway_inbound_conn_rejected)

Old test validated that rejection messages properly cleaned up state

New code has NO test for inbound rejection flow

Multi-hop forwarding logic (test_stream_peer_to_gw_outbound_conn_rejected)

Old test validated 6 forward attempts with eventual success

Tested TTL countdown, visited list updates, and forward chain

New code has NO integration test for multi-hop scenarios

Concurrent connection state transitions

Old tests used async select loops to validate state machine transitions

New code unit tests don't exercise concurrent message handling

Transport-level error scenarios

Old tests could inject transport failures and test recovery

New tests use mocks that always succeed

Message ordering and race conditions

Old tests validated that out-of-order messages were handled correctly

New tests don't exercise the p2p_protoc message dispatch logic

Transient Connection Segregation Analysis

The reviewer asks: "The idea of the handshake handler was to segregate transient connections which hadn't been logically accepted from those which were not so they could be dropped easily without affecting on-going operations."

Old Architecture (Deleted)

HandshakeHandler maintained complex state machines (InboundState, OutboundState)

Connections had explicit lifecycle states: Initializing, Accepting, Accepted, Rejected

Transient connections were tracked separately in the handler's state

Failed handshakes could be cleaned up without touching the ring

New Architecture

The segregation is now distributed across two subsystems:

ConnectionManager tracks pending vs active via atomic counters:

reserved_connections - connections in handshake (transient)

open_connections - fully established connections

record_pending_location() - adds peer to routing before handshake completes (line 273)

prune_in_transit_connection() - cleans up failed handshakes (line 349)

prune_alive_connection() - cleans up active connections (line 345)

HandshakeHandler is now a thin shim:

expected_inbound map tracks peers we're expecting (line 136)

Commands: Connect, ExpectInbound, DropConnection

Events: InboundConnection, OutboundEstablished, OutboundFailed

No state machine logic - just wires transport to event loop

ConnectOp in p2p_protoc handles higher-level logic:

State: WaitingForResponses (joiner) vs relay state

Connection expectations registered via ExpectInbound command

Failures trigger prune_in_transit_connection() to clean up

Assessment: The segregation still exists but is less explicit:

✅ Reserved connections are tracked separately from open connections

✅ Failed handshakes are cleaned up via prune_in_transit_connection()

⚠️ However: record_pending_location() makes pending peers immediately visible to the router (line 273-293 in connection_manager.rs), which could route operations to peers that haven't finished handshaking

⚠️ Cleanup depends on p2p_protoc correctly calling prune_in_transit_connection() on every failure path

Regression Risks

Based on this analysis, I identify HIGH regression risk in these areas:

1. Multi-hop Forwarding (HIGH RISK)

Old test: test_stream_peer_to_gw_outbound_conn_rejected validated 6 forward attempts

New coverage: Unit test relay_forwards_when_not_accepting only tests single forward

Risk: TTL countdown bugs, visited list corruption, infinite forwarding loops

2. Connection State Cleanup (HIGH RISK)

Old test: Explicit state machine transitions with cleanup validation

New coverage: No integration test validating cleanup on failures

Risk: Reserved connection counter leaks, routing table pollution from failed handshakes

3. Concurrent Connection Attempts (MEDIUM RISK)

Old test: Multiple simultaneous connections with async race conditions

New coverage: Unit tests are synchronous

Risk: Race conditions in reserved_connections CAS loop, double-acceptance bugs

4. Inbound Rejection Flow (MEDIUM RISK)

Old test: test_stream_gateway_inbound_conn_rejected validated rejection messages

New coverage: None

Risk: Rejected connections not cleaned up properly

5. Routing to Pending Peers (NEW RISK)

Issue: record_pending_location() makes pending peers visible to router before handshake completes

Coverage: None - new behavior not tested

Risk: Operations routed to peers that fail handshake mid-flight

Recommendations

Immediate (Before Merge)

Add integration test for multi-hop forwarding:

#[freenet_test] async fn test_multi_hop_connect_with_ttl_exhaustion()

Should validate: TTL countdown, visited list, eventual acceptance after N hops

Add integration test for connection cleanup:

#[freenet_test] async fn test_failed_handshake_cleanup()

Should validate: reserved_connections decremented, location_for_peer cleaned up

Document the pending peer routing behavior:

Add comments in record_pending_location() explaining why pending peers are made visible

Document what happens if an operation is routed to a peer that fails handshake

Follow-up (Post-Merge)

Add stress test for concurrent connections:

#[freenet_test] async fn test_concurrent_connection_attempts()

Should validate: No counter overflows, proper CAS retries

Add test for inbound rejection:

#[freenet_test] async fn test_inbound_connection_rejected_by_policy()

Consider adding back explicit transient state tracking:

Current architecture relies on correct cleanup calls from multiple places

A centralized "pending connections" tracker would be more robust

Conclusion

The reviewer's concerns are well-founded. This PR:

✅ Successfully simplifies the codebase (-1300 lines)

✅ Maintains separation between transient and active connections

❌ Loses significant integration test coverage

❌ Introduces new risks (routing to pending peers)

⚠️ Makes the cleanup contract less explicit (distributed across multiple call sites)

The unit tests validate individual components, but do not prove that the full connect flow works correctly. The integration tests in connectivity.rs are higher-level (PUT/GET operations) and won't catch handshake-specific bugs.

Recommendation: Add at least the two critical integration tests (multi-hop forwarding + cleanup) before merging. The simplified architecture is an improvement, but it needs integration-level validation to avoid regressions.

@iduartgomez Transient “courtesy” connections are still tracked, but they currently share the same connection budget as any other peer. PR #2067 reintroduces a dedicated courtesy bucket and the cleanup logic that made it easy to drop unaccepted sockets without touching steady-state neighbors. I left this PR focused on getting CI green and test coverage restored; the courtesy-specific handling (and broader connection-budget rewrite) continue in #2067 where the transport join flow is already being refactored to keep transient links separate again.

For the coverage side I added:

crates/core/src/node/network_bridge/handshake.rs: an ExpectedInboundTracker that mirrors the old handshake handler’s transient-map semantics plus unit tests that exercise register/consume/drop/replacement. That gives us a deterministic signal if transient sockets ever stop being segregated.

crates/core/src/operations/connect.rs: multi_hop_forward_path_tracks_ttl_and_visited_peers, which reproduces the multi-hop forward/accept flow from handshake/tests.rs (TTL decrement, visited list updates, ExpectPeerConnection signalling when the downstream hop accepts).

Those don’t cover every single integration scenario we eventually want, but they plug the specific gaps the deleted tests protected until we build the larger harness in #2082. I agree the stronger architectural boundary still needs to return—#2075 tracks that follow-up so we can reintroduce a clean adapter instead of leaving the logic fused inside p2p_protoc.rs. Happy to start that as soon as this PR is through, unless you’d prefer we fold it into this diff instead.

#2075 is related but not the same I am talking about I think. Here is about establishing a clear boundary between transient connections which are "in-transit" and other connections. Transit connections shouldn't be used for any operation other than the connect operation their are part of and shouldn't be considered part of the established connection pool of the node.

@iduartgomez Gotcha, I created #2084 to address this and we can implement it after this is merged along with #2075 - unless you think this is something we should fix in this PR?

crates/freenet-macros/Cargo.toml

crates/core/src/node/network_bridge/p2p_protoc.rs

crates/core/src/node/p2p_impl.rs

sanity force-pushed the stack/connect-transport-rewrite branch from 9e98c1c to 967d4d9 Compare November 7, 2025 18:47

sanity requested a review from iduartgomez November 8, 2025 01:40

sanity force-pushed the stack/connect-transport-rewrite branch from ead6ba4 to d3318b5 Compare November 8, 2025 02:09

sanity force-pushed the stack/subscription-routing branch from 117dac9 to f4ad769 Compare November 9, 2025 16:56

sanity force-pushed the stack/connect-transport-rewrite branch 2 times, most recently from d3602a7 to 3cfdb91 Compare November 9, 2025 18:29

iduartgomez force-pushed the stack/subscription-routing branch from 4103ba5 to e1f507b Compare November 9, 2025 19:20

sanity force-pushed the stack/connect-transport-rewrite branch from 3cfdb91 to 9b7d958 Compare November 9, 2025 21:34

sanity added 7 commits November 10, 2025 12:45

fix: add update fallback propagation

b9d43f2

fix(update): keep local subscribers and add PUT ack tracing

f72b27c

fix(update): remove GET fallback propagation

6f7a90a

style(update): use direct equality in allow-self check

a3ce0c5

feat: harden subscription routing

6fa3d4b

fix(put): preserve upstream during broadcast

067b91f

fix(put): deliver SuccessfulPut directly to origin

0a1a2f5

sanity force-pushed the stack/subscription-routing branch from e1f507b to 0a1a2f5 Compare November 10, 2025 18:50

sanity force-pushed the stack/connect-transport-rewrite branch from 9b7d958 to 7bac7e5 Compare November 10, 2025 18:55

sanity added 6 commits November 10, 2025 13:59

fix(put): send direct ack to origin on final hop

75fc24e

fix(put): ensure broadcast hop acks original origin

422f5f5

fix(core): drain peer connection channel before waiting

c9ecc2d

fix(network): prevent peer connection listener send starvation

67a0288

refactor(transport): replace handshake pipeline

452a6fd

ci: add river six-peer regression

abb6a30

sanity added 6 commits November 12, 2025 15:20

ci: checkout freenet-test-network

b8184c9

ci: link freenet-test-network dependency

140fa41

ci: rely on crates.io test network

2ca74cb

fix: avoid PUT forward panic before location assigned

e6f79cf

fix(connect): add shutdown handle for initial join task

4fa8bc8

fix: remove unused oneshot import

c518296

sanity force-pushed the stack/connect-transport-rewrite branch from 7bac7e5 to c518296 Compare November 12, 2025 21:26

Base automatically changed from stack/subscription-routing to main November 12, 2025 22:34

ci: merge main into stack/connect-transport-rewrite

5003947

sanity force-pushed the stack/connect-transport-rewrite branch from d59222f to 5003947 Compare November 12, 2025 23:16