Skip to content

Conversation

@sanity
Copy link
Collaborator

@sanity sanity commented Nov 7, 2025

Summary

  • replace the legacy connect/handshake path with the trimmed two-message flow
  • consolidate ring + transport plumbing (connection manager counters, packet handling, tracing)
  • add GitHub workflow that runs the river six-peer regression on the freenet-128gb runner

Testing

  • cargo test -p freenet connect::tests
  • cargo test -p freenet node::network_bridge::priority_select::tests
  • cargo test connectivity::test_gateway_reconnection
  • FREENET_CORE_PATH=/home/ian/code/freenet/freenet-core/courtesy-connection-budget cargo test --test message_flow river_message_flow_over_freenet_six_peers_five_rounds -- --ignored --exact

@sanity sanity force-pushed the stack/connect-transport-rewrite branch from 9e98c1c to 967d4d9 Compare November 7, 2025 18:47
@sanity
Copy link
Collaborator Author

sanity commented Nov 7, 2025

@claude Please review PR

@claude
Copy link

claude bot commented Nov 7, 2025

Claude finished @sanity's task —— View job


PR Review: Transport/Handshake Pipeline Refactor

I've completed a comprehensive review of this PR. This is a significant refactoring that simplifies the connect/handshake flow and consolidates connection management logic. Overall, the changes are well-structured and improve code maintainability.

Summary of Changes

This PR accomplishes:

  • Replaces legacy multi-stage connect handshake with a streamlined two-message flow
  • Consolidates ring + transport connection management
  • Removes ~1500 lines of complex handshake logic
  • Adds CI workflow for six-peer regression testing
  • Net reduction of ~1300 lines of code

Key Findings

1. Handshake Simplification (✅ Major Improvement)

File: crates/core/src/node/network_bridge/handshake.rs

The new handshake handler is dramatically simpler:

  • Before: 1732 lines of complex state machine logic
  • After: 225 lines of lightweight connection adapter
  • Removes 651 lines of handshake tests (deleted file: handshake/tests.rs)

Strengths:

  • Much cleaner separation of concerns
  • Handshake driver now just wires transport events to/from event loop
  • Higher-level routing decisions moved into ConnectOp where they belong

Concerns:

  • ⚠️ Loss of test coverage: The deleted handshake/tests.rs file contained 651 lines of tests. While the old code was complex, these tests validated important handshake scenarios. The PR description mentions some replacement tests (cargo test -p freenet connect::tests), but it's unclear if all scenarios are covered.

Recommendation: Consider adding comments or documentation explaining which test scenarios from the old handshake tests are now covered by the new connect::tests and which (if any) are intentionally not covered because they're obsolete.

2. Connect Operation Refactor (✅ Good Design)

File: crates/core/src/operations/connect.rs

The new two-message connect flow is elegant:

  • ConnectMsg::Request: Joiner sends request toward target with TTL and visited list
  • ConnectMsg::Response: Acceptors respond back along discovered path
  • ConnectMsg::ObservedAddress: Peers notify joiners of their observed public address

Strengths:

  • Clear state separation: JoinerState vs RelayState
  • TTL-based forwarding prevents infinite loops
  • Visited list prevents trivial routing cycles
  • Observable address discovery for NAT traversal

Code Quality:

  • Lines 189-254: RelayState::handle_request() is well-structured with clear action outputs
  • Lines 317-334: JoinerState::register_acceptance() has good progress tracking
  • Lines 766-834: join_ring_request() has proper error handling and backoff logic

Minor Issues:

  • Line 299: courtesy_hint() always returns self.op_manager.ring.open_connections() == 0 - this seems simplistic. Is there documentation explaining the courtesy connection strategy?
  • Lines 836-944: initial_join_procedure() spawns an infinite loop that checks connection health. Consider adding a way to gracefully shut down this task.

3. Connection Manager Changes (✅ Solid)

File: crates/core/src/ring/connection_manager.rs

Improvements:

  • Lines 122-266: should_accept() has excellent logging and atomic counter management
  • Lines 203-219: Gateway-specific direct-accept limit (GATEWAY_DIRECT_ACCEPT_LIMIT = 2) is a good safeguard
  • Lines 268-293: New record_pending_location() method properly tracks peers during handshake
  • Lines 386-428: update_peer_identity() handles provisional → final peer identity transitions

Potential Issues:

  • Lines 151-178: The reserved_connections CAS loop could theoretically retry forever if under extreme contention. Consider adding a retry limit.
  • Lines 345-351: prune_in_transit_connection() and prune_alive_connection() both call the same prune_connection() with a boolean flag - this is fine, but the naming could be clearer (e.g., was_alive instead of is_alive).

4. Network Bridge Changes (⚠️ Complex but Necessary)

File: crates/core/src/node/network_bridge/p2p_protoc.rs

This file grew from ~900 to ~2000 lines, absorbing handshake coordination logic.

Concerns:

  • Lines 347-360: Bug check for self-targeting messages logs as "BUG" and converts to inbound processing. This is good defensive programming, but the error message says "This indicates a routing logic error". If this shouldn't happen, should it panic in debug builds?
  • Lines 367-388: TOCTOU race condition mitigation is well-documented (line 367 comment), but the fallback logic at lines 374-388 (resolving by public key when address is unspecified) feels fragile. What happens if multiple connections exist with different addresses but same key?
  • Lines 449-497: Connection establishment is spawned into a background task with 20-second timeout. If this fails, the original operation may hang waiting for a connection that never completes. Is there a cleanup mechanism?

Strengths:

  • Lines 1319-1454: handle_handshake_action() consolidates handshake event processing cleanly
  • Lines 1498-1593: handle_successful_connection() properly coordinates connection manager updates
  • Lines 1605-1716: handle_peer_connection_msg() handles provisional peer identity updates well (lines 1608-1646)

5. Transport Layer (✅ Minor Cleanup)

File: crates/core/src/transport/peer_connection.rs

  • Removed 78 lines of unused code
  • File crates/core/src/transport/mod.rs removed 7 lines of dead code

This is good cleanup.

6. CI/Testing (✅ Good Addition)

File: .github/workflows/six-peer-regression.yml

  • New workflow runs River six-peer regression test
  • 120-minute timeout is generous
  • Tests on both stack/connect-transport-rewrite and main branches

Suggestion: Consider adding a fast-fail basic connectivity test that runs on every PR before the expensive 2-hour test.

7. Test Updates (✅ Updated Appropriately)

Multiple test files updated to match new connection patterns:

  • crates/core/tests/connectivity.rs: Updated gateway reconnection test
  • crates/core/tests/operations.rs: Updated for new connect message structure
  • crates/core/tests/error_notification.rs: Updated error scenarios

Issues Found

🔴 Critical

None found - the code appears functional.

🟡 Medium Priority

  1. Test Coverage Gap (handshake/tests.rs deleted):

    • The PR removes 651 lines of handshake tests
    • Need to verify all important scenarios are covered by new tests
    • Location: crates/core/src/node/network_bridge/handshake/tests.rs (deleted)
  2. Infinite Loop Task Management (connect.rs:836-944):

    • initial_join_procedure() spawns infinite loop with no shutdown mechanism
    • Could leak tasks on node restart
    • Location: crates/core/src/operations/connect.rs:836-944
  3. Potential Connection Leak (p2p_protoc.rs:449-497):

    • Background task spawned to wait for connection with 20s timeout
    • No obvious cleanup if original requester goes away
    • Location: crates/core/src/node/network_bridge/p2p_protoc.rs:449-497

🟢 Low Priority (Style/Documentation)

  1. Courtesy Connection Strategy (connect.rs:299):

    • Logic is simple (open_connections() == 0) but not documented
    • Location: crates/core/src/operations/connect.rs:299
  2. Reserved Connections CAS Loop (connection_manager.rs:151-178):

    • Could add retry limit for extreme contention scenarios
    • Location: crates/core/src/ring/connection_manager.rs:151-178
  3. Self-Routing Bug Detection (p2p_protoc.rs:347-360):

    • Logs as "BUG" but continues execution - consider panic in debug mode?
    • Location: crates/core/src/node/network_bridge/p2p_protoc.rs:347-360

Recommendations

  1. Before merge:

    • Add comments mapping old handshake test scenarios to new test coverage
    • Document the courtesy connection strategy (why open_connections() == 0?)
    • Review connection cleanup in background tasks (line 449 in p2p_protoc.rs)
  2. Follow-up PRs:

    • Add graceful shutdown for initial_join_procedure() infinite loop
    • Consider retry limits on CAS loops in connection_manager
    • Add metrics/observability for connection state transitions

Conclusion

This is a solid refactoring that significantly simplifies the codebase while preserving functionality. The new two-message connect flow is more maintainable than the legacy multi-stage approach. The main concern is ensuring test coverage remains comprehensive after removing the old handshake tests.

Recommendation: ✅ Approve with minor follow-up items

The code quality is good, and the concerns raised are mostly about documentation and future robustness rather than correctness issues. The reduction of ~1300 lines while maintaining functionality is impressive.


Review completed by Claude

@sanity sanity requested a review from iduartgomez November 8, 2025 01:40
@sanity
Copy link
Collaborator Author

sanity commented Nov 8, 2025

Line 299: courtesy_hint() always returns self.op_manager.ring.open_connections() == 0 - this seems simplistic. Is there documentation explaining the courtesy connection strategy?

Courtesy connections still need work, I have a local branch that removes courtesy connections - currently they'll stick around like normal connections which isn't ideal.

@sanity
Copy link
Collaborator Author

sanity commented Nov 8, 2025

[Codex] Addressed the outstanding feedback:

  • documented why the current courtesy hint is intentionally simple and tied it back to the follow-up branch that will finish the feature.
  • taught initial_join_procedure to return a JoinHandle and made the node runtime/testing harness keep and abort it, so the watchdog loop can be shut down cleanly when the node exits.

Tests: cargo test -p freenet connect::tests:: -- --nocapture

@sanity sanity force-pushed the stack/connect-transport-rewrite branch from ead6ba4 to d3318b5 Compare November 8, 2025 02:09
@sanity
Copy link
Collaborator Author

sanity commented Nov 8, 2025

[Codex] Rebased onto the latest stack/subscription-routing to clear the GitHub conflicts (the only manual resolution was in crates/core/src/operations/put.rs, making sure the new connect-origin fields coexist with the TTL/skip-list tweaks). Re-ran cargo test -p freenet connect::tests:: -- --nocapture after the rebase and pushed with --force-with-lease.

@sanity sanity force-pushed the stack/subscription-routing branch from 117dac9 to f4ad769 Compare November 9, 2025 16:56
@sanity sanity force-pushed the stack/connect-transport-rewrite branch 2 times, most recently from d3602a7 to 3cfdb91 Compare November 9, 2025 18:29
@sanity
Copy link
Collaborator Author

sanity commented Nov 9, 2025

[Codex] Rebased onto the latest stack/subscription-routing so the PR is conflict-free again. Only automated picks this time—no manual edits were required beyond what was already on the branch. Re-ran cargo test -p freenet connect::tests:: -- --nocapture after the rebase and force-pushed with lease.

@iduartgomez iduartgomez force-pushed the stack/subscription-routing branch from 4103ba5 to e1f507b Compare November 9, 2025 19:20
@sanity sanity force-pushed the stack/connect-transport-rewrite branch from 3cfdb91 to 9b7d958 Compare November 9, 2025 21:34
@sanity
Copy link
Collaborator Author

sanity commented Nov 9, 2025

[Codex] Rebased onto the latest stack/subscription-routing (which force-pushed earlier today). The only conflicts were the two session-actor commits that upstream already absorbed, so I kept the newer upstream implementation while replaying the stack. Targeted tests (cargo test -p freenet connect::tests:: -- --nocapture) pass, and the branch has been force-pushed with lease.

@sanity sanity force-pushed the stack/subscription-routing branch from e1f507b to 0a1a2f5 Compare November 10, 2025 18:50
@sanity sanity force-pushed the stack/connect-transport-rewrite branch from 9b7d958 to 7bac7e5 Compare November 10, 2025 18:55
@sanity sanity force-pushed the stack/connect-transport-rewrite branch from 7bac7e5 to c518296 Compare November 12, 2025 21:26
Base automatically changed from stack/subscription-routing to main November 12, 2025 22:34
@sanity sanity force-pushed the stack/connect-transport-rewrite branch from d59222f to 5003947 Compare November 12, 2025 23:16
@iduartgomez iduartgomez force-pushed the stack/connect-transport-rewrite branch from c67b254 to f2a9734 Compare November 13, 2025 08:32
Copy link
Collaborator

@iduartgomez iduartgomez Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this tests were testing different edge cases which we are not now covering anywhere. This edge cases all can and will happen.

One of the nice things of having this logic separated is that it was actually testable in isolation, now by merging it all with the connect operation we cannot do that. This means we probably have some regressions waiting around the corner.

The idea of the handshake handler was to segregate transient connections which hadn't been logically accepted from those which were not so they could be dropped easily without affecting on-going operations. I am reviewing the PR so I am not sure if that has been maintained with the changes done, but would be good having an overall answer to the above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claude review the comment https://github.com/freenet/freenet-core/pull/2065/files#r2524017704 and analyze the PR with that in mind.

Copy link

@claude claude bot Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude finished @iduartgomez's task —— View job


Analysis of Test Coverage and Transient Connection Handling

I've completed a thorough review of the deleted handshake tests and the new architecture. Here's my assessment of the concerns raised:


Test Coverage Analysis

What Was Deleted (651 lines from handshake/tests.rs)

The old test file contained 5 stream-based integration tests:

  1. test_stream_gateway_inbound_conn_success - Gateway accepts unsolicited inbound connection
  2. test_stream_gateway_inbound_conn_rejected - Gateway rejects inbound connection with AcceptedBy{accepted: false}
  3. test_stream_peer_to_gw_outbound_conn - Peer connects to gateway and receives acceptance
  4. test_stream_peer_to_peer_outbound_conn_succeeded - Peer-to-peer outbound connection success
  5. test_stream_peer_to_gw_outbound_conn_rejected - Complex multi-hop forwarding scenario with 6 forward attempts and eventual acceptance

These tests were integration-style tests that:

  • Used mock transport layer (TransportMock)
  • Tested the full handshake state machine with real network message exchanges
  • Validated multi-stage connect flows (StartJoinReq, ForwardRequest, AcceptedBy responses)
  • Exercised retry/forwarding logic when initial connections were rejected

What Exists Now

The new connect.rs test suite contains 3 unit tests (lines 950-1132):

  1. relay_accepts_when_policy_allows - Tests RelayState::handle_request acceptance logic
  2. relay_forwards_when_not_accepting - Tests forwarding when relay declines to accept
  3. joiner_tracks_acceptance - Tests JoinerState::register_acceptance
  4. init_join_request_initializes_state - Tests ConnectOp::initiate_join_request

These are pure unit tests that:

  • Mock the context layer (TestRelayContext)
  • Test individual functions in isolation
  • Do NOT involve actual network message passing
  • Do NOT test the full end-to-end flow

The integration tests in crates/core/tests/connectivity.rs:

  • test_gateway_reconnection - Tests reconnection after disconnect (high-level)
  • test_basic_gateway_connectivity - Tests WebSocket connection only
  • test_three_node_network_connectivity - Tests full mesh with PUT/GET operations

Critical Gap: Loss of Integration-Level Testing

The reviewer's concern is valid. Here's what's missing:

Edge Cases No Longer Covered

  1. Inbound connection rejection handling (test_stream_gateway_inbound_conn_rejected)

    • Old test validated that rejection messages properly cleaned up state
    • New code has NO test for inbound rejection flow
  2. Multi-hop forwarding logic (test_stream_peer_to_gw_outbound_conn_rejected)

    • Old test validated 6 forward attempts with eventual success
    • Tested TTL countdown, visited list updates, and forward chain
    • New code has NO integration test for multi-hop scenarios
  3. Concurrent connection state transitions

    • Old tests used async select loops to validate state machine transitions
    • New code unit tests don't exercise concurrent message handling
  4. Transport-level error scenarios

    • Old tests could inject transport failures and test recovery
    • New tests use mocks that always succeed
  5. Message ordering and race conditions

    • Old tests validated that out-of-order messages were handled correctly
    • New tests don't exercise the p2p_protoc message dispatch logic

Transient Connection Segregation Analysis

The reviewer asks: "The idea of the handshake handler was to segregate transient connections which hadn't been logically accepted from those which were not so they could be dropped easily without affecting on-going operations."

Old Architecture (Deleted)

  • HandshakeHandler maintained complex state machines (InboundState, OutboundState)
  • Connections had explicit lifecycle states: Initializing, Accepting, Accepted, Rejected
  • Transient connections were tracked separately in the handler's state
  • Failed handshakes could be cleaned up without touching the ring

New Architecture

The segregation is now distributed across two subsystems:

  1. ConnectionManager tracks pending vs active via atomic counters:

    • reserved_connections - connections in handshake (transient)
    • open_connections - fully established connections
    • record_pending_location() - adds peer to routing before handshake completes (line 273)
    • prune_in_transit_connection() - cleans up failed handshakes (line 349)
    • prune_alive_connection() - cleans up active connections (line 345)
  2. HandshakeHandler is now a thin shim:

    • expected_inbound map tracks peers we're expecting (line 136)
    • Commands: Connect, ExpectInbound, DropConnection
    • Events: InboundConnection, OutboundEstablished, OutboundFailed
    • No state machine logic - just wires transport to event loop
  3. ConnectOp in p2p_protoc handles higher-level logic:

    • State: WaitingForResponses (joiner) vs relay state
    • Connection expectations registered via ExpectInbound command
    • Failures trigger prune_in_transit_connection() to clean up

Assessment: The segregation still exists but is less explicit:

  • ✅ Reserved connections are tracked separately from open connections
  • ✅ Failed handshakes are cleaned up via prune_in_transit_connection()
  • ⚠️ However: record_pending_location() makes pending peers immediately visible to the router (line 273-293 in connection_manager.rs), which could route operations to peers that haven't finished handshaking
  • ⚠️ Cleanup depends on p2p_protoc correctly calling prune_in_transit_connection() on every failure path

Regression Risks

Based on this analysis, I identify HIGH regression risk in these areas:

1. Multi-hop Forwarding (HIGH RISK)

  • Old test: test_stream_peer_to_gw_outbound_conn_rejected validated 6 forward attempts
  • New coverage: Unit test relay_forwards_when_not_accepting only tests single forward
  • Risk: TTL countdown bugs, visited list corruption, infinite forwarding loops

2. Connection State Cleanup (HIGH RISK)

  • Old test: Explicit state machine transitions with cleanup validation
  • New coverage: No integration test validating cleanup on failures
  • Risk: Reserved connection counter leaks, routing table pollution from failed handshakes

3. Concurrent Connection Attempts (MEDIUM RISK)

  • Old test: Multiple simultaneous connections with async race conditions
  • New coverage: Unit tests are synchronous
  • Risk: Race conditions in reserved_connections CAS loop, double-acceptance bugs

4. Inbound Rejection Flow (MEDIUM RISK)

  • Old test: test_stream_gateway_inbound_conn_rejected validated rejection messages
  • New coverage: None
  • Risk: Rejected connections not cleaned up properly

5. Routing to Pending Peers (NEW RISK)

  • Issue: record_pending_location() makes pending peers visible to router before handshake completes
  • Coverage: None - new behavior not tested
  • Risk: Operations routed to peers that fail handshake mid-flight

Recommendations

Immediate (Before Merge)

  1. Add integration test for multi-hop forwarding:

    #[freenet_test]
    async fn test_multi_hop_connect_with_ttl_exhaustion()

    Should validate: TTL countdown, visited list, eventual acceptance after N hops

  2. Add integration test for connection cleanup:

    #[freenet_test]
    async fn test_failed_handshake_cleanup()

    Should validate: reserved_connections decremented, location_for_peer cleaned up

  3. Document the pending peer routing behavior:

    • Add comments in record_pending_location() explaining why pending peers are made visible
    • Document what happens if an operation is routed to a peer that fails handshake

Follow-up (Post-Merge)

  1. Add stress test for concurrent connections:

    #[freenet_test]
    async fn test_concurrent_connection_attempts()

    Should validate: No counter overflows, proper CAS retries

  2. Add test for inbound rejection:

    #[freenet_test]
    async fn test_inbound_connection_rejected_by_policy()
  3. Consider adding back explicit transient state tracking:

    • Current architecture relies on correct cleanup calls from multiple places
    • A centralized "pending connections" tracker would be more robust

Conclusion

The reviewer's concerns are well-founded. This PR:

  • ✅ Successfully simplifies the codebase (-1300 lines)
  • ✅ Maintains separation between transient and active connections
  • Loses significant integration test coverage
  • Introduces new risks (routing to pending peers)
  • ⚠️ Makes the cleanup contract less explicit (distributed across multiple call sites)

The unit tests validate individual components, but do not prove that the full connect flow works correctly. The integration tests in connectivity.rs are higher-level (PUT/GET operations) and won't catch handshake-specific bugs.

Recommendation: Add at least the two critical integration tests (multi-hop forwarding + cleanup) before merging. The simplified architecture is an improvement, but it needs integration-level validation to avoid regressions.


Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iduartgomez Transient “courtesy” connections are still tracked, but they currently share the same connection budget as any other peer. PR #2067 reintroduces a dedicated courtesy bucket and the cleanup logic that made it easy to drop unaccepted sockets without touching steady-state neighbors. I left this PR focused on getting CI green and test coverage restored; the courtesy-specific handling (and broader connection-budget rewrite) continue in #2067 where the transport join flow is already being refactored to keep transient links separate again.

Copy link
Collaborator Author

@sanity sanity Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the coverage side I added:

  • crates/core/src/node/network_bridge/handshake.rs: an ExpectedInboundTracker that mirrors the old handshake handler’s transient-map semantics plus unit tests that exercise register/consume/drop/replacement. That gives us a deterministic signal if transient sockets ever stop being segregated.
  • crates/core/src/operations/connect.rs: multi_hop_forward_path_tracks_ttl_and_visited_peers, which reproduces the multi-hop forward/accept flow from handshake/tests.rs (TTL decrement, visited list updates, ExpectPeerConnection signalling when the downstream hop accepts).

Those don’t cover every single integration scenario we eventually want, but they plug the specific gaps the deleted tests protected until we build the larger harness in #2082. I agree the stronger architectural boundary still needs to return—#2075 tracks that follow-up so we can reintroduce a clean adapter instead of leaving the logic fused inside p2p_protoc.rs. Happy to start that as soon as this PR is through, unless you’d prefer we fold it into this diff instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2075 is related but not the same I am talking about I think. Here is about establishing a clear boundary between transient connections which are "in-transit" and other connections. Transit connections shouldn't be used for any operation other than the connect operation their are part of and shouldn't be considered part of the established connection pool of the node.

Copy link
Collaborator Author

@sanity sanity Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iduartgomez Gotcha, I created #2084 to address this and we can implement it after this is merged along with #2075 - unless you think this is something we should fix in this PR?

@sanity sanity force-pushed the stack/connect-transport-rewrite branch from f2a9734 to 2b64641 Compare November 13, 2025 16:19
@sanity sanity added this pull request to the merge queue Nov 14, 2025
Merged via the queue into main with commit 0a10654 Nov 14, 2025
28 checks passed
@sanity sanity deleted the stack/connect-transport-rewrite branch November 14, 2025 19:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants