refactor(transport): replace handshake pipeline #2065
Conversation
Force-pushed 9e98c1c to 967d4d9
@claude Please review PR
Claude finished @sanity's task

PR Review: Transport/Handshake Pipeline Refactor

I've completed a comprehensive review of this PR. This is a significant refactoring that simplifies the connect/handshake flow and consolidates connection management logic. Overall, the changes are well-structured and improve code maintainability.

Summary of Changes

This PR accomplishes:
Key Findings

1. Handshake Simplification (✅ Major Improvement)

File:
The new handshake handler is dramatically simpler:
Strengths:
Concerns:
Recommendation: Consider adding comments or documentation explaining which test scenarios from the old handshake tests are now covered by the new connect.rs tests.

2. Connect Operation Refactor (✅ Good Design)

File:
The new two-message connect flow is elegant:
Strengths:
Code Quality:
Minor Issues:
3. Connection Manager Changes (✅ Solid)

File:
Improvements:
Potential Issues:
4. Network Bridge Changes
Courtesy connections still need work; I have a local branch that removes courtesy connections. Currently they'll stick around like normal connections, which isn't ideal.
[Codex] Addressed the outstanding feedback:

Tests:
Force-pushed ead6ba4 to d3318b5
[Codex] Rebased onto the latest stack/subscription-routing to clear the GitHub conflicts (the only manual resolution was in …).
Force-pushed 117dac9 to f4ad769
Force-pushed d3602a7 to 3cfdb91
[Codex] Rebased onto the latest …
Force-pushed 4103ba5 to e1f507b
Force-pushed 3cfdb91 to 9b7d958
[Codex] Rebased onto the latest …
Force-pushed e1f507b to 0a1a2f5
Force-pushed 9b7d958 to 7bac7e5
Force-pushed 7bac7e5 to c518296
Force-pushed d59222f to 5003947
Force-pushed c67b254 to f2a9734
All these tests were testing different edge cases which we are not now covering anywhere. These edge cases all can and will happen.
One of the nice things about having this logic separated was that it was actually testable in isolation; now, by merging it all into the connect operation, we cannot do that. This means we probably have some regressions waiting around the corner.
The idea of the handshake handler was to segregate transient connections which hadn't been logically accepted from those which had, so they could be dropped easily without affecting on-going operations. I am still reviewing the PR, so I am not sure whether that has been maintained with the changes done, but it would be good to have an overall answer to the above.
@claude review the comment https://github.com/freenet/freenet-core/pull/2065/files#r2524017704 and analyze the PR with that in mind.
Claude finished @iduartgomez's task
Analysis of Test Coverage and Transient Connection Handling
I've completed a thorough review of the deleted handshake tests and the new architecture. Here's my assessment of the concerns raised:
Test Coverage Analysis
What Was Deleted (651 lines from handshake/tests.rs)
The old test file contained 5 stream-based integration tests:
- `test_stream_gateway_inbound_conn_success` - Gateway accepts unsolicited inbound connection
- `test_stream_gateway_inbound_conn_rejected` - Gateway rejects inbound connection with `AcceptedBy { accepted: false }`
- `test_stream_peer_to_gw_outbound_conn` - Peer connects to gateway and receives acceptance
- `test_stream_peer_to_peer_outbound_conn_succeeded` - Peer-to-peer outbound connection success
- `test_stream_peer_to_gw_outbound_conn_rejected` - Complex multi-hop forwarding scenario with 6 forward attempts and eventual acceptance
These tests were integration-style tests that:
- Used a mock transport layer (`TransportMock`)
- Tested the full handshake state machine with real network message exchanges
- Validated multi-stage connect flows (`StartJoinReq`, `ForwardRequest`, `AcceptedBy` responses)
- Exercised retry/forwarding logic when initial connections were rejected
What Exists Now
The new connect.rs test suite contains 4 unit tests (lines 950-1132):
- `relay_accepts_when_policy_allows` - Tests `RelayState::handle_request` acceptance logic
- `relay_forwards_when_not_accepting` - Tests forwarding when relay declines to accept
- `joiner_tracks_acceptance` - Tests `JoinerState::register_acceptance`
- `init_join_request_initializes_state` - Tests `ConnectOp::initiate_join_request`
These are pure unit tests that:
- Mock the context layer (`TestRelayContext`); a generic illustration follows this list
- Test individual functions in isolation
- Do NOT involve actual network message passing
- Do NOT test the full end-to-end flow
The integration tests in crates/core/tests/connectivity.rs:
- `test_gateway_reconnection` - Tests reconnection after disconnect (high-level)
- `test_basic_gateway_connectivity` - Tests WebSocket connection only
- `test_three_node_network_connectivity` - Tests full mesh with PUT/GET operations
Critical Gap: Loss of Integration-Level Testing
The reviewer's concern is valid. Here's what's missing:
Edge Cases No Longer Covered
1. Inbound connection rejection handling (`test_stream_gateway_inbound_conn_rejected`)
   - Old test validated that rejection messages properly cleaned up state
   - New code has NO test for the inbound rejection flow
2. Multi-hop forwarding logic (`test_stream_peer_to_gw_outbound_conn_rejected`)
   - Old test validated 6 forward attempts with eventual success
   - Tested TTL countdown, visited list updates, and the forward chain
   - New code has NO integration test for multi-hop scenarios (the sketch after this list shows the kind of logic involved)
3. Concurrent connection state transitions
   - Old tests used async select loops to validate state machine transitions
   - New unit tests don't exercise concurrent message handling
4. Transport-level error scenarios
   - Old tests could inject transport failures and test recovery
   - New tests use mocks that always succeed
5. Message ordering and race conditions
   - Old tests validated that out-of-order messages were handled correctly
   - New tests don't exercise the p2p_protoc message dispatch logic
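To make the multi-hop gap concrete, here is a minimal, self-contained sketch of the TTL/visited-list forwarding logic that the deleted test exercised. All names here (`JoinRequest`, `ForwardDecision`, `decide`) are hypothetical illustrations, not the actual connect-operation API:

```rust
use std::collections::HashSet;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct PeerId(u64);

struct JoinRequest {
    joiner: PeerId,
    ttl: u8,                  // hops remaining before the request dies
    visited: HashSet<PeerId>, // peers that have already seen this request
}

enum ForwardDecision {
    Accept,
    Forward(PeerId),
    Drop,
}

// Hypothetical relay decision: accept, forward to an unvisited neighbor,
// or drop once the TTL is exhausted (which is what prevents forwarding loops).
fn decide(
    req: &mut JoinRequest,
    me: &PeerId,
    accepting: bool,
    neighbors: &[PeerId],
) -> ForwardDecision {
    if accepting {
        return ForwardDecision::Accept;
    }
    if req.ttl == 0 {
        return ForwardDecision::Drop;
    }
    req.ttl -= 1;
    req.visited.insert(me.clone());
    // Never forward back to a peer that has already seen the request.
    match neighbors.iter().find(|n| !req.visited.contains(*n)) {
        Some(next) => ForwardDecision::Forward(next.clone()),
        None => ForwardDecision::Drop,
    }
}

fn main() {
    let mut req = JoinRequest {
        joiner: PeerId(1),
        ttl: 2,
        visited: HashSet::from([PeerId(1)]),
    };
    // A non-accepting relay forwards to peer 3 (peer 1 is already visited)
    // and decrements the TTL.
    assert!(matches!(
        decide(&mut req, &PeerId(2), false, &[PeerId(1), PeerId(3)]),
        ForwardDecision::Forward(PeerId(3))
    ));
    assert_eq!(req.ttl, 1);
}
```

An integration test would drive this decision across several relays and assert the visited set and TTL at each hop; that is exactly the coverage that disappeared.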
Transient Connection Segregation Analysis
The reviewer asks: "The idea of the handshake handler was to segregate transient connections which hadn't been logically accepted from those which had, so they could be dropped easily without affecting on-going operations."
Old Architecture (Deleted)
- `HandshakeHandler` maintained complex state machines (`InboundState`, `OutboundState`)
- Connections had explicit lifecycle states: `Initializing`, `Accepting`, `Accepted`, `Rejected`
- Transient connections were tracked separately in the handler's state
- Failed handshakes could be cleaned up without touching the ring
New Architecture
The segregation is now distributed across three subsystems:
1. `ConnectionManager` tracks pending vs active via atomic counters (see the sketch after this list):
   - `reserved_connections` - connections in handshake (transient)
   - `open_connections` - fully established connections
   - `record_pending_location()` - adds peer to routing before handshake completes (line 273)
   - `prune_in_transit_connection()` - cleans up failed handshakes (line 349)
   - `prune_alive_connection()` - cleans up active connections (line 345)
2. `HandshakeHandler` is now a thin shim:
   - `expected_inbound` map tracks peers we're expecting (line 136)
   - Commands: `Connect`, `ExpectInbound`, `DropConnection`
   - Events: `InboundConnection`, `OutboundEstablished`, `OutboundFailed`
   - No state machine logic - just wires transport to the event loop
3. `ConnectOp` in p2p_protoc handles the higher-level logic:
   - State: `WaitingForResponses` (joiner) vs relay state
   - Connection expectations registered via the `ExpectInbound` command
   - Failures trigger `prune_in_transit_connection()` to clean up
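As a rough illustration of how the counter-based segregation in point 1 can work: a minimal sketch, assuming field names and a fixed budget that are NOT the real `ConnectionManager` internals.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const MAX_CONNECTIONS: usize = 16; // assumed budget, for illustration only

struct Reservations {
    reserved: AtomicUsize, // connections still in handshake (transient)
    open: AtomicUsize,     // fully established connections
}

impl Reservations {
    /// Try to reserve a slot for a new handshake; fails when the budget is full.
    fn try_reserve(&self) -> bool {
        let mut cur = self.reserved.load(Ordering::Acquire);
        loop {
            if cur + self.open.load(Ordering::Acquire) >= MAX_CONNECTIONS {
                return false;
            }
            // CAS retry: another task may have reserved a slot concurrently.
            match self.reserved.compare_exchange_weak(
                cur,
                cur + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                Err(actual) => cur = actual,
            }
        }
    }

    /// A failed handshake must release its reservation, or the counter leaks.
    fn prune_in_transit(&self) {
        self.reserved.fetch_sub(1, Ordering::AcqRel);
    }

    /// A successful handshake converts a reservation into an open connection.
    fn promote(&self) {
        self.reserved.fetch_sub(1, Ordering::AcqRel);
        self.open.fetch_add(1, Ordering::AcqRel);
    }
}

fn main() {
    let r = Reservations {
        reserved: AtomicUsize::new(0),
        open: AtomicUsize::new(0),
    };
    assert!(r.try_reserve());
    r.promote(); // handshake succeeded
    assert!(r.try_reserve());
    r.prune_in_transit(); // handshake failed: release the slot
    assert_eq!(r.open.load(Ordering::Acquire), 1);
    assert_eq!(r.reserved.load(Ordering::Acquire), 0);
}
```

The leak risk flagged below is visible here: every successful `try_reserve` must eventually be paired with exactly one `promote` or `prune_in_transit` call.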
Assessment: The segregation still exists but is less explicit:
- ✅ Reserved connections are tracked separately from open connections
- ✅ Failed handshakes are cleaned up via `prune_in_transit_connection()`
- ⚠️ `record_pending_location()` makes pending peers immediately visible to the router (lines 273-293 in connection_manager.rs), which could route operations to peers that haven't finished handshaking
- ⚠️ Cleanup depends on p2p_protoc correctly calling `prune_in_transit_connection()` on every failure path
Regression Risks
Based on this analysis, I identify HIGH regression risk in these areas:
1. Multi-hop Forwarding (HIGH RISK)
- Old test: `test_stream_peer_to_gw_outbound_conn_rejected` validated 6 forward attempts
- New coverage: unit test `relay_forwards_when_not_accepting` only tests a single forward
- Risk: TTL countdown bugs, visited list corruption, infinite forwarding loops
2. Connection State Cleanup (HIGH RISK)
- Old test: Explicit state machine transitions with cleanup validation
- New coverage: No integration test validating cleanup on failures
- Risk: Reserved connection counter leaks, routing table pollution from failed handshakes
3. Concurrent Connection Attempts (MEDIUM RISK)
- Old test: Multiple simultaneous connections with async race conditions
- New coverage: Unit tests are synchronous
- Risk: Race conditions in the `reserved_connections` CAS loop, double-acceptance bugs
4. Inbound Rejection Flow (MEDIUM RISK)
- Old test: `test_stream_gateway_inbound_conn_rejected` validated rejection messages
- New coverage: None
- Risk: Rejected connections not cleaned up properly
5. Routing to Pending Peers (NEW RISK)
- Issue: `record_pending_location()` makes pending peers visible to the router before the handshake completes
- Coverage: None; new behavior not tested
- Risk: Operations routed to peers that fail handshake mid-flight
Recommendations
Immediate (Before Merge)
1. Add an integration test for multi-hop forwarding:
   `#[freenet_test] async fn test_multi_hop_connect_with_ttl_exhaustion()`
   Should validate: TTL countdown, visited list, eventual acceptance after N hops
2. Add an integration test for connection cleanup:
   `#[freenet_test] async fn test_failed_handshake_cleanup()`
   Should validate: `reserved_connections` decremented, `location_for_peer` cleaned up
3. Document the pending peer routing behavior (a sketch follows this list):
   - Add comments in `record_pending_location()` explaining why pending peers are made visible
   - Document what happens if an operation is routed to a peer that fails handshake
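As a sketch of the documentation recommended in point 3 (the types and signatures here are assumptions, not the actual connection_manager.rs API):

```rust
// Sketch only: stand-in types to illustrate the doc comments being asked
// for; not the real ConnectionManager implementation.
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, PartialEq, Eq, Hash)]
struct PeerId(String);
#[derive(Clone, Copy)]
struct Location(f64);

struct ConnectionManager {
    location_for_peer: Mutex<HashMap<PeerId, Location>>,
}

impl ConnectionManager {
    /// Record a provisional location for a peer whose handshake has not
    /// completed yet.
    ///
    /// The peer becomes visible to the router immediately, which helps
    /// topology decisions, but it also means operations can be routed to
    /// it before the handshake finishes. Every failure path must therefore
    /// call `prune_in_transit_connection` to remove the entry again.
    fn record_pending_location(&self, peer: PeerId, location: Location) {
        self.location_for_peer.lock().unwrap().insert(peer, location);
    }

    /// Cleanup for a failed handshake: drop the provisional routing entry.
    fn prune_in_transit_connection(&self, peer: &PeerId) {
        self.location_for_peer.lock().unwrap().remove(peer);
    }
}

fn main() {
    let cm = ConnectionManager {
        location_for_peer: Mutex::new(HashMap::new()),
    };
    let peer = PeerId("alice".into());
    cm.record_pending_location(peer.clone(), Location(0.42));
    // Handshake failed: the provisional entry must be pruned.
    cm.prune_in_transit_connection(&peer);
    assert!(cm.location_for_peer.lock().unwrap().is_empty());
}
```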
Follow-up (Post-Merge)
1. Add a stress test for concurrent connections (sketched after this list):
   `#[freenet_test] async fn test_concurrent_connection_attempts()`
   Should validate: No counter overflows, proper CAS retries
2. Add a test for inbound rejection:
   `#[freenet_test] async fn test_inbound_connection_rejected_by_policy()`
3. Consider adding back explicit transient state tracking:
   - The current architecture relies on correct cleanup calls from multiple places
   - A centralized "pending connections" tracker would be more robust
Conclusion
The reviewer's concerns are well-founded. This PR:
- ✅ Successfully simplifies the codebase (-1300 lines)
- ✅ Maintains separation between transient and active connections
- ❌ Loses significant integration test coverage
- ❌ Introduces new risks (routing to pending peers)
- ⚠️ Makes the cleanup contract less explicit (distributed across multiple call sites)
The unit tests validate individual components, but do not prove that the full connect flow works correctly. The integration tests in connectivity.rs are higher-level (PUT/GET operations) and won't catch handshake-specific bugs.
Recommendation: Add at least the two critical integration tests (multi-hop forwarding + cleanup) before merging. The simplified architecture is an improvement, but it needs integration-level validation to avoid regressions.
@iduartgomez Transient “courtesy” connections are still tracked, but they currently share the same connection budget as any other peer. PR #2067 reintroduces a dedicated courtesy bucket and the cleanup logic that made it easy to drop unaccepted sockets without touching steady-state neighbors. I left this PR focused on getting CI green and test coverage restored; the courtesy-specific handling (and broader connection-budget rewrite) continue in #2067 where the transport join flow is already being refactored to keep transient links separate again.
For the coverage side I added:
- crates/core/src/node/network_bridge/handshake.rs: an `ExpectedInboundTracker` that mirrors the old handshake handler's transient-map semantics, plus unit tests that exercise register/consume/drop/replacement (sketched below). That gives us a deterministic signal if transient sockets ever stop being segregated.
- crates/core/src/operations/connect.rs: `multi_hop_forward_path_tracks_ttl_and_visited_peers`, which reproduces the multi-hop forward/accept flow from handshake/tests.rs (TTL decrement, visited list updates, `ExpectPeerConnection` signalling when the downstream hop accepts).

Those don't cover every single integration scenario we eventually want, but they plug the specific gaps the deleted tests protected until we build the larger harness in #2082. I agree the stronger architectural boundary still needs to return; #2075 tracks that follow-up so we can reintroduce a clean adapter instead of leaving the logic fused inside p2p_protoc.rs. Happy to start that as soon as this PR is through, unless you'd prefer we fold it into this diff instead.
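For reference, a tracker with those register/consume/drop/replacement semantics looks roughly like this (method names and shape are a simplified sketch, not the exact `ExpectedInboundTracker` API):

```rust
// Sketch of a transient-inbound tracker: expectations are registered before
// the peer dials us, consumed exactly once on arrival, and dropped when the
// owning connect operation fails.
use std::collections::HashMap;
use std::net::SocketAddr;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct PeerId(String);

#[derive(Default)]
struct ExpectedInbound {
    expected: HashMap<SocketAddr, PeerId>,
}

impl ExpectedInbound {
    /// Register a peer we expect to dial us; returns any stale entry this
    /// replaces (the "replacement" case the unit tests cover).
    fn register(&mut self, addr: SocketAddr, peer: PeerId) -> Option<PeerId> {
        self.expected.insert(addr, peer)
    }

    /// Consume the expectation when the inbound connection arrives, so the
    /// socket can be promoted out of the transient set exactly once.
    fn consume(&mut self, addr: &SocketAddr) -> Option<PeerId> {
        self.expected.remove(addr)
    }

    /// Drop an expectation whose connect operation failed or timed out.
    fn drop_expectation(&mut self, addr: &SocketAddr) -> bool {
        self.expected.remove(addr).is_some()
    }
}

fn main() {
    let mut tracker = ExpectedInbound::default();
    let addr: SocketAddr = "127.0.0.1:31337".parse().unwrap();
    assert!(tracker.register(addr, PeerId("a".into())).is_none());
    // Re-registering the same address replaces the stale expectation.
    assert_eq!(
        tracker.register(addr, PeerId("b".into())),
        Some(PeerId("a".into()))
    );
    // The inbound socket consumes the expectation exactly once.
    assert_eq!(tracker.consume(&addr), Some(PeerId("b".into())));
    assert!(!tracker.drop_expectation(&addr));
}
```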
#2075 is related but not quite what I am talking about, I think. This is about establishing a clear boundary between transient connections which are "in-transit" and other connections. In-transit connections shouldn't be used for any operation other than the connect operation they are part of, and shouldn't be considered part of the node's established connection pool.
@iduartgomez Gotcha, I created #2084 to address this and we can implement it after this is merged along with #2075 - unless you think this is something we should fix in this PR?
Force-pushed f2a9734 to 2b64641