
Feature: network cancellability and deadlines #468

Merged
merged 4 commits into staging on Oct 5, 2022

Conversation

tegefaulkes
Contributor

@tegefaulkes tegefaulkes commented Sep 30, 2022

Description

This PR focuses on updating the network domain with timedCancellable, general cancellability and deadlines.
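
For context, here is a minimal sketch of the timed-cancellable pattern being adopted. This is illustrative only: the real timedCancellable lives in the codebase's contexts utilities, and `ContextTimed` below is a simplified stand-in, not the exact type.

```ts
// Illustrative only: `ContextTimed` here is a simplified stand-in.
type ContextTimed = {
  signal: AbortSignal;
  timer: { getTimeout: () => number };
};

// Wrap an async function so it receives a context whose AbortSignal fires
// either on caller cancellation or when the deadline elapses.
function timedCancellable<T>(
  f: (ctx: ContextTimed) => Promise<T>,
  timeout: number,
): (signal?: AbortSignal) => Promise<T> {
  return async (signal?: AbortSignal) => {
    const controller = new AbortController();
    const timeoutRef = setTimeout(
      () => controller.abort(new Error('Timed out')),
      timeout,
    );
    // Propagate the caller's cancellation into the inner signal
    const onAbort = () => controller.abort(signal?.reason);
    signal?.addEventListener('abort', onAbort, { once: true });
    try {
      return await f({
        signal: controller.signal,
        timer: { getTimeout: () => timeout },
      });
    } finally {
      clearTimeout(timeoutRef);
      signal?.removeEventListener('abort', onAbort);
    }
  };
}
```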

Issues Fixed

Tasks

Final checklist

  • Domain-specific tests
  • Full tests
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@tegefaulkes tegefaulkes self-assigned this Sep 30, 2022

@tegefaulkes
Contributor Author

This is ready for review now.

@CMCDragonkai
Member

Did you update the intervals to 100ms?

Also, this is only for the network domain, right? And are we applying this everywhere it is applicable in the networking:

  1. Establishing connections
  2. The endGracefully which should have its timeout replaced too
  3. ConnectionReverse.compose
  4. ConnectionReverse.start
  5. Proxy.connectForward
  6. Proxy.connectReverse
  7. Proxy.openConnectionForward
  8. Proxy.openConnectionReverse

I want to ensure that the timerStart and timerStop utilities are no longer being used, and are instead replaced with the Timer class.
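
For illustration, here is roughly the difference, with a minimal promise-based timer standing in for the real Timer class (`withDeadline` is a hypothetical helper):

```ts
// Minimal stand-in for the Timer class: a promise that resolves after a
// delay, plus a cancel() that clears it so it can never fire late.
function sketchTimer(delay: number): { p: Promise<void>; cancel: () => void } {
  let ref: NodeJS.Timeout | undefined;
  const p = new Promise<void>((resolve) => {
    ref = setTimeout(resolve, delay);
  });
  return {
    p,
    cancel: () => {
      if (ref != null) clearTimeout(ref);
    },
  };
}

// Old pattern: const t = timerStart(ms); ...; timerStop(t);
// New pattern: race the work against a first-class, cancellable timer.
async function withDeadline<T>(work: Promise<T>, ms: number): Promise<T> {
  const timer = sketchTimer(ms);
  try {
    return await Promise.race([
      work,
      timer.p.then<T>(() => {
        throw new Error('Timed out');
      }),
    ]);
  } finally {
    timer.cancel(); // don't let the timer keep the event loop alive
  }
}
```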

@CMCDragonkai
Member

Can you start adding new tests for the new timer usage, as well as for explicit cancellation signals that should stop all side effects?
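
Something along these lines, perhaps (a jest-style sketch where `proxy`, `nodeId`, the address arguments, and the rejection message are assumed fixtures, not the real API):

```ts
// Hypothetical jest test: aborting early should reject the operation and
// stop its side effects (e.g. no further hole punch packets are sent).
// `proxy`, `nodeId`, `host`, and `port` are assumed suite fixtures.
test('openConnectionForward rejects on explicit cancellation', async () => {
  const abortController = new AbortController();
  const connProm = proxy.openConnectionForward(nodeId, host, port, {
    signal: abortController.signal,
  });
  abortController.abort(new Error('operation cancelled'));
  await expect(connProm).rejects.toThrow('operation cancelled');
});
```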

@CMCDragonkai
Member

Robust testing needs to be added to check that cancelled connections are gracefully handled from both the forward and reverse sides of the connection.

You need to start on this.

@tegefaulkes
Contributor Author

Yes this was applied to the whole network domain. I've updated everything that had a timeout/timer.

@tegefaulkes
Contributor Author

The networking is like an onion: there are many layers, which makes it tricky to think about. For a robust system each of these layers needs to properly handle timeouts and potential errors. For now I'm reviewing the network architecture to get a clear understanding of these layers.

At almost every layer we will need to handle the following failure conditions.

  1. SRC connection fails to connect to DST due to DST not responding at all.
  2. SRC connection fails to connect to DST due to DST rejecting the connection.
  3. SRC starts connecting to DST but DST fails to fully establish.
  4. SRC starts connecting to DST but SRC rejects part way.
  5. SRC starts connecting to DST but SRC times out part way.
  6. SRC connects to DST but SRC stops responding.
  7. Connection fully succeeds but SRC stops responding.
  8. Connection fully succeeds but SRC forcibly closes the connection.
  9. Connection fully succeeds but DST stops responding.
  10. Connection fully succeeds but DST forcibly closes the connection.

Changes in #445 applied a shorter timeout to nodeManager.ping which resulted in some new errors from the connections. #445 (comment). Ideally we will reveal the source of these problems by expanding the tests here.
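
As a starting point, a sketch of how case 1 (DST not responding at all) might be exercised; `proxy`, `nodeId`, `unusedHost`, and `unusedPort` are assumed fixtures:

```ts
// Hypothetical jest test for failure case 1: nothing is listening at the
// destination, so the attempt should reject once the deadline elapses.
test('connection to an unresponsive DST times out', async () => {
  const abortController = new AbortController();
  // Simulate a 500ms deadline by aborting with a timeout error
  const deadline = setTimeout(
    () => abortController.abort(new Error('Timed out')),
    500,
  );
  const connProm = proxy.openConnectionForward(nodeId, unusedHost, unusedPort, {
    signal: abortController.signal,
  });
  await expect(connProm).rejects.toThrow('Timed out');
  clearTimeout(deadline);
});
```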

@tegefaulkes
Contributor Author

From the Proxy's perspective the first layer is an HTTP connection. When this HTTP connection is received, a forward connection is created if it doesn't already exist. It is then composed by piping data between the HTTP socket and the forward connection.

A few parts of the code here strike me as potential problems.

  1. Handling for the HTTP client ending the connection isn't set up until the connection is composed. This could lead to a race condition where the connection is ended before we compose it, resulting in the connection not ending gracefully. It also means we don't set this up at all if the forward connection fails to be created.
  2. To properly end a connection we should signal end, keep reading any data until the other side signals end, and only then destroy the socket. We already have a method for this in ConnectionForward and ConnectionReverse called endGracefully. We should be doing this for the HTTP connection as well (see the sketch after the next paragraph).
  3. The ConnectionForward.compose method sets up an end event, but it doesn't await clientSocket.end finishing before destroying the connection. This will likely lead to a race condition.

Another thing to note: the HTTP clientSocket can provide its own timeout value. Should we be using this to some degree, especially if it's shorter than our default?
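
To make point 2 concrete, here is a sketch of that end sequence over a plain net.Socket; it is a simplification of what endGracefully would do, and the helper name is hypothetical:

```ts
import type { Socket } from 'net';

// Sketch: signal our end (FIN), keep draining whatever the peer still
// sends, and only destroy once the peer has ended too or a deadline passes.
async function endSocketGracefully(
  socket: Socket,
  timeout: number,
): Promise<void> {
  socket.end(); // half-close: we stop writing but keep reading
  await new Promise<void>((resolve) => {
    const deadline = setTimeout(resolve, timeout);
    socket.on('data', () => {}); // drain remaining incoming data
    socket.once('end', () => {
      clearTimeout(deadline);
      resolve();
    });
  });
  socket.destroy(); // now safe to tear down the underlying handle
}
```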

@tegefaulkes
Contributor Author

tegefaulkes commented Oct 4, 2022

Tests for HTTP-side failures:

  1. HTTP client connects and immediately ends forcibly.
  2. HTTP client connects and immediately ends gracefully.
  3. HTTP client connects with a set timeout and times out immediately.
  4. HTTP client connects and just stops responding before composing.
  5. HTTP client connects and ends forcibly after composing.
  6. HTTP client connects and ends gracefully after composing.
  7. HTTP client connects with a set timeout and times out after composing.
  8. HTTP client connects and just stops responding.
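
For cases 3 and 7, a sketch of an HTTP CONNECT client that enforces its own socket timeout; the host/port values are placeholder fixtures:

```ts
import * as http from 'http';

// Placeholder fixtures for illustration
const proxyHost = '127.0.0.1';
const proxyPort = 1080;
const targetHost = '127.0.0.2';
const targetPort = 1;

// Node emits 'timeout' on the request once the socket has been idle for
// `timeout` ms; it does not destroy the socket for us, so the test client
// does that itself, simulating the client-side timeout cases.
const req = http.request({
  method: 'CONNECT',
  host: proxyHost,
  port: proxyPort,
  path: `${targetHost}:${targetPort}`,
  timeout: 100,
});
req.on('timeout', () => req.destroy(new Error('client side timed out')));
req.on('connect', () => {
  // Tunnel established; post-compose failures can be exercised from here
});
req.end();
```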

@tegefaulkes
Contributor Author

Looks like the HTTP client doesn't fully establish the connection until the ConnectionForward is created and composed. handleConnectForward doesn't send the 200 code until then. This means points 1 and 2 above are moot. Point 3 still stands.

As for the tests, it essentially means we can't do anything to the HTTP clientSocket directly until the forward connection is created and composed. That precludes us from ending or destroying the clientSocket early. That just leaves 'critical' failures where the HTTP client simply stops responding at certain stages.

I think this means the only error a failure on the HTTP client side could cause is an error when writing to it. This would be when we send the 200 code after setting up. This means we only need to add one test for this, but it will have to make use of an exec to run the client in a separate process and kill it.
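
Roughly what that could look like; `./httpClient.js` is a hypothetical script that performs the CONNECT and reports back over IPC:

```ts
import { fork } from 'child_process';

// Run the HTTP client in a separate process, then kill it mid-connection so
// the proxy's next write (e.g. the 200 response) surfaces a write error.
// './httpClient.js' is assumed to send 'connected' over IPC once through.
const client = fork('./httpClient.js', ['127.0.0.1', '1080']);
client.on('message', (msg) => {
  if (msg === 'connected') {
    // SIGKILL skips all cleanup in the child, simulating a hard failure
    client.kill('SIGKILL');
  }
});
```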

@CMCDragonkai
Member

CMCDragonkai commented Oct 4, 2022

Yes this was applied to the whole network domain. I've updated everything that had a timeout/timer.

Yea it looks like the Timer from src/types.ts is still being used in:

  • src/PolykeyClient.ts
  • GRPCClientAgent.ts
  • GRPCClientClient.ts
  • GRPCClient.ts
  • ConnectionForward.ts
  • NodeConnection.ts
  • NodeConnectionManager.ts
  • utils.ts
  • GRPCClientTest.ts

Most of those are GRPC-related changes, and that's coming in the next PR, right?

However, NodeConnection.ts and NodeConnectionManager.ts should be changed in this PR since those usages are dependent on the network domain. I suppose this goes into #465?

@tegefaulkes
Contributor Author

I'm having trouble re-creating the connection bug from #445. Since it's a race condition style bug, it may just be harder to trigger on my VM. Since this mostly affects the nodes domain, I'm going to defer fixing it to issue #465. Writing a test for it is going to be very tricky since it depends heavily on timing. We may not be able to create a consistent test for it.

@CMCDragonkai
Member

Some of the existing network tests simulate a UTP client/server and others simulate a GRPC client/server. Is it possible to use these to reproduce #445?

@tegefaulkes
Contributor Author

It's possible, but it's going to take a bit more work to narrow down the exact conditions causing the problem before I can create a test for it. Right now it's delaying this PR, but it's a distinct enough problem that it can be addressed on its own. Given the difficulty we've had with network problems like this in the past, it could be its own issue.

I'd like to get this PR cleaned, reviewed and merged.

@CMCDragonkai
Member

OK, go ahead. You should then create a separate issue to track that bug. But it really should have instructions on how to reproduce it; otherwise there's not much context to go off of.

@tegefaulkes
Contributor Author

New issue created at #473.

@tegefaulkes
Contributor Author

Changes:

  • The hole punch packet interval has been changed from 1000ms to 50ms.
  • ConnectionForward and ConnectionReverse have been updated:
    • start takes a ContextTimed parameter for cancellation.
    • endGracefully uses timedCancellable with endTime as the default timeout.
    • compose takes a ContextTimed parameter for cancellation.
  • Proxy changes:
    • The forward and reverse connection 'address -> Lock' map was replaced with a LockBox.
    • openConnectionForward was updated with timedCancellable to support cancellation.
    • connectForward was updated with timedCancellable to support cancellation.
    • establishConnectionForward takes a ContextTimed parameter for cancellation.
    • openConnectionReverse was updated with timedCancellable to support cancellation.
    • connectReverse was updated with timedCancellable to support cancellation.
    • establishConnectionReverse takes a ContextTimed parameter for cancellation.
  • NodeConnectionManager had some temporary changes to interface with the above.
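
To show the shape these signatures take, a simplified sketch (the types and class here are illustrative stand-ins, not the exact API):

```ts
// Illustrative stand-ins for the real context types
interface TimerLike {
  getTimeout: () => number;
}

interface ContextTimed {
  signal: AbortSignal;
  timer: TimerLike;
}

// Every layer takes the same context, so deadlines and cancellation set at
// the top (e.g. Proxy.connectForward) propagate down to the raw operations.
class ConnectionSketch {
  public async start(ctx: ContextTimed): Promise<void> {
    if (ctx.signal.aborted) throw ctx.signal.reason;
    await this.handshake(ctx);
  }

  protected async handshake(ctx: ContextTimed): Promise<void> {
    const onAbort = () => {
      // Abort any in-flight work here
    };
    ctx.signal.addEventListener('abort', onAbort, { once: true });
    try {
      // ... perform the handshake within ctx.timer.getTimeout() ms ...
    } finally {
      ctx.signal.removeEventListener('abort', onAbort);
    }
  }
}
```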

@tegefaulkes tegefaulkes marked this pull request as ready for review October 5, 2022 06:38
@tegefaulkes tegefaulkes merged commit 06df14a into staging Oct 5, 2022
@CMCDragonkai
Member

As clarified in the meeting today, the proxy shouldn't be handling failures in the client. The client and proxy trust each other. If the GRPC client fails to respond, that's the client's responsibility to deal with.

Development

Successfully merging this pull request may close these issues.

Update Proxy domain to use timedCancellable