-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: network cancellability and deadlines #468
Conversation
👇 Click on the image for a new way to code review
Legend |
This is at the stage for review now. |
Did you update the the intervals to 100ms? Also this is only for
I want to ensure that |
Can you start add new tests to test the new timer usage. As well as explicit cancellation signals that should be stopping all side effects. |
You need to start on this. |
Yes this was applied to the whole |
The networking is like an onion, there are many layers that makes it annoying to thing about. For a robust system each of these layers needs to properly handle timeouts and potential errors. For now I'm reviewing the network architecture to get a clear understanding of these layers. At almost every layer we will need to handle the following failure conditions.
Changes in #445 applied a shorter timeout to |
From the Proxy's perspective the first layer is a HTTP connection. When this HTTP connection is received, a forward connection is created if it doesn't already exist. It is then composed by piping data between the HTTP socket and the forward connection. Two parts of the code here strike me as potential problems.
Another thing to note, the HTTP clientSocket can provide it's own timeout value, should be be using this to some degree, especially if it's shorter than our default? |
tests for HTTP side failures.
|
Looks like the As for the tests. It essentially means we can't do anything to the HTTP I think this means the only error a failure on the HTTP client side could cause is a error when writing to it. This would be when we send the 200 code after setting up. This means we only need to add 1 test for this, but it will have to make use of a |
Yea it looks like the
Most of that is GRPC related changes, and that's coming in the next PR right? However the |
I'm having trouble re-creating the connection bug from #445. Since it's a race condition style bug, it may just be harder to tigger on my VM. Since this mostly affects the nodes domain, I'm going to defer fixing it to issue #465. Writing a test for it is going to very tricky since it depends heavily on timing. We may not be able to create a consistent test for it. |
Some of the existing network tests simulate a UTP client/server and others simulate a GRPC client/server. Is it possible to use this to reproduce #445? |
It's possible, but it's going to take a bit more work to narrow down the exact conditions causing the problem before I can create a test for it. Right now it's delaying this PR but is a separate enough problem that it can be addressed separately. Given the difficulty we've had with network problems like this in the past, it could be it's own issue. I'd like to get this PR cleaned, reviewed and merged. |
Ok go ahead. You should then create a separate issue to track that bug then. But it really should have instructions as to how to reproduce that bug. Otherwise there's not much context to go off from. |
New issue create at #473. |
Changes:
|
db9945f
to
af97f89
Compare
af97f89
to
6d0b677
Compare
As clarified in the meeting today, the proxy shouldn't be handling failures in the client. The client and proxy trust each other. If the GRPC client fails to respond, that's the client's responsibility to deal with. |
Description
This PR focuses on updating the network domain with
timedCancellable
, general cancellability and deadlines.Issues Fixed
timedCancellable
across the board to control how long side-effects are allowed to complete #450Proxy
domain to usetimedCancellable
#464Tasks
timedCancellable
decorator.LockBox
.Add tests to check for the problem in Integrate TaskManager into NodeGraph and Discovery #445 (comment) . the problem should also be fixed.New issue at Fix connection bugs resulting from short timeout onNodeManager.pingNode
#473Final checklist