Description
When an awc client makes repeated requests to a single downstream service using an optimistic timeout, it can be expected that these timeouts will trigger fairly regularly. When awc enforces a timeout, it tells h2 to reset the stream.
h2 keeps a count of these locally-reset streams ("num_local_error_resets") against a default limit of 1024. Once this limit is reached, h2 injects GoAway errors into any further attempt to use that connection.
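For illustration, a minimal sketch (my own, not awc or h2 code) of how this poisoned state surfaces to a caller, using the public `h2` error accessors:

```rust
// Illustrative only: once h2's locally-reset-streams limit is hit, the error
// it surfaces for any further use of the connection reports as a GOAWAY,
// not as an I/O error.
fn poisoned_by_reset_limit(err: &h2::Error) -> bool {
    err.is_go_away() // err.is_io() remains false for this injected error
}
```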
Expected Behavior
It is expected that, when attempting to re-use a connection, awc will observe this GoAway (returned from either io.poll_ready or io.send_request, both in send_request in h2proto.rs) and decide not to return the connection to the pool by calling io.on_release(true).
This way, once the connection is effectively poisoned by h2, further requests with the client will create a new connection and the user will experience no further issues beyond the timeouts themselves.
Current Behavior
Currently, when handling errors from either io.poll_ready or io.send_request, io.on_release is called with the argument err.is_io(), which means errors always return the connection to the pool unless the error was an IO error.
GoAway errors are not IO errors, so the connection is returned to the pool and immediately re-used for the next request, which then fails within microseconds as io.poll_ready immediately returns the GoAway error. The client therefore starts experiencing repeated, instantaneous errors on its requests to that downstream.
This invalid state can take a long time to recover from. The connector's conn_lifetime is one mechanism to cap the possible duration of the invalid state, but setting it too low creates otherwise unnecessary connection pressure in high-performance systems, and it doesn't stop the invalid state from occurring, only reduces its impact. Experimenting with a high conn_lifetime of around 1 hour (in my case I'm making server-to-server calls over very solid connections and see no need for a lifetime limit at all), I see that these errors happen regularly, and the invalid state was observed to persist for sometimes 15 minutes, sometimes 45 minutes, seemingly a function of how far into the connection's lifetime it was poisoned by h2.
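For reference, a minimal sketch of that workaround (the 60-second duration is illustrative, not a recommendation):

```rust
use std::time::Duration;
use awc::{Client, Connector};

// Workaround sketch: cap connection lifetime so a poisoned connection is
// eventually discarded. This only bounds how long the invalid state can
// last; it does not prevent it, and it adds reconnection pressure.
fn build_client() -> Client {
    Client::builder()
        .connector(Connector::new().conn_lifetime(Duration::from_secs(60)))
        .finish()
}
```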
Possible Solution
I think this may be as easy as changing the two error handling on_release calls in send_request in h2proto.rs from:
`io.on_release(err.is_io());`

to:

`io.on_release(err.is_io() || err.is_go_away());`

That said, I'm certainly not a protocol expert and don't know if this would be overly pessimistic and cause otherwise-healthy connections to be dropped, and there could certainly be other mechanisms that need to be considered that I am unaware of.
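Roughly, the shape of the change would be as below (paraphrased, not the verbatim h2proto.rs error path; only the on_release argument changes):

```rust
// Paraphrased sketch of the error path in send_request (h2proto.rs), not the
// verbatim upstream code. The proposed change is only the on_release() argument.
if let Err(err) = std::future::poll_fn(|cx| io.poll_ready(cx)).await {
    // Previously: io.on_release(err.is_io());
    // Proposed: a GOAWAY also means the connection must not go back to the pool.
    io.on_release(err.is_io() || err.is_go_away());
    return Err(err.into());
}
```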
I would be happy to create this PR myself, but given my above-mentioned lack of domain expertise I don't feel confident that I know all the potential knock-on effects.
Steps to Reproduce (for bugs)
- Make continuous requests to the same downstream system with some timing variance (due to networking, load, caching, etc.).
- Use a tight-ish timeout (8ms in my case) that will sometimes be exceeded. Set this timeout on both the `ClientRequest` object and the `ClientResponse` to respect upstream response-time SLAs.
- Ensure enough volume is generated and/or the `conn_lifetime` is high enough that more than 1024 timeouts occur during a single connection's lifetime. If logging is enabled you will see a log from `h2` indicating the error condition has been triggered, e.g. `WARN h2::proto::streams::streams: [actix-rt|system:0|arbiter:4] locally-reset streams reached limit (1024)`.
- Observe that requests after the 1024th timeout can fail immediately (microseconds-to-fail rather than 8ms). Because multiple HTTP/2 connections may be open, not every request will necessarily fail; the failure rate is proportional to how many of the open connections have been poisoned. A sketch of this setup follows below.
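A hedged, self-contained sketch of the reproduction setup (the downstream URL is a placeholder, TLS/ALPN configuration needed to negotiate HTTP/2 is omitted, and it assumes awc 3.x's `ClientRequest::timeout` and `ClientResponse::timeout`):

```rust
use std::time::Duration;
use awc::{Client, Connector};

#[actix_rt::main]
async fn main() {
    let client = Client::builder()
        // keep connections alive long enough for >1024 timeouts to accumulate
        .connector(Connector::new().conn_lifetime(Duration::from_secs(3600)))
        .finish();

    loop {
        let sent = client
            .get("http://downstream.example/endpoint") // placeholder URL
            .timeout(Duration::from_millis(8)) // tight-ish request timeout
            .send()
            .await;

        match sent {
            // also bound how long we wait for the response body
            Ok(resp) => match resp.timeout(Duration::from_millis(8)).body().await {
                Ok(_body) => {}
                Err(err) => eprintln!("body error: {err}"),
            },
            // after the 1024th reset, these errors start arriving in microseconds
            Err(err) => eprintln!("request error: {err}"),
        }
    }
}
```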
Context
I maintain a service which calls multiple downstream services server-to-server very regularly using HTTP/2. This service transacts millions of QPS and needs to perform very well to keep costs down.
Upstream and downstream of my application are both first-party services from the same organization and connections are usually very healthy. I have been attempting to migrate away from reqwest as the service is built on actix-web and I have seen performance issues using reqwest as I believe that reqwest doesn't follow the cooperative programming principles required for working in a thread-per-core framework like actix (principally how reqwest / hyper backgrounds tasks while awaiting a new HTTP/2 connection).
In my transition to awc I have immediately seen slightly better tail latencies, but I didn't manage to test under proper production load because the aforementioned invalid connection re-use appeared and derailed the test with an unacceptable increase in error rates. If I can resolve this issue I believe awc will cooperate with the application much better than reqwest, particularly in heavily loaded scenarios.
Your Environment
- Dockerized within a debian:bullseye-slim container
- Rust: 1.88
- awc: 3.8.0
- actix-web: 4.11.0
- actix-rt: 2.10.0