Description
When an awc client makes repeated requests to a single downstream service using an optimistic timeout, it can be expected that these timeouts will trigger fairly regularly. When awc enforces a timeout, it tells h2 to reset the stream.
h2 keeps a count of these locally-reset streams ("num_local_error_resets") against a default limit of 1024. Once this limit is reached, h2 injects GoAway errors into any further attempt to use that connection.
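For illustration, a minimal sketch (my own, not awc or h2 code) of how this poisoned state surfaces to a caller, using the public `h2` error accessors:

```rust
// Illustrative only: once h2's locally-reset-streams limit is hit, the error
// it surfaces for any further use of the connection reports as a GOAWAY,
// not as an I/O error.
fn poisoned_by_reset_limit(err: &h2::Error) -> bool {
    err.is_go_away() // err.is_io() remains false for this injected error
}
```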
Expected Behavior
It is expected that, when attempting to re-use a connection, awc will observe this GoAway (returned from either io.poll_ready or io.send_request, both in send_request in h2proto.rs) and decide not to return the connection to the pool by calling io.on_release(true).
This way, once the connection is effectively poisoned by h2, further requests with the client will create a new connection and the user will experience no further issues beyond the timeouts themselves.
Current Behavior
Currently, when handling errors from either io.poll_ready or io.send_request, io.on_release is called with the argument err.is_io(), which means errors always return the connection to the pool unless the error was an IO error.
GoAway errors are not IO errors, so the connection is returned to the pool and immediately re-used for the next request, which then fails within microseconds as io.poll_ready immediately returns the GoAway error. The client therefore starts experiencing repeated, instantaneous errors on its requests to that downstream.
This invalid state can take a long time to recover from. The connector's conn_lifetime is one mechanism to cap the possible duration of the invalid state, but setting it too low creates otherwise unnecessary connection pressure in high-performance systems, and it doesn't stop the invalid state from occurring, only reduces its impact. Experimenting with a high conn_lifetime of around 1 hour (in my case I'm making server-to-server calls over very solid connections and see no need for a lifetime limit at all), I see that these errors happen regularly, and the invalid state was observed to persist for sometimes 15 minutes, sometimes 45 minutes, seemingly a function of how far into the connection's lifetime it was poisoned by h2.
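For reference, a minimal sketch of that workaround (the 60-second duration is illustrative, not a recommendation):

```rust
use std::time::Duration;
use awc::{Client, Connector};

// Workaround sketch: cap connection lifetime so a poisoned connection is
// eventually discarded. This only bounds how long the invalid state can
// last; it does not prevent it, and it adds reconnection pressure.
fn build_client() -> Client {
    Client::builder()
        .connector(Connector::new().conn_lifetime(Duration::from_secs(60)))
        .finish()
}
```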
Possible Solution
I think this may be as easy as changing the two error handling on_release calls in send_request in h2proto.rs from:
`io.on_release(err.is_io());`

to:

`io.on_release(err.is_io() || err.is_go_away());`

That said, I'm certainly not a protocol expert and don't know if this would be overly pessimistic and cause otherwise-healthy connections to be dropped, and there could certainly be other mechanisms that need to be considered that I am unaware of.
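Roughly, the shape of the change would be as below (paraphrased, not the verbatim h2proto.rs error path; only the on_release argument changes):

```rust
// Paraphrased sketch of the error path in send_request (h2proto.rs), not the
// verbatim upstream code. The proposed change is only the on_release() argument.
if let Err(err) = std::future::poll_fn(|cx| io.poll_ready(cx)).await {
    // Previously: io.on_release(err.is_io());
    // Proposed: a GOAWAY also means the connection must not go back to the pool.
    io.on_release(err.is_io() || err.is_go_away());
    return Err(err.into());
}
```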
I would be happy to create this PR myself, but given my above-mentioned lack of domain expertise I don't feel confident that I know all the potential knock-on effects.
Steps to Reproduce (for bugs)
- Make continuous requests to the same downstream system with some timing variance (due to networking, load, caching, etc.).
- Use a tight-ish timeout (8ms in my case) that will sometimes be exceeded. Set this timeout on both the `ClientRequest` object and the `ClientResponse` to respect upstream response-time SLAs.
- Ensure enough volume is generated and/or the `conn_lifetime` is high enough that more than 1024 timeouts occur during a single connection's lifetime. If logging is enabled you will see a log from `h2` indicating the error condition has been triggered, e.g. `WARN h2::proto::streams::streams: [actix-rt|system:0|arbiter:4] locally-reset streams reached limit (1024)`.
- Observe that requests after the 1024th timeout can fail immediately (microseconds-to-fail rather than 8ms). Because multiple HTTP/2 connections may be open, not every request will necessarily fail; the failure rate is proportional to how many of the open connections have been poisoned. A sketch of this setup follows below.
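A hedged, self-contained sketch of the reproduction setup (the downstream URL is a placeholder, TLS/ALPN configuration needed to negotiate HTTP/2 is omitted, and it assumes awc 3.x's `ClientRequest::timeout` and `ClientResponse::timeout`):

```rust
use std::time::Duration;
use awc::{Client, Connector};

#[actix_rt::main]
async fn main() {
    let client = Client::builder()
        // keep connections alive long enough for >1024 timeouts to accumulate
        .connector(Connector::new().conn_lifetime(Duration::from_secs(3600)))
        .finish();

    loop {
        let sent = client
            .get("http://downstream.example/endpoint") // placeholder URL
            .timeout(Duration::from_millis(8)) // tight-ish request timeout
            .send()
            .await;

        match sent {
            // also bound how long we wait for the response body
            Ok(resp) => match resp.timeout(Duration::from_millis(8)).body().await {
                Ok(_body) => {}
                Err(err) => eprintln!("body error: {err}"),
            },
            // after the 1024th reset, these errors start arriving in microseconds
            Err(err) => eprintln!("request error: {err}"),
        }
    }
}
```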
Context
I maintain a service which calls multiple downstream services server-to-server very regularly using HTTP/2. This service transacts millions of QPS and needs to perform very well to keep costs down.
Upstream and downstream of my application are both first-party services from the same organization and connections are usually very healthy. I have been attempting to migrate away from reqwest as the service is built on actix-web and I have seen performance issues using reqwest as I believe that reqwest doesn't follow the cooperative programming principles required for working in a thread-per-core framework like actix (principally how reqwest / hyper backgrounds tasks while awaiting a new HTTP/2 connection).
In my transition to awc I have immediately seen slightly better tail latencies, but I didn't manage to test under proper production load because the aforementioned invalid connection re-use appeared and derailed the test with an unacceptable increase in error rates. If I can resolve this issue I believe awc will cooperate with the application much better than reqwest, particularly in heavily loaded scenarios.
Your Environment
- Dockerized within a debian:bullseye-slim container
- Rust: 1.88
- awc: 3.8.0
- actix-web: 4.11.0
- actix-rt: 2.10.0