Streaming latency spikes: receive loop processes all buffered frames before yielding #879

@SeaOtocinclus

Description

poll2() tight loop can bypass Tokio cooperative scheduling — exploring yield-between-frames for streaming workloads

Summary

On long-lived streaming connections (SSE, gRPC streaming, IoT feeds), I'm seeing a bimodal inter-frame latency pattern with h2: frames delivered within a TCP burst arrive in microseconds, but the gap between bursts can be hundreds of milliseconds.

From reading the code path, a likely contributor is that Connection::poll2() will process all already-buffered frames in one poll loop before yielding back to the runtime. In bursty conditions this can:

  • monopolize the connection task for the duration of the burst,
  • delay poll_complete() (and therefore flushing WINDOW_UPDATEs),
  • reduce fairness vs other tasks on the same Tokio runtime.

This is great for throughput / request-response, but can be awkward for high-frequency small-message streaming where tail latency matters more than peak throughput. I'm exploring whether an opt-in or ecosystem-standard yielding approach makes sense here, and would love maintainer guidance on preferred direction.

Evidence

All runs use tcp_nodelay(true) (so this shouldn't be Nagle-related; cf. hyper #3187 discussions). Same sender, same wire data.

Benchmark results (200s, 324K+ samples each)

Measured inter-arrival time between consecutive body.frame().await returns on the receiver.

HTTP/2 Library    p50 latency   p95 latency   Max gap       Jitter (σ)
h2 0.4.x          1 μs          1,242 μs      440,904 μs    36,692 μs
proxygen (C++)    6 μs          72 μs         101,360 μs    3,542 μs

Key observation: h2's p50 is faster (1 μs vs 6 μs), but its tail is worse under this burst pattern: p95 is ~17× higher and the max gap is ~4× larger.

Environment & methodology
  • h2: 0.4.13, hyper: 1.8.1
  • Tokio: multi-threaded runtime
  • OS: macOS 14 (arm64) and Linux (similar behavior)
  • Workload: 800 Hz sensor stream over USB (low latency / high bandwidth tends to amplify TCP burstiness)
  • Measurement: monotonic clock; inter-arrival between consecutive body.frame().await returns; stable across runs
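
For context on how the table numbers are derived: the raw samples are inter-arrival deltas in microseconds, summarized roughly as follows (a std-only sketch, not the actual harness; the nearest-rank percentile choice is an assumption):

```rust
/// Summarize inter-arrival samples (µs) into (p50, p95, max, σ),
/// matching the columns in the table above.
fn summarize(mut samples: Vec<f64>) -> (f64, f64, f64, f64) {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank percentile on the sorted samples.
    let pct = |s: &[f64], p: f64| s[(p * (s.len() - 1) as f64).round() as usize];
    let max = *samples.last().expect("need at least one sample");
    let mean = samples.iter().sum::<f64>() / samples.len() as f64;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / samples.len() as f64;
    (pct(&samples, 0.50), pct(&samples, 0.95), max, var.sqrt())
}
```

(σ here is the population standard deviation; with 324K+ samples the sample/population distinction is negligible.)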

Analysis (what I think is happening)

In src/proto/connection.rs, Connection::poll2() runs a tight loop and continues as long as codec.poll_next(cx) is Ready:

fn poll2(&mut self, cx: &mut Context) -> Poll<Result<(), Error>> {
    loop {
        match self.inner.as_dyn()
            .recv_frame(ready!(Pin::new(&mut self.codec).poll_next(cx)?))?
        {
            ReceivedFrame::Continue => (),
            ReceivedFrame::Done => return Poll::Ready(Ok(())),
            // ...
        }
    }
}

When TCP delivers a burst containing N complete HTTP/2 frames, a single socket read can fill the codec buffer. After that, subsequent poll_next() calls decode frames straight from memory without additional I/O. Tokio's cooperative budget is charged by leaf resources such as the socket's poll_read, not by in-memory decoding, so mid-burst there may be nothing that forces a yield, and poll2() can drain the entire burst before returning to the runtime.

Secondary contributor: WINDOW_UPDATE timing

Separately, WINDOW_UPDATEs are intentionally batched (e.g. UNCLAIMED_DENOMINATOR = 2), and they're flushed from poll_complete(). If poll2() stays in its loop for a long burst, poll_complete() (and thus WINDOW_UPDATE emission) is delayed until the loop finishes. For streaming, yielding more frequently could improve flow-control responsiveness.

Why "typical" gRPC streaming may not show this

At lower rates (e.g., <100 Hz) and/or with larger messages, each poll tends to process only a small number of frames, so you don't hit the "many small frames coalesced into one read buffer" regime. This seems to show up most strongly with high-frequency, small DATA frames.
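
Back-of-envelope arithmetic for that regime (the ~100 ms gap is taken from the observed inter-burst latency above; the numbers are illustrative):

```rust
/// Roughly how many frames coalesce into one burst (and hence into
/// one poll2() drain) at a given message rate and inter-burst gap.
fn frames_per_burst(rate_hz: f64, burst_gap_s: f64) -> f64 {
    rate_hz * burst_gap_s
}
```

At the 800 Hz workload above with a ~100 ms gap, that is ~80 small DATA frames decoded out of one read buffer; at 100 Hz it drops to ~10, which may be why lower-rate gRPC streams rarely show this.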

This pattern matches what the tonic project discussed/changed around streaming tail latency (frame batching + yielding).

Possible approaches

Option A (preferred to discuss): participate in Tokio cooperative scheduling (no new API)

Use Tokio's coop budget mechanism to occasionally yield from the loop:

fn poll2(&mut self, cx: &mut Context) -> Poll<Result<(), Error>> {
    loop {
        ready!(tokio::task::coop::poll_proceed(cx));
        match self.inner.as_dyn()
            .recv_frame(ready!(Pin::new(&mut self.codec).poll_next(cx)?))?
        {
            ReceivedFrame::Continue => (),
            ReceivedFrame::Done => return Poll::Ready(Ok(())),
            // ...
        }
    }
}

poll_proceed is a public API in Tokio, gated behind the rt feature (which h2 already enables). This is the idiomatic way to avoid tight-loop monopolization in Tokio-based crates, without adding any user-facing knobs.

If maintainers prefer to keep h2 more runtime-agnostic, a manual budget counter with cx.waker().wake_by_ref() + Poll::Pending after N iterations would achieve the same effect without the Tokio coupling — though the budget would be disconnected from Tokio's actual task-level coop budget.
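
A minimal sketch of that runtime-agnostic variant (names and the budget value are illustrative, not h2 API; frame processing is stubbed out as a counter):

```rust
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

const YIELD_AFTER: usize = 64; // hypothetical per-poll budget

/// Drain at most YIELD_AFTER buffered frames, then self-wake and
/// return Pending so the executor can schedule other tasks.
fn poll_drain(cx: &mut Context<'_>, buffered: &mut usize, processed: &mut usize) -> Poll<()> {
    let mut budget = YIELD_AFTER;
    while *buffered > 0 {
        if budget == 0 {
            // Out of budget: make sure we get re-polled, then yield.
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }
        budget -= 1;
        *buffered -= 1; // stand-in for recv_frame(..)
        *processed += 1;
    }
    Poll::Ready(())
}

/// No-op waker, just enough to drive poll_drain by hand.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}
```

Draining a 150-frame burst then takes three polls instead of one, giving the scheduler a chance to run poll_complete() and other tasks in between.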

Option B: configurable limit (more control, more invasive)

A per-connection configuration like:

builder.max_data_frames_per_poll(1);

Caveat: today ReceivedFrame::Continue does not distinguish DATA vs control frames, so implementing "count DATA frames" likely requires threading frame-type info out of recv_frame() or adding a dedicated variant—more invasive than Option A.

Trade-offs / concerns (why I'm asking before coding)

  • Throughput vs latency: current behavior is throughput-optimal; yielding introduces scheduling overhead and increases total burst drain time.
  • Fairness is best-effort: even if we return Pending, the task may be re-polled quickly if the runtime is otherwise idle.
  • API surface: Option B adds knobs; Option A avoids that.

Related work / precedent

  • tonic: streaming tail latency discussion and fixes (e.g. #1375 / #1423)
  • Tokio: coop::poll_proceed intended for tight polling loops (e.g. #4498 / #7405)
  • h2 precedent for per-connection knobs: Implement server::Builder::max_send_buffer_size #577 (max_send_buffer_size)

Next steps

If maintainers think this is worth addressing, I'm happy to:

  • start with a focused benchmark / regression test that reproduces the bimodal distribution, then
  • send a PR for Option A (or whatever approach you prefer), along with throughput regression numbers.

I'm also very open to being told "this is by design" for h2's goals, or to alternative suggestions consistent with h2's architecture.
