Streaming latency spikes: receive loop processes all buffered frames before yielding #879

@SeaOtocinclus

Description

poll2() tight loop can bypass Tokio cooperative scheduling — exploring yield-between-frames for streaming workloads

Summary

On long-lived streaming connections (SSE, gRPC streaming, IoT feeds), I'm seeing a bimodal inter-frame latency pattern with h2: frames delivered within a TCP burst arrive in microseconds, but the gap between bursts can be hundreds of milliseconds.

From reading the code path, a likely contributor is that Connection::poll2() will process all already-buffered frames in one poll loop before yielding back to the runtime. In bursty conditions this can:

  • monopolize the connection task for the duration of the burst,
  • delay poll_complete() (and therefore flushing WINDOW_UPDATEs),
  • reduce fairness vs other tasks on the same Tokio runtime.

This is great for throughput / request-response, but can be awkward for high-frequency small-message streaming where tail latency matters more than peak throughput. I'm exploring whether an opt-in or ecosystem-standard yielding approach makes sense here, and would love maintainer guidance on preferred direction.

Evidence

All runs use tcp_nodelay(true) (so this shouldn't be Nagle-related; cf. hyper #3187 discussions). Same sender, same wire data.

Benchmark results (200s, 324K+ samples each)

Measured inter-arrival time between consecutive body.frame().await returns on the receiver.

HTTP/2 Library    p50 latency   p95 latency   Max gap       Jitter (σ)
h2 0.4.x          1 μs          1,242 μs      440,904 μs    36,692 μs
proxygen (C++)    6 μs          72 μs         101,360 μs    3,542 μs

Key observation: h2's p50 is faster (1 μs vs 6 μs), but its tail is worse under this burst pattern: p95 is ~17× higher and the max gap is ~4× larger.

Environment & methodology
  • h2: 0.4.13, hyper: 1.8.1
  • Tokio: multi-threaded runtime
  • OS: macOS 14 (arm64) and Linux (similar behavior)
  • Workload: 800 Hz sensor stream over USB (low latency / high bandwidth tends to amplify TCP burstiness)
  • Measurement: monotonic clock; inter-arrival between consecutive body.frame().await returns; stable across runs
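
For context on how the table numbers are derived: the raw samples are inter-arrival deltas in microseconds, summarized roughly as follows (a std-only sketch, not the actual harness; the nearest-rank percentile choice is an assumption):

```rust
/// Summarize inter-arrival samples (µs) into (p50, p95, max, σ),
/// matching the columns in the table above.
fn summarize(mut samples: Vec<f64>) -> (f64, f64, f64, f64) {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank percentile on the sorted samples.
    let pct = |s: &[f64], p: f64| s[(p * (s.len() - 1) as f64).round() as usize];
    let max = *samples.last().expect("need at least one sample");
    let mean = samples.iter().sum::<f64>() / samples.len() as f64;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / samples.len() as f64;
    (pct(&samples, 0.50), pct(&samples, 0.95), max, var.sqrt())
}
```

(σ here is the population standard deviation; with 324K+ samples the sample/population distinction is negligible.)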

Analysis (what I think is happening)

In src/proto/connection.rs, Connection::poll2() runs a tight loop and continues as long as codec.poll_next(cx) is Ready:

fn poll2(&mut self, cx: &mut Context) -> Poll<Result<(), Error>> {
    loop {
        match self.inner.as_dyn()
            .recv_frame(ready!(Pin::new(&mut self.codec).poll_next(cx)?))?
        {
            ReceivedFrame::Continue => (),
            ReceivedFrame::Done => return Poll::Ready(Ok(())),
            // ...
        }
    }
}

When TCP delivers a burst containing N complete HTTP/2 frames, a single socket read can fill the codec buffer. After that, subsequent poll_next() calls decode frames straight from memory without additional I/O. Tokio's cooperative budget is charged by leaf resources such as the socket's poll_read, not by in-memory decoding, so mid-burst there may be nothing that forces a yield, and poll2() can drain the entire burst before returning to the runtime.

Secondary contributor: WINDOW_UPDATE timing

Separately, WINDOW_UPDATEs are intentionally batched (e.g. UNCLAIMED_DENOMINATOR = 2), and they're flushed from poll_complete(). If poll2() stays in its loop for a long burst, poll_complete() (and thus WINDOW_UPDATE emission) is delayed until the loop finishes. For streaming, yielding more frequently could improve flow-control responsiveness.

Why "typical" gRPC streaming may not show this

At lower rates (e.g., <100 Hz) and/or with larger messages, each poll tends to process only a small number of frames, so you don't hit the "many small frames coalesced into one read buffer" regime. This seems to show up most strongly with high-frequency, small DATA frames.
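
Back-of-envelope arithmetic for that regime (the ~100 ms gap is taken from the observed inter-burst latency above; the numbers are illustrative):

```rust
/// Roughly how many frames coalesce into one burst (and hence into
/// one poll2() drain) at a given message rate and inter-burst gap.
fn frames_per_burst(rate_hz: f64, burst_gap_s: f64) -> f64 {
    rate_hz * burst_gap_s
}
```

At the 800 Hz workload above with a ~100 ms gap, that is ~80 small DATA frames decoded out of one read buffer; at 100 Hz it drops to ~10, which may be why lower-rate gRPC streams rarely show this.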

This pattern matches what the tonic project discussed/changed around streaming tail latency (frame batching + yielding).

Possible approaches

Option A (preferred to discuss): participate in Tokio cooperative scheduling (no new API)

Use Tokio's coop budget mechanism to occasionally yield from the loop:

fn poll2(&mut self, cx: &mut Context) -> Poll<Result<(), Error>> {
    loop {
        ready!(tokio::task::coop::poll_proceed(cx));
        match self.inner.as_dyn()
            .recv_frame(ready!(Pin::new(&mut self.codec).poll_next(cx)?))?
        {
            ReceivedFrame::Continue => (),
            ReceivedFrame::Done => return Poll::Ready(Ok(())),
            // ...
        }
    }
}

poll_proceed is a public API in Tokio, gated behind the rt feature (which h2 already enables). This is the idiomatic way to avoid tight-loop monopolization in Tokio-based crates, without adding any user-facing knobs.

If maintainers prefer to keep h2 more runtime-agnostic, a manual budget counter with cx.waker().wake_by_ref() + Poll::Pending after N iterations would achieve the same effect without the Tokio coupling — though the budget would be disconnected from Tokio's actual task-level coop budget.
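
A minimal sketch of that runtime-agnostic variant (names and the budget value are illustrative, not h2 API; frame processing is stubbed out as a counter):

```rust
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

const YIELD_AFTER: usize = 64; // hypothetical per-poll budget

/// Drain at most YIELD_AFTER buffered frames, then self-wake and
/// return Pending so the executor can schedule other tasks.
fn poll_drain(cx: &mut Context<'_>, buffered: &mut usize, processed: &mut usize) -> Poll<()> {
    let mut budget = YIELD_AFTER;
    while *buffered > 0 {
        if budget == 0 {
            // Out of budget: make sure we get re-polled, then yield.
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }
        budget -= 1;
        *buffered -= 1; // stand-in for recv_frame(..)
        *processed += 1;
    }
    Poll::Ready(())
}

/// No-op waker, just enough to drive poll_drain by hand.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}
```

Draining a 150-frame burst then takes three polls instead of one, giving the scheduler a chance to run poll_complete() and other tasks in between.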

Option B: configurable limit (more control, more invasive)

A per-connection configuration like:

builder.max_data_frames_per_poll(1);

Caveat: today ReceivedFrame::Continue does not distinguish DATA vs control frames, so implementing "count DATA frames" likely requires threading frame-type info out of recv_frame() or adding a dedicated variant—more invasive than Option A.

Trade-offs / concerns (why I'm asking before coding)

  • Throughput vs latency: current behavior is throughput-optimal; yielding introduces scheduling overhead and increases total burst drain time.
  • Fairness is best-effort: even if we return Pending, the task may be re-polled quickly if the runtime is otherwise idle.
  • API surface: Option B adds knobs; Option A avoids that.

Related work / precedent

  • tonic: streaming tail latency discussion and fixes (e.g. #1375 / #1423)
  • Tokio: coop::poll_proceed intended for tight polling loops (e.g. #4498 / #7405)
  • h2 precedent for per-connection knobs: Implement server::Builder::max_send_buffer_size #577 (max_send_buffer_size)

Next steps

If maintainers think this is worth addressing, I'm happy to:

  • start with a focused benchmark / regression test that reproduces the bimodal distribution, then
  • send a PR for Option A (or whatever approach you prefer), along with throughput regression numbers.

I'm also very open to being told "this is by design" for h2's goals, or to alternative suggestions consistent with h2's architecture.
