poll2() tight loop can bypass Tokio cooperative scheduling — exploring yield-between-frames for streaming workloads
Summary
On long-lived streaming connections (SSE, gRPC streaming, IoT feeds), I'm seeing a bimodal inter-frame latency pattern with h2: frames delivered within a TCP burst arrive in microseconds, but the gap between bursts can be hundreds of milliseconds.
From reading the code path, a likely contributor is that Connection::poll2() will process all already-buffered frames in one poll loop before yielding back to the runtime. In bursty conditions this can:
- monopolize the connection task for the duration of the burst,
- delay poll_complete() (and therefore flushing WINDOW_UPDATEs),
- reduce fairness vs. other tasks on the same Tokio runtime.
This is great for throughput / request-response, but can be awkward for high-frequency small-message streaming where tail latency matters more than peak throughput. I'm exploring whether an opt-in or ecosystem-standard yielding approach makes sense here, and would love maintainer guidance on preferred direction.
Evidence
All runs use tcp_nodelay(true) (so this shouldn't be Nagle-related; cf. hyper #3187 discussions). Same sender, same wire data.
Benchmark results (200s, 324K+ samples each)
Measured inter-arrival time between consecutive body.frame().await returns on the receiver.
| HTTP/2 Library | p50 latency | p95 latency | Max gap | Jitter (σ) |
|---|---|---|---|---|
| h2 0.4.x | 1 μs | 1,242 μs | 440,904 μs | 36,692 μs |
| proxygen (C++) | 6 μs | 72 μs | 101,360 μs | 3,542 μs |
Key observation: h2's p50 is faster (1 μs vs 6 μs), but tail is worse: p95 ~17× and max gap ~4× under this burst pattern.
Environment & methodology
- h2: 0.4.13, hyper: 1.8.1
- Tokio: multi-threaded runtime
- OS: macOS 14 (arm64) and Linux (similar behavior)
- Workload: 800 Hz sensor stream over USB (low latency / high bandwidth tends to amplify TCP burstiness)
- Measurement: monotonic clock; inter-arrival time between consecutive body.frame().await returns; stable across runs
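For reference, the percentile figures in the table were computed along these lines. This is a minimal sketch: the helper name and the synthetic data are mine, and the actual harness records gaps as monotonic-clock deltas between consecutive body.frame().await returns.

```rust
// Sketch: compute (p50, p95, max) over inter-arrival gaps in microseconds.
// In the real harness the gaps come from Instant::now() deltas between
// consecutive body.frame().await returns; here they are synthetic.
fn percentiles_us(mut gaps_us: Vec<u64>) -> (u64, u64, u64) {
    assert!(!gaps_us.is_empty());
    gaps_us.sort_unstable();
    // Nearest-rank-style pick by index; fine for 324K+ samples.
    let pick = |p: f64| gaps_us[((gaps_us.len() - 1) as f64 * p) as usize];
    (pick(0.50), pick(0.95), *gaps_us.last().unwrap())
}

fn main() {
    // Synthetic gaps of 1..=100 μs: p50 = 50, p95 = 95, max = 100.
    let gaps: Vec<u64> = (1..=100).collect();
    println!("{:?}", percentiles_us(gaps)); // → (50, 95, 100)
}
```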
Analysis (what I think is happening)
In src/proto/connection.rs, Connection::poll2() runs a tight loop and continues as long as codec.poll_next(cx) is Ready:
```rust
fn poll2(&mut self, cx: &mut Context) -> Poll<Result<(), Error>> {
    loop {
        match self.inner.as_dyn()
            .recv_frame(ready!(Pin::new(&mut self.codec).poll_next(cx)?))?
        {
            ReceivedFrame::Continue => (),
            ReceivedFrame::Done => return Poll::Ready(Ok(())),
            // ...
        }
    }
}
```
When TCP delivers a burst containing N complete HTTP/2 frames, a single socket read can fill the codec buffer. After that, multiple poll_next() calls may decode frames from memory without additional I/O. In that situation, the cooperative-scheduling pressure that Tokio normally applies around repeated poll_read calls may not kick in mid-burst, so poll2() can drain the entire burst before yielding.
Secondary contributor: WINDOW_UPDATE timing
Separately, WINDOW_UPDATEs are intentionally batched (e.g. UNCLAIMED_DENOMINATOR = 2), and they're flushed from poll_complete(). If poll2() stays in its loop for a long burst, poll_complete() (and thus WINDOW_UPDATE emission) is delayed until the loop finishes. For streaming, yielding more frequently could improve flow-control responsiveness.
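The batching described above amounts to a threshold check roughly like the following. This is a sketch of the idea only, not h2's actual flow-control code; the names mirror the issue text and the real logic differs in detail.

```rust
// Sketch of threshold-batched WINDOW_UPDATEs: capacity released by the
// application accumulates as "unclaimed", and an update is only flushed
// once it reaches window / DENOMINATOR. Until poll_complete() runs, even
// an above-threshold update stays queued — which is the interaction with
// the poll2() loop discussed above.
const UNCLAIMED_DENOMINATOR: u32 = 2;

fn should_send_window_update(unclaimed: u32, window: u32) -> bool {
    unclaimed >= window / UNCLAIMED_DENOMINATOR
}

fn main() {
    let window = 65_536;
    println!("{}", should_send_window_update(16_384, window)); // → false
    println!("{}", should_send_window_update(32_768, window)); // → true
}
```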
Why "typical" gRPC streaming may not show this
At lower rates (e.g., <100 Hz) and/or with larger messages, each poll tends to process only a small number of frames, so you don't hit the "many small frames coalesced into one read buffer" regime. This seems to show up most strongly with high-frequency, small DATA frames.
This pattern matches what the tonic project discussed/changed around streaming tail latency (frame batching + yielding).
Possible approaches
Option A (preferred to discuss): participate in Tokio cooperative scheduling (no new API)
Use Tokio's coop budget mechanism to occasionally yield from the loop:
```rust
fn poll2(&mut self, cx: &mut Context) -> Poll<Result<(), Error>> {
    loop {
        // Consume one unit of the task's coop budget. Once the budget is
        // exhausted this returns Pending and schedules a wake-up.
        let coop = ready!(tokio::task::coop::poll_proceed(cx));
        let frame = ready!(Pin::new(&mut self.codec).poll_next(cx)?);
        // A frame was decoded, so the consumed budget should stick.
        coop.made_progress();
        match self.inner.as_dyn().recv_frame(frame)? {
            ReceivedFrame::Continue => (),
            ReceivedFrame::Done => return Poll::Ready(Ok(())),
            // ...
        }
    }
}
```
poll_proceed is a public API in Tokio (tokio::task::coop), gated behind the rt feature (which h2 already enables). Note that it returns a guard that restores the consumed budget unless made_progress() is called, so budget is only charged when a frame was actually decoded. This is the idiomatic way to avoid tight-loop monopolization in Tokio-based crates, without adding any user-facing knobs.
If maintainers prefer to keep h2 more runtime-agnostic, a manual budget counter with cx.waker().wake_by_ref() + Poll::Pending after N iterations would achieve the same effect without the Tokio coupling — though the budget would be disconnected from Tokio's actual task-level coop budget.
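The runtime-agnostic variant could look roughly like this self-contained sketch. BurstDrain, the frame queue, and drain_polls are illustrative stand-ins for the connection task and codec, not h2 types.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll, Waker};

// Stand-in for a connection task whose codec has N frames already buffered.
struct BurstDrain {
    buffered: VecDeque<u32>, // decoded-but-unprocessed frames
    budget: usize,           // max frames processed per poll
    polls: usize,            // polls observed (for illustration)
}

impl Future for BurstDrain {
    type Output = usize; // number of polls it took to drain the burst

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<usize> {
        self.polls += 1;
        for _ in 0..self.budget {
            match self.buffered.pop_front() {
                Some(_frame) => { /* recv_frame(frame) would run here */ }
                None => return Poll::Ready(self.polls),
            }
        }
        // Manual budget spent: ask to be polled again, then yield so the
        // runtime can service other tasks between chunks of the burst.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

// Drive the future to completion with a no-op waker (std-only executor loop).
fn drain_polls(frames: u32, budget: usize) -> usize {
    let mut fut = BurstDrain {
        buffered: (0..frames).collect(),
        budget,
        polls: 0,
    };
    let mut cx = Context::from_waker(Waker::noop());
    let mut fut = Pin::new(&mut fut);
    loop {
        if let Poll::Ready(polls) = fut.as_mut().poll(&mut cx) {
            return polls;
        }
    }
}

fn main() {
    // 10 buffered frames with a budget of 4 per poll: drained over 3 polls,
    // yielding twice instead of monopolizing the task for the whole burst.
    println!("{}", drain_polls(10, 4)); // → 3
}
```

As the issue text notes, this counter is disconnected from Tokio's task-level coop budget: the task still yields every N frames, but Tokio cannot shrink or share that budget with the task's other poll sites.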
Option B: configurable limit (more control, more invasive)
A per-connection configuration like:
```rust
builder.max_data_frames_per_poll(1);
```
Caveat: today ReceivedFrame::Continue does not distinguish DATA vs. control frames, so implementing "count DATA frames" likely requires threading frame-type info out of recv_frame() or adding a dedicated variant, which is more invasive than Option A.
Trade-offs / concerns (why I'm asking before coding)
- Throughput vs latency: current behavior is throughput-optimal; yielding introduces scheduling overhead and increases total burst drain time.
- Fairness is best-effort: even if we return Pending, the task may be re-polled quickly if the runtime is otherwise idle.
- API surface: Option B adds knobs; Option A avoids that.
Related work / precedent
- tonic: streaming tail latency discussion and fixes (e.g. #1375 / #1423)
- Tokio: coop::poll_proceed is intended for tight polling loops (e.g. #4498 / #7405)
- h2 precedent for per-connection knobs: Implement server::Builder::max_send_buffer_size #577 (max_send_buffer_size)
Next steps
If maintainers think this is worth addressing, I'm happy to:
- start with a focused benchmark / regression test that reproduces the bimodal distribution, then
- send a PR for Option A (or whatever approach you prefer), along with throughput regression numbers.
I'm also very open to being told "this is by design" for h2's goals, or to alternative suggestions consistent with h2's architecture.