Conversation

@antoniovicente
Contributor

No description provided.

@antoniovicente antoniovicente requested a review from a team as a code owner November 22, 2025 03:00
@LPardue
Contributor

LPardue commented Nov 22, 2025

Looks broadly good. It got me wondering if, while we're touching this area, maybe we should do a little more refactoring to ensure the time metric is reported on stream teardown (whatever the cause)? I.e. report it on destruction if the time is Some, e.g.

                    if let Some(first) =
                        ctx.first_full_headers_flush_fail_time.take()
                    {
                        ctx.audit_stats.add_header_flush_duration(
                            Instant::now().duration_since(first),
                        );
                    }

If so, we'd want to avoid putting self.headers_pending_flush.fetch_sub(1, Ordering::SeqCst); in add_header_flush_duration().
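
For what it's worth, a minimal sketch of that split, keeping the counter update out of the duration recorder. The field and method names are taken from this thread, but the signatures are assumptions, not the crate's actual API:

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::time::{Duration, Instant};

    #[derive(Default)]
    struct AuditStats {
        headers_pending_flush: AtomicU64,
        header_flush_duration_us: AtomicU64,
    }

    impl AuditStats {
        /// Records only the duration, so teardown paths can call it without
        /// touching the pending counter.
        fn add_header_flush_duration(&self, d: Duration) {
            self.header_flush_duration_us
                .fetch_add(d.as_micros() as u64, Ordering::SeqCst);
        }

        /// Called only when the HEADERS frame actually flushes.
        fn headers_flushed(&self) {
            self.headers_pending_flush.fetch_sub(1, Ordering::SeqCst);
        }
    }

    struct StreamCtx {
        first_full_headers_flush_fail_time: Option<Instant>,
        audit_stats: AuditStats,
    }

    impl Drop for StreamCtx {
        fn drop(&mut self) {
            // Teardown path from the snippet above: report the blocked time
            // regardless of why the stream is going away.
            if let Some(first) = self.first_full_headers_flush_fail_time.take() {
                self.audit_stats
                    .add_header_flush_duration(Instant::now().duration_since(first));
            }
        }
    }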

/// exited.
///
/// This count includes streams that were reset before headers were flushed.
headers_pending_flush: AtomicU64,
Contributor

I think this can only ever be a maximum of 1. While blocked on HEADERS, no other frame will be attempted, and if the frame succeeds, the count is decremented before any subsequent HEADERS is attempted. It might be easier to convert it to a bool and let apps do the counting across all streams based on that.
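
A sketch of the bool variant being suggested, with assumed method names (the getter matches the one used in the test further down; everything else is illustrative). The flag is set when a HEADERS frame fails to flush and cleared once it finally goes out, so at teardown it answers "were HEADERS still blocked?" and applications can do the per-stream counting themselves:

    use std::sync::atomic::{AtomicBool, Ordering};

    #[derive(Default)]
    struct AuditStats {
        headers_pending_flush: AtomicBool,
    }

    impl AuditStats {
        /// Set to true when a full HEADERS flush fails, cleared when it succeeds.
        fn set_headers_pending_flush(&self, pending: bool) {
            self.headers_pending_flush.store(pending, Ordering::SeqCst);
        }

        /// True if a HEADERS frame was still blocked the last time we tried.
        fn headers_pending_flush(&self) -> bool {
            self.headers_pending_flush.load(Ordering::SeqCst)
        }
    }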

Contributor Author

Changed to bool.

@antoniovicente
Contributor Author

> Looks broadly good. It got me wondering if, while we're touching this area, maybe we should do a little more refactoring to ensure the time metric is reported on stream teardown (whatever the cause)? […]

Reporting a duration on other teardown reasons seems fine to me, but I think we'd also want to indicate whether the flush succeeded or the stream was cancelled. We do not want to mix together the success and failure cases.

Ultimately we want to detect cases where the connection has data to send but cannot make progress, whether because of flow control, congestion control, or something else.
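
For illustration, one way to keep the two cases apart when reporting; none of these names are from this PR, they are just a sketch:

    use std::time::Duration;

    #[derive(Debug)]
    enum HeadersFlushOutcome {
        /// HEADERS eventually went out after being blocked for this long.
        FlushedAfterDelay(Duration),
        /// The stream was torn down while HEADERS were still blocked.
        AbortedWhileBlocked(Duration),
    }

    #[derive(Debug, Default)]
    struct HeadersFlushStats {
        flushed_after_delay: Vec<Duration>,
        aborted_while_blocked: Vec<Duration>,
    }

    impl HeadersFlushStats {
        /// Success and failure durations land in separate buckets so they can
        /// be aggregated independently.
        fn record(&mut self, outcome: HeadersFlushOutcome) {
            match outcome {
                HeadersFlushOutcome::FlushedAfterDelay(d) => {
                    self.flushed_after_delay.push(d)
                }
                HeadersFlushOutcome::AbortedWhileBlocked(d) => {
                    self.aborted_while_blocked.push(d)
                }
            }
        }
    }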

@LPardue
Contributor

LPardue commented Nov 22, 2025

Right. The duration is cumulative across all HEADERS frames, while being actively blocked can only happen on an individual HEADERS frame, for example 103 Early Hints, 200 OK, or trailers. The audit stats already capture whether the stream was stopped/reset/FINed; apps can use that information in combination with the new metric introduced in this PR to figure out the terminal status.
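
A consumer-side sketch of that combination; StreamEnd and the flag argument stand in for whatever the application already reads out of the audit stats, and only the flag corresponds to the metric added in this PR:

    /// How the stream ended, according to the existing audit stats.
    #[derive(Debug, Clone, Copy)]
    enum StreamEnd {
        Fin,
        Reset,
        Stopped,
    }

    /// Returns how the stream ended if HEADERS were still blocked at teardown,
    /// i.e. the case the new metric is meant to surface; None means the
    /// HEADERS were flushed.
    fn blocked_headers_at_end(
        end: StreamEnd,
        headers_pending_flush: bool,
    ) -> Option<StreamEnd> {
        headers_pending_flush.then_some(end)
    }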

@antoniovicente
Contributor Author

I think there's a difference between the case where the headers were sent after some delay and the case where they were never sent at all, even if the client's action is to reset the stream before FIN is sent.

@LPardue
Contributor

LPardue commented Nov 22, 2025

I agree; that's why I think it would be better to switch the metric proposed here to a bool that indicates whether HEADERS were blocked or not at the time the stream is terminated.

@antoniovicente antoniovicente force-pushed the antonio/pending_headers branch 2 times, most recently from 4bbe4a7 to 07104cf Compare November 22, 2025 22:03
@antoniovicente
Contributor Author

> Looks broadly good. It got me wondering if, while we're touching this area, maybe we should do a little more refactoring to ensure the time metric is reported on stream teardown (whatever the cause)? […]

Attempted to record blocked time in cases where the stream terminates. I'll look into some more testing on Monday.

@antoniovicente antoniovicente force-pushed the antonio/pending_headers branch from b875595 to 84430ca Compare December 12, 2025 20:29
@antoniovicente antoniovicente force-pushed the antonio/pending_headers branch from 7aef8a6 to 34db457 Compare December 15, 2025 21:46
@antoniovicente antoniovicente enabled auto-merge (squash) December 15, 2025 21:47

let audit_stats = audit_stats.read().unwrap().clone().unwrap();
assert!(audit_stats.headers_pending_flush());
// Verify that headers_flush_duration is updated on connection drop.
Contributor Author

The assertion on the next line is flaky according to CI; it failed in the macOS and Windows builds.

Contributor Author

Adding a sleep doesn't seem to mitigate the flakiness. I need to debug further.

.set_recvd_stream_fin(StreamClosureKind::Implicit);

// Update stats if there were pending header sends on this stream.
stream.full_headers_flush_aborted();
Contributor Author

The fix for the flaky test could be to move the contents of this function to IoWorker::close_connection and have the test wait until metrics.connections_in_memory is 0 before making assertions.
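
A sketch of that wait, assuming a tokio test; connections_in_memory is the gauge mentioned above, passed in as a closure here because the real metrics handle isn't shown in this thread:

    use std::time::Duration;

    /// Polls the gauge until it reports 0, failing the test after 5 seconds.
    async fn wait_for_zero(mut gauge: impl FnMut() -> u64) {
        tokio::time::timeout(Duration::from_secs(5), async {
            while gauge() > 0 {
                tokio::time::sleep(Duration::from_millis(10)).await;
            }
        })
        .await
        .expect("connections_in_memory did not reach 0 in time");
    }

    // In the test, before the flaky duration assertion (hypothetical handle):
    //     wait_for_zero(|| metrics.connections_in_memory()).await;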

The request in the original test never got a response, so after 2 seconds the client would close the connection due to the configured timeout.
@antoniovicente antoniovicente force-pushed the antonio/pending_headers branch from 121207b to cee4429 Compare December 17, 2025 01:18