Skip to content

tcp: anchor ACK-derived time measurements to ingress time, not processing time#13394

Open
raggi wants to merge 1 commit into
google:masterfrom
raggi:ack-timing
Open

tcp: anchor ACK-derived time measurements to ingress time, not processing time#13394
raggi wants to merge 1 commit into
google:masterfrom
raggi:ack-timing

Conversation

@raggi

@raggi raggi commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Several TCP measurements that are causally anchored to the arrival of an inbound ACK were computed using the wall clock at the moment the segment was processed (ep.stack.Clock().NowMonotonic()) rather than the time the segment arrived at the stack (segment.rcvdTime, stamped in newIncomingSegment before the segment is queued).

These differ when an inbound ACK is delayed inside the stack before being processed. In particular, a busy sender holds the endpoint lock while Write synchronously flushes a window of segments (each paying real per-packet cost such as encryption and a tun write), so inbound ACKs sit unprocessed in the endpoint's segment queue. Measuring against processing time then inflates the result by that internal delay, corrupting:

  • RACK.RTT / RACK.minRTT and the reorder window (rackControl.update), which can trigger spurious loss detection and hold cwnd abnormally low on a path with little or no real loss;
  • the connection's SRTT/RTO via the timestamp and non-timestamp RTT samples in sender.handleRcvdSegment;
  • the initial RTO seeded at handshake completion (handshake.transitionToStateEstablishedLocked);
  • the receiver's RTT estimate used for receive-buffer autotuning (receiver.updateRTT), and lastRcvdAckTime;
  • CUBIC HyStart's ACK-train detection (now.Sub(LastAck)/now.Sub(RoundStart)), where a burst of ACKs drained together after an internal delay can look like a tight train and exit slow start prematurely.

This is the long-standing root cause behind reports of RACK performing poorly under load (spurious retransmissions, reduced congestion window). detectLoss already used the segment's rcvdTime; this change makes the rest of the ACK-anchored time measurements consistent with it, and threads the ACK ingress time through congestionControl.Update so CUBIC HyStart's ACK-train timing uses it too. updateRTO also gains a defensive guard against a negative RTT sample.

Anchoring RTT to ingress time is more in line with RFC 6298 ("measure the elapsed time between sending a data octet ... and receiving the acknowledgment") and RFC 8985, and matches Linux, which samples RTT at receive (softirq) time.

Note: CUBIC's congestion-event epoch timestamps (c.T in getCwnd and the loss/ recovery handlers) are intentionally left on processing time; they are event markers, not arrival-anchored estimators, and Linux uses processing time there as well.

Adds tcp_acktiming_test.go, which uses a manual clock and the StopWork hook to hold the endpoint lock while an ACK ages in the queue, asserting the RACK RTT and SRTT are not inflated by the internal delay; the timestamp and non-timestamp RTT paths are each exercised. Adds a CUBIC HyStart regression test that drives a "burst" of ACKs whose ingress times are spaced beyond ackDelta while the processing clock is frozen, asserting HyStart does not fire. Both fail without the corresponding fix and pass with it.

Fixes #9778
Updates tailscale/tailscale#9707

…sing time

Several TCP measurements that are causally anchored to the arrival of an
inbound ACK were computed using the wall clock at the moment the segment was
*processed* (ep.stack.Clock().NowMonotonic()) rather than the time the segment
*arrived* at the stack (segment.rcvdTime, stamped in newIncomingSegment before
the segment is queued).

These differ when an inbound ACK is delayed inside the stack before being
processed. In particular, a busy sender holds the endpoint lock while Write
synchronously flushes a window of segments (each paying real per-packet cost
such as encryption and a tun write), so inbound ACKs sit unprocessed in the
endpoint's segment queue. Measuring against processing time then inflates the
result by that internal delay, corrupting:

  - RACK.RTT / RACK.minRTT and the reorder window (rackControl.update), which
    can trigger spurious loss detection and hold cwnd abnormally low on a path
    with little or no real loss;
  - the connection's SRTT/RTO via the timestamp and non-timestamp RTT samples
    in sender.handleRcvdSegment;
  - the initial RTO seeded at handshake completion
    (handshake.transitionToStateEstablishedLocked);
  - the receiver's RTT estimate used for receive-buffer autotuning
    (receiver.updateRTT), and lastRcvdAckTime;
  - CUBIC HyStart's ACK-train detection (now.Sub(LastAck)/now.Sub(RoundStart)),
    where a burst of ACKs drained together after an internal delay can look
    like a tight train and exit slow start prematurely.

This is the long-standing root cause behind reports of RACK performing poorly
under load (spurious retransmissions, reduced congestion window). detectLoss
already used the segment's rcvdTime; this change makes the rest of the
ACK-anchored time measurements consistent with it, and threads the ACK ingress
time through congestionControl.Update so CUBIC HyStart's ACK-train timing uses
it too. updateRTO also gains a defensive guard against a negative RTT sample.

Anchoring RTT to ingress time is more in line with RFC 6298 ("measure the
elapsed time between sending a data octet ... and receiving the
acknowledgment") and RFC 8985, and matches Linux, which samples RTT at receive
(softirq) time.

Note: CUBIC's congestion-event epoch timestamps (c.T in getCwnd and the loss/
recovery handlers) are intentionally left on processing time; they are event
markers, not arrival-anchored estimators, and Linux uses processing time there
as well.

Adds tcp_acktiming_test.go, which uses a manual clock and the StopWork hook to
hold the endpoint lock while an ACK ages in the queue, asserting the RACK RTT
and SRTT are not inflated by the internal delay; the timestamp and
non-timestamp RTT paths are each exercised. Adds a CUBIC HyStart regression
test that drives a "burst" of ACKs whose ingress times are spaced beyond
ackDelta while the processing clock is frozen, asserting HyStart does not fire.
Both fail without the corresponding fix and pass with it.

Fixes google#9778
Updates tailscale/tailscale#9707
@raggi

raggi commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

@nybidari would you be able to help route a review for me? thank you!

@nybidari

Copy link
Copy Markdown
Contributor

Thanks for the PR @raggi! I will review the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

netstack: performance w/TCP-RACK on Windows

2 participants