tcp: anchor ACK-derived time measurements to ingress time, not processing time#13394
Open
raggi wants to merge 1 commit into
Open
tcp: anchor ACK-derived time measurements to ingress time, not processing time#13394raggi wants to merge 1 commit into
raggi wants to merge 1 commit into
Conversation
…sing time
Several TCP measurements that are causally anchored to the arrival of an
inbound ACK were computed using the wall clock at the moment the segment was
*processed* (ep.stack.Clock().NowMonotonic()) rather than the time the segment
*arrived* at the stack (segment.rcvdTime, stamped in newIncomingSegment before
the segment is queued).
These differ when an inbound ACK is delayed inside the stack before being
processed. In particular, a busy sender holds the endpoint lock while Write
synchronously flushes a window of segments (each paying real per-packet cost
such as encryption and a tun write), so inbound ACKs sit unprocessed in the
endpoint's segment queue. Measuring against processing time then inflates the
result by that internal delay, corrupting:
- RACK.RTT / RACK.minRTT and the reorder window (rackControl.update), which
can trigger spurious loss detection and hold cwnd abnormally low on a path
with little or no real loss;
- the connection's SRTT/RTO via the timestamp and non-timestamp RTT samples
in sender.handleRcvdSegment;
- the initial RTO seeded at handshake completion
(handshake.transitionToStateEstablishedLocked);
- the receiver's RTT estimate used for receive-buffer autotuning
(receiver.updateRTT), and lastRcvdAckTime;
- CUBIC HyStart's ACK-train detection (now.Sub(LastAck)/now.Sub(RoundStart)),
where a burst of ACKs drained together after an internal delay can look
like a tight train and exit slow start prematurely.
This is the long-standing root cause behind reports of RACK performing poorly
under load (spurious retransmissions, reduced congestion window). detectLoss
already used the segment's rcvdTime; this change makes the rest of the
ACK-anchored time measurements consistent with it, and threads the ACK ingress
time through congestionControl.Update so CUBIC HyStart's ACK-train timing uses
it too. updateRTO also gains a defensive guard against a negative RTT sample.
Anchoring RTT to ingress time is more in line with RFC 6298 ("measure the
elapsed time between sending a data octet ... and receiving the
acknowledgment") and RFC 8985, and matches Linux, which samples RTT at receive
(softirq) time.
Note: CUBIC's congestion-event epoch timestamps (c.T in getCwnd and the loss/
recovery handlers) are intentionally left on processing time; they are event
markers, not arrival-anchored estimators, and Linux uses processing time there
as well.
Adds tcp_acktiming_test.go, which uses a manual clock and the StopWork hook to
hold the endpoint lock while an ACK ages in the queue, asserting the RACK RTT
and SRTT are not inflated by the internal delay; the timestamp and
non-timestamp RTT paths are each exercised. Adds a CUBIC HyStart regression
test that drives a "burst" of ACKs whose ingress times are spaced beyond
ackDelta while the processing clock is frozen, asserting HyStart does not fire.
Both fail without the corresponding fix and pass with it.
Fixes google#9778
Updates tailscale/tailscale#9707
Contributor
Author
|
@nybidari would you be able to help route a review for me? thank you! |
Contributor
|
Thanks for the PR @raggi! I will review the code. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Several TCP measurements that are causally anchored to the arrival of an inbound ACK were computed using the wall clock at the moment the segment was processed (ep.stack.Clock().NowMonotonic()) rather than the time the segment arrived at the stack (segment.rcvdTime, stamped in newIncomingSegment before the segment is queued).
These differ when an inbound ACK is delayed inside the stack before being processed. In particular, a busy sender holds the endpoint lock while Write synchronously flushes a window of segments (each paying real per-packet cost such as encryption and a tun write), so inbound ACKs sit unprocessed in the endpoint's segment queue. Measuring against processing time then inflates the result by that internal delay, corrupting:
This is the long-standing root cause behind reports of RACK performing poorly under load (spurious retransmissions, reduced congestion window). detectLoss already used the segment's rcvdTime; this change makes the rest of the ACK-anchored time measurements consistent with it, and threads the ACK ingress time through congestionControl.Update so CUBIC HyStart's ACK-train timing uses it too. updateRTO also gains a defensive guard against a negative RTT sample.
Anchoring RTT to ingress time is more in line with RFC 6298 ("measure the elapsed time between sending a data octet ... and receiving the acknowledgment") and RFC 8985, and matches Linux, which samples RTT at receive (softirq) time.
Note: CUBIC's congestion-event epoch timestamps (c.T in getCwnd and the loss/ recovery handlers) are intentionally left on processing time; they are event markers, not arrival-anchored estimators, and Linux uses processing time there as well.
Adds tcp_acktiming_test.go, which uses a manual clock and the StopWork hook to hold the endpoint lock while an ACK ages in the queue, asserting the RACK RTT and SRTT are not inflated by the internal delay; the timestamp and non-timestamp RTT paths are each exercised. Adds a CUBIC HyStart regression test that drives a "burst" of ACKs whose ingress times are spaced beyond ackDelta while the processing clock is frozen, asserting HyStart does not fire. Both fail without the corresponding fix and pass with it.
Fixes #9778
Updates tailscale/tailscale#9707