tcp_benchmark --client throughput craters after a few seconds
kevinGC opened this issue · comments
Description
//test/benchmarks/tcp:tcp_benchmark can run with netstack as the iperf client, as the server, or as neither (native). As the server (and with host GRO/GSO enabled) I see throughput similar to native. As the client, I regularly see the same pattern: a few seconds of throughput at parity with Linux, followed by a complete cratering of throughput (see the attached throughput graph).
Looking at the logs, this looks to be triggered by netstack's inability to handle shrinking receive windows. In the pcap shrinkingWindowMini.pcap.zip (trimmed to only the relevant packets to keep the file size manageable), you can see two things. First, there are several instances of "normal" full receive buffers / zero windows. These explain the graph's flat shape: transfer is limited by rwnd, not cwnd.
Second, at the end of the capture is the sequence of packets that corresponds to the massive throughput drop in the graph. Two bits are notable here:
- There's an RTO-sized gap between the zero window ACK and the next packet (which is our zero window probe)
- The receive window shrinks. This can't be seen in the sliced-up pcap (because it lacks the handshake with the window size), but it is visible in the full log. Note that the [TCP Window Full] packet has sequence number 319710246 and length 1920, indicating that it fills the receive window. But the [TCP ZeroWindow] packet has the same sequence number, meaning that the 1920 sent bytes are now out of window. Netstack thus treats this as an RTO and drops cwnd all the way to 1 segment, causing the slowdown.
But per RFC 9293 3.8.6, netstack shouldn't consider those bytes relevant to an RTO:
A TCP receiver SHOULD NOT shrink the window, i.e., move the right
window edge to the left (SHLD-14). However, a sending TCP peer MUST
be robust against window shrinking, which may cause the "usable
window" (see Section 3.8.6.2.1) to become negative (MUST-34).
If this happens, the sender SHOULD NOT send new data (SHLD-15), but
SHOULD retransmit normally the old unacknowledged data between
SND.UNA and SND.UNA+SND.WND (SHLD-16). The sender MAY also
retransmit old data beyond SND.UNA+SND.WND (MAY-7), but SHOULD NOT
time out the connection if data beyond the right window edge is not
acknowledged (SHLD-17). If the window shrinks to zero, the TCP
implementation MUST probe it in the standard way (described below)
(MUST-35).
I.e., we should treat this case like a regular zero window. In terms of a fix, we could perhaps have RTO handling adjust cwnd only when the sent bytes are in-window.
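To illustrate the proposed check, here is a minimal sketch (not netstack's actual code; `shouldCollapseCwnd` and its parameters are hypothetical, and sequence-number wraparound is ignored for simplicity) of deciding whether a timeout should be treated as loss or as a zero-window probe:

```go
package main

import "fmt"

// shouldCollapseCwnd reports whether an RTO should be treated as a real
// loss event. Per RFC 9293 SHLD-17/MUST-35, if the unacked segment lies
// at or beyond the right window edge (sndUna+sndWnd), the timeout should
// be handled as a zero-window probe instead, leaving cwnd alone.
// Hypothetical sketch: ignores sequence-number wraparound.
func shouldCollapseCwnd(sndUna, sndWnd, segSeq uint32) bool {
	rightEdge := sndUna + sndWnd
	// Only treat the timeout as loss if the timed-out segment starts
	// inside the offered window.
	return segSeq < rightEdge
}

func main() {
	// The shrunk-window case from the capture: the [TCP ZeroWindow] ACK
	// moves the right edge back to seq 319710246, so the 1920 bytes
	// starting there are out of window: probe, don't collapse cwnd.
	fmt.Println(shouldCollapseCwnd(319710246, 0, 319710246))
	// Ordinary case: the segment is inside the window, so a timeout on
	// it is a genuine RTO.
	fmt.Println(shouldCollapseCwnd(319710246, 1920, 319710246))
}
```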
Steps to reproduce
This has some excessive sudos left over from when I was testing XDP:
$ bazel build //test/benchmarks/tcp:all && sudo cp bazel-bin/test/benchmarks/tcp/tcp_proxy_/tcp_proxy bazel-bin/test/benchmarks/tcp/tcp_proxy && sudo bazel-bin/test/benchmarks/tcp/tcp_benchmark --duration 20 --ideal --gso 65536 --no-user-ns --client
runsc version
N/A: This is netstack-specific
docker version (if using docker)
N/A: This is netstack-specific
repo state (if built from source)
release-20240415.0-18-g4810afc36
runsc debug logs (if available)
No response
So I think the issue arises from the fact that we piggyback on the RTO timer.
Per the RFC, the behavior of probing after 200 ms is correct:
https://datatracker.ietf.org/doc/html/rfc9293#section-3.8.6.1
But because we piggyback on the RTO timer, the probe timeout also runs the RTO logic here:
gvisor/pkg/tcpip/transport/tcp/snd.go, line 939 at de9adb5
The right solution might be to split the timers and disable the RTO timer while probing a zero window. I would check the Linux implementation.
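The timer split could look roughly like this: when the peer advertises a zero window, disarm the RTO timer and arm a separate persist (probe) timer, so a probe timeout never collapses cwnd. This is a hypothetical sketch under those assumptions; the `sender`, `timerKind`, and `rearm` names are illustrative, not netstack's:

```go
package main

import "fmt"

// timerKind identifies which retransmission-related timer is armed.
type timerKind int

const (
	timerNone  timerKind = iota
	timerRTO             // retransmission timeout: a firing collapses cwnd
	timerProbe           // zero-window (persist) probe: a firing never touches cwnd
)

// sender is a minimal stand-in for the TCP sender state.
type sender struct {
	peerWnd     uint32 // last advertised receive window
	unackedData bool   // whether SND.UNA < SND.NXT
	armed       timerKind
}

// rearm picks which timer should run, mirroring the proposed fix:
// the probe timer when the window is zero (RTO disabled while probing),
// the RTO timer otherwise while data is outstanding.
func (s *sender) rearm() {
	switch {
	case s.peerWnd == 0:
		s.armed = timerProbe
	case s.unackedData:
		s.armed = timerRTO
	default:
		s.armed = timerNone
	}
}

func main() {
	s := &sender{peerWnd: 0, unackedData: true}
	s.rearm()
	fmt.Println(s.armed == timerProbe) // zero window: probe timer, RTO off

	s.peerWnd = 65535
	s.rearm()
	fmt.Println(s.armed == timerRTO) // window opened: back to normal RTO
}
```

With the timers separated, a probe timeout caused by a shrunk window fires the persist path rather than the RTO path, so cwnd is left intact as SHLD-17 suggests.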