tcp_benchmark --client throughput craters after a few seconds
kevinGC opened this issue · comments
Description
//test/benchmarks/tcp:tcp_benchmark can run with netstack as the iperf client, as the server, or as neither (native). As the server (and with host GRO/GSO enabled) I see throughput similar to native. As the client, I regularly see the same pattern: a few seconds of throughput at parity with Linux, followed by a complete cratering of throughput (see the attached throughput graph).
Looking at the logs, this looks to be triggered by netstack's inability to handle shrinking receive windows. In the pcap shrinkingWindowMini.pcap.zip (trimmed to only the relevant packets to keep the file size manageable), you can see two things. First, there are several instances of "normal" full receive buffers / zero windows. These explain the graph's flat shape: transfer is limited by rwnd, not cwnd.
Second, at the end of the capture is the sequence of packets that corresponds to the massive throughput drop in the graph. Two bits are notable here:
- There's an RTO-sized gap between the zero window ACK and the next packet (which is our zero window probe)
- The receive window shrinks. This can't be seen in the sliced-up pcap (because it lacks the handshake with the window size), but it is visible in the full log. Note that the [TCP Window Full] packet has sequence number 319710246 and length 1920, indicating that it fills the receive window. But the [TCP ZeroWindow] packet has the same sequence number, meaning that the 1920 sent bytes are now out of window. Netstack thus treats this as an RTO and drops cwnd all the way to 1 segment, causing the slowdown.
But per RFC 9293 3.8.6, netstack shouldn't consider those bytes relevant to an RTO:
A TCP receiver SHOULD NOT shrink the window, i.e., move the right
window edge to the left (SHLD-14). However, a sending TCP peer MUST
be robust against window shrinking, which may cause the "usable
window" (see Section 3.8.6.2.1) to become negative (MUST-34).
If this happens, the sender SHOULD NOT send new data (SHLD-15), but
SHOULD retransmit normally the old unacknowledged data between
SND.UNA and SND.UNA+SND.WND (SHLD-16). The sender MAY also
retransmit old data beyond SND.UNA+SND.WND (MAY-7), but SHOULD NOT
time out the connection if data beyond the right window edge is not
acknowledged (SHLD-17). If the window shrinks to zero, the TCP
implementation MUST probe it in the standard way (described below)
(MUST-35).
I.e., we should treat this case like a regular zero window. In terms of a fix, we could perhaps have RTO handling adjust cwnd only when the sent bytes are in-window.
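To illustrate the proposed check, here is a minimal sketch (not netstack's actual code; `shouldCollapseCwnd` and its parameters are hypothetical, and sequence-number wraparound is ignored for simplicity) of deciding whether a timeout should be treated as loss or as a zero-window probe:

```go
package main

import "fmt"

// shouldCollapseCwnd reports whether an RTO should be treated as a real
// loss event. Per RFC 9293 SHLD-17/MUST-35, if the unacked segment lies
// at or beyond the right window edge (sndUna+sndWnd), the timeout should
// be handled as a zero-window probe instead, leaving cwnd alone.
// Hypothetical sketch: ignores sequence-number wraparound.
func shouldCollapseCwnd(sndUna, sndWnd, segSeq uint32) bool {
	rightEdge := sndUna + sndWnd
	// Only treat the timeout as loss if the timed-out segment starts
	// inside the offered window.
	return segSeq < rightEdge
}

func main() {
	// The shrunk-window case from the capture: the [TCP ZeroWindow] ACK
	// moves the right edge back to seq 319710246, so the 1920 bytes
	// starting there are out of window: probe, don't collapse cwnd.
	fmt.Println(shouldCollapseCwnd(319710246, 0, 319710246))
	// Ordinary case: the segment is inside the window, so a timeout on
	// it is a genuine RTO.
	fmt.Println(shouldCollapseCwnd(319710246, 1920, 319710246))
}
```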
Steps to reproduce
This has some excessive sudos left over from when I was testing XDP:
$ bazel build //test/benchmarks/tcp:all && sudo cp bazel-bin/test/benchmarks/tcp/tcp_proxy_/tcp_proxy bazel-bin/test/benchmarks/tcp/tcp_proxy && sudo bazel-bin/test/benchmarks/tcp/tcp_benchmark --duration 20 --ideal --gso 65536 --no-user-ns --client
runsc version
N/A: This is netstack-specific
docker version (if using docker)
N/A: This is netstack-specific
repo state (if built from source)
release-20240415.0-18-g4810afc36
runsc debug logs (if available)
No response
So I think the issue arises from the fact that we piggyback on the RTO timer.
Per the RFC, the behavior of probing after 200 ms is correct:
https://datatracker.ietf.org/doc/html/rfc9293#section-3.8.6.1
But because we piggyback on the RTO timer, the probe timeout also runs the RTO logic here:
gvisor/pkg/tcpip/transport/tcp/snd.go, line 939 at de9adb5
The right solution might be to split the timers and disable the RTO timer while probing a zero window. I would check the Linux implementation.
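The timer split could look roughly like this: when the peer advertises a zero window, disarm the RTO timer and arm a separate persist (probe) timer, so a probe timeout never collapses cwnd. This is a hypothetical sketch under those assumptions; the `sender`, `timerKind`, and `rearm` names are illustrative, not netstack's:

```go
package main

import "fmt"

// timerKind identifies which retransmission-related timer is armed.
type timerKind int

const (
	timerNone  timerKind = iota
	timerRTO             // retransmission timeout: a firing collapses cwnd
	timerProbe           // zero-window (persist) probe: a firing never touches cwnd
)

// sender is a minimal stand-in for the TCP sender state.
type sender struct {
	peerWnd     uint32 // last advertised receive window
	unackedData bool   // whether SND.UNA < SND.NXT
	armed       timerKind
}

// rearm picks which timer should run, mirroring the proposed fix:
// the probe timer when the window is zero (RTO disabled while probing),
// the RTO timer otherwise while data is outstanding.
func (s *sender) rearm() {
	switch {
	case s.peerWnd == 0:
		s.armed = timerProbe
	case s.unackedData:
		s.armed = timerRTO
	default:
		s.armed = timerNone
	}
}

func main() {
	s := &sender{peerWnd: 0, unackedData: true}
	s.rearm()
	fmt.Println(s.armed == timerProbe) // zero window: probe timer, RTO off

	s.peerWnd = 65535
	s.rearm()
	fmt.Println(s.armed == timerRTO) // window opened: back to normal RTO
}
```

With the timers separated, a probe timeout caused by a shrunk window fires the persist path rather than the RTO path, so cwnd is left intact as SHLD-17 suggests.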