network perf regression with RACK when shaping traffic
rcj4747 opened this issue · comments
Description
We have a deployment of gvisor where traffic egress throughput is limited using iptables rules on the host that drop out-bound packets until the container has a budget for transmission. The overall throughput in our testing dropped significantly and we have bisected this to gvisor PR #6334 (Enable RACK by default in netstack) which changed gvisor's built-in TCP stack to always enable "Recent Acknowledgement" (RACK). This change first appeared in release-20210726.
It's not clear whether the root cause lies in gvisor's RACK implementation or in our iptables rules, or why our form of egress throughput control triggers this behavior.
The RACK implementation depends on the transport having Selective Acknowledgment enabled; disabling tcp_sack (sysctl net.ipv4.tcp_sack=0) is an effective workaround, but it is a blunt tool. Preferably we could get to the root cause and address it, possibly with a config option to disable RACK in the interim so we don't lose the benefits of tcp_sack.
The associated iptables rules look like this:
# This limit applies per-pod to traffic egressing to the internet.
# Each pod starts with a 600Mbit burst (75MB). Once the burst is consumed traffic is
# limited to 200Mbit (190mbit/s or 23750kbyte/s base + 10mbit/s recharge of the
# burst). If no packets are seen for 60s, the burst buffer should be fully recharged
# and the entry is expired since this is equivalent to the uninitialized state.
iptables -A "${CHAIN_NAME}" -o eth+ \
--match hashlimit \
--hashlimit-mode srcip \
--hashlimit-above 23750kb/s \
--hashlimit-name public_egress_rate_limit \
--hashlimit-burst 75m \
--hashlimit-htable-expire 60000 \
--jump DROP
# This limit (125mbyte/s = 1Gbit) applies per-pod to the sum of all traffic
# egressing to the ingress gateway, other services, and the public internet.
# We don't offer a burst because this already approaches the performance
# limits (2gbit/s egress) of the host.
iptables -A "${CHAIN_NAME}" \
--match hashlimit \
--hashlimit-mode srcip \
--hashlimit-above 125mb/s \
--hashlimit-name internal_egress_rate_limit \
--jump DROP
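For reference, the numbers in the comments above are internally consistent. Here is a quick sanity check of the arithmetic (a hypothetical snippet using decimal SI units; netfilter's hashlimit may handle unit multipliers differently):

```python
# Sanity-check the rate/burst arithmetic from the iptables comments above.
# Decimal SI units are an assumption; hashlimit's internal unit handling
# may use binary multipliers.

base_rate_Bps = 23750 * 1000      # --hashlimit-above 23750kb/s, in bytes/s
burst_B = 75 * 1000 * 1000        # --hashlimit-burst 75m, in bytes
recharge_bps = 10 * 1000 * 1000   # the 10mbit/s recharge share of the 200Mbit cap

base_mbit = base_rate_Bps * 8 / 1e6                 # base rate in Mbit/s
burst_mbit = burst_B * 8 / 1e6                      # bucket size in Mbit
recharge_time_s = burst_mbit * 1e6 / recharge_bps   # time to refill the bucket

print(base_mbit, burst_mbit, recharge_time_s)  # 190.0 600.0 60.0
```

At 10 Mbit/s the 600 Mbit burst refills in exactly 60 s, which is why the 60000 ms --hashlimit-htable-expire is equivalent to the uninitialized state.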
The performance with release-20210720 is steady and averages 196Mb/s. With release-20210726 (RACK enabled) and our iptables rules in place on the host, the performance running iperf from a container to an external server is uneven and averages ~15-20Mb/s:
Connecting to host ..., port 5201
[ 5] local ... port 30984 connected to ... port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 37.0 MBytes 311 Mbits/sec 0 0.00 Bytes
[ 5] 1.00-2.00 sec 16.8 MBytes 141 Mbits/sec 0 0.00 Bytes
[ 5] 2.00-3.00 sec 263 KBytes 2.15 Mbits/sec 0 0.00 Bytes
[ 5] 3.00-4.00 sec 15.5 KBytes 127 Kbits/sec 0 0.00 Bytes
[ 5] 4.00-5.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 5.00-6.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 6.00-7.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 7.00-8.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 8.00-9.00 sec 12.6 KBytes 104 Kbits/sec 0 0.00 Bytes
[ 5] 9.00-10.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 10.00-11.00 sec 13.9 KBytes 113 Kbits/sec 0 0.00 Bytes
[ 5] 11.00-12.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 12.00-13.00 sec 9.98 KBytes 81.8 Kbits/sec 0 0.00 Bytes
[ 5] 13.00-14.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 14.00-15.00 sec 13.9 KBytes 113 Kbits/sec 0 0.00 Bytes
[ 5] 15.00-16.00 sec 11.1 KBytes 91.0 Kbits/sec 0 0.00 Bytes
[ 5] 16.00-17.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 17.00-18.00 sec 12.6 KBytes 103 Kbits/sec 0 0.00 Bytes
[ 5] 18.00-19.00 sec 11.1 KBytes 90.9 Kbits/sec 0 0.00 Bytes
[ 5] 19.00-20.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 20.00-21.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 21.00-22.00 sec 32.8 MBytes 275 Mbits/sec 0 0.00 Bytes
[ 5] 22.00-23.00 sec 4.16 KBytes 34.1 Kbits/sec 0 0.00 Bytes
[ 5] 23.00-24.00 sec 12.5 KBytes 102 Kbits/sec 0 0.00 Bytes
[ 5] 24.00-25.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 25.00-26.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 26.00-27.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 27.00-28.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 28.00-29.00 sec 14.3 KBytes 117 Kbits/sec 0 0.00 Bytes
[ 5] 29.00-30.00 sec 32.3 MBytes 271 Mbits/sec 0 0.00 Bytes
[ 5] 30.00-31.00 sec 5.55 KBytes 45.4 Kbits/sec 0 0.00 Bytes
[ 5] 31.00-32.00 sec 16.9 KBytes 138 Kbits/sec 0 0.00 Bytes
[ 5] 32.00-33.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 33.00-34.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 34.00-35.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 35.00-36.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 36.00-37.00 sec 18.0 KBytes 148 Kbits/sec 0 0.00 Bytes
[ 5] 37.00-38.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 38.00-39.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 39.00-40.00 sec 103 KBytes 844 Kbits/sec 0 0.00 Bytes
[ 5] 40.00-41.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 41.00-42.00 sec 11.1 KBytes 90.8 Kbits/sec 0 0.00 Bytes
[ 5] 42.00-43.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 43.00-44.00 sec 19.8 KBytes 162 Kbits/sec 0 0.00 Bytes
[ 5] 44.00-45.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 45.00-46.00 sec 11.1 KBytes 90.9 Kbits/sec 0 0.00 Bytes
[ 5] 46.00-47.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 47.00-48.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 48.00-49.00 sec 12.5 KBytes 102 Kbits/sec 0 0.00 Bytes
[ 5] 49.00-50.00 sec 12.5 KBytes 102 Kbits/sec 0 0.00 Bytes
[ 5] 50.00-51.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 51.00-52.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 52.00-53.00 sec 33.7 MBytes 283 Mbits/sec 0 0.00 Bytes
[ 5] 53.00-54.00 sec 93.3 KBytes 765 Kbits/sec 0 0.00 Bytes
[ 5] 54.00-55.00 sec 35.6 MBytes 299 Mbits/sec 0 0.00 Bytes
[ 5] 55.00-56.00 sec 762 KBytes 6.24 Mbits/sec 0 0.00 Bytes
[ 5] 56.00-57.00 sec 296 KBytes 2.43 Mbits/sec 0 0.00 Bytes
[ 5] 57.00-58.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 58.00-59.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 59.00-60.00 sec 12.9 KBytes 106 Kbits/sec 0 0.00 Bytes
[ 5] 60.00-61.00 sec 31.8 MBytes 267 Mbits/sec 0 0.00 Bytes
[ 5] 61.00-62.00 sec 825 KBytes 6.76 Mbits/sec 0 0.00 Bytes
[ 5] 62.00-63.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 63.00-64.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 64.00-65.00 sec 11.5 KBytes 94.3 Kbits/sec 0 0.00 Bytes
[ 5] 65.00-66.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 66.00-67.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 67.00-68.00 sec 31.4 MBytes 264 Mbits/sec 0 0.00 Bytes
[ 5] 68.00-69.00 sec 592 KBytes 4.85 Mbits/sec 0 0.00 Bytes
[ 5] 69.00-70.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 70.00-71.00 sec 22.6 KBytes 185 Kbits/sec 0 0.00 Bytes
[ 5] 71.00-72.00 sec 18.0 KBytes 148 Kbits/sec 0 0.00 Bytes
[ 5] 72.00-73.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 73.00-74.00 sec 11.1 KBytes 90.9 Kbits/sec 0 0.00 Bytes
[ 5] 74.00-75.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 75.00-76.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 76.00-77.00 sec 11.1 KBytes 90.9 Kbits/sec 0 0.00 Bytes
[ 5] 77.00-78.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 78.00-79.00 sec 12.5 KBytes 102 Kbits/sec 0 0.00 Bytes
[ 5] 79.00-80.00 sec 12.5 KBytes 102 Kbits/sec 0 0.00 Bytes
[ 5] 80.00-81.00 sec 12.9 KBytes 106 Kbits/sec 0 0.00 Bytes
[ 5] 81.00-82.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 82.00-83.00 sec 11.1 KBytes 91.0 Kbits/sec 0 0.00 Bytes
[ 5] 83.00-84.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 84.00-85.00 sec 17.6 MBytes 148 Mbits/sec 0 0.00 Bytes
[ 5] 85.00-86.00 sec 15.6 MBytes 131 Mbits/sec 0 0.00 Bytes
[ 5] 86.00-87.00 sec 295 KBytes 2.42 Mbits/sec 0 0.00 Bytes
[ 5] 87.00-88.00 sec 13.9 KBytes 114 Kbits/sec 0 0.00 Bytes
[ 5] 88.00-89.00 sec 24.7 MBytes 207 Mbits/sec 0 0.00 Bytes
[ 5] 89.00-90.00 sec 5.38 MBytes 45.1 Mbits/sec 0 0.00 Bytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-90.00 sec 319 MBytes 29.7 Mbits/sec 0 sender
[ 5] 0.00-90.00 sec 316 MBytes 29.4 Mbits/sec receiver
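The summary line is consistent with iperf's units (transfer in binary MBytes, bitrate in decimal Mbits/sec); a quick cross-check:

```python
# Cross-check the iperf sender summary: 319 MBytes over 90 s.
# Assumes iperf3's convention of binary MBytes and decimal Mbits/sec.
transfer_bytes = 319 * 1024 * 1024
bitrate_mbit = transfer_bytes * 8 / 90.0 / 1e6
print(round(bitrate_mbit, 1))  # 29.7, matching the reported sender average
```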
Steps to reproduce
Add iptables rules based on those in the description to the host where the iperf test is run. Observe the throughput with RACK enabled.
runsc version
release-20210726 was the first version impacted
docker version (if using docker)
No response
uname
4.19.0-17 kernel from Debian 10 Buster
kubectl (if using Kubernetes)
1.21.11
repo state (if built from source)
Not built from source
runsc debug logs (if available)
None available
Here's a visualization of our iperf data before and after the change to enable RACK (20210510 in red, 20211005 in blue).
At the time I created this chart we hadn't bisected to the RACK commit in 20210726, but the results for the two releases closest to the change, 20210702 & 20210726, look the same as these two.
@rcj4747 thanks for the detailed report. I will take a look; most likely we got something wrong in our RACK implementation. Would it be possible to attach a pcap of the traffic regression? It would help us understand what's going on a bit better.
I'll try to get a pcap shortly. Thanks for the quick reply.
Here is a pcap (rack.pcap.gz) for a 90 second iperf run and its output (rack.iperf.txt).
Thanks I will take a look and see if I can figure out what's going on.
Okay, I looked at the pcap and I think something is clearly wrong with our RACK implementation. E.g., after the peer advertised a SACK block we end up retransmitting, but with a long delay (most likely the default probe duration) in between.
There are more instances later in the pcap, but here's a small one (some comments inlined). Basically, RACK recovery is not working as expected, and we are taking RTOs in the middle of the recovery episode, causing the transfer to stall almost entirely. This looks like RACK is pretty much broken. I am surprised we didn't see this before.
No. Time Source Destination Protocol Length Info
1705 0.160355 104.131.44.237 10.244.0.105 TCP 66 5201 → 65345 [ACK] Seq=1 Ack=26822758 Win=3143296 Len=0 TSval=262033161 TSecr=3065010041
1706 0.160355 10.244.0.105 104.131.44.237 TCP 10006 65345 → 5201 [ACK] Seq=26929690 Ack=1 Win=524288 Len=9940 TSval=3065010041 TSecr=262033161
1707 0.160418 10.244.0.105 104.131.44.237 TCP 10006 65345 → 5201 [ACK] Seq=26939630 Ack=1 Win=524288 Len=9940 TSval=3065010041 TSecr=262033161
1708 0.163067 10.244.0.105 104.131.44.237 TCP 1486 [TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26939630 Ack=1 Win=524288 Len=1420 TSval=3065010044 TSecr=262033161
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is probably the RACK TLP timer firing, since the last packet hadn't been ACKed for 30ms.
1709 0.163287 104.131.44.237 10.244.0.105 TCP 78 [TCP Dup ACK 1705#1] 5201 → 65345 [ACK] Seq=1 Ack=26822758 Win=3143296 Len=0 TSval=262033164 TSecr=3065010041 SLE=26939630 SRE=26941050
^^^ We see the peer sending a SACK block acking the receipt of the retransmitted segment
1710 0.363789 10.244.0.105 104.131.44.237 TCP 1486 [TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26822758 Ack=1 Win=524288 Len=1420 TSval=3065010244 TSecr=262033164
But it takes us almost 200ms (0.163 to 0.364) before we retransmit the data whose sequence number was acked in the Dup ACK above.
1711 0.364456 104.131.44.237 10.244.0.105 TCP 78 5201 → 65345 [ACK] Seq=1 Ack=26824178 Win=3141888 Len=0 TSval=262033365 TSecr=3065010244 SLE=26939630 SRE=26941050
1712 0.364623 10.244.0.105 104.131.44.237 TCP 1486 [TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26824178 Ack=1 Win=524288 Len=1420 TSval=3065010245 TSecr=262033365
1713 0.364836 104.131.44.237 10.244.0.105 TCP 78 5201 → 65345 [ACK] Seq=1 Ack=26825598 Win=3140864 Len=0 TSval=262033366 TSecr=3065010245 SLE=26939630 SRE=26941050
^^^^ Here we see an ACK from the peer, but then we don't transmit anything for 200ms, which looks like an RTO firing on outstanding data. Ideally we should have retransmitted the next packet at this point rather than waiting for an RTO.
1714 0.565358 10.244.0.105 104.131.44.237 TCP 1486 [TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26825598 Ack=1 Win=524288 Len=1420 TSval=3065010446 TSecr=262033366
1715 0.566006 104.131.44.237 10.244.0.105 TCP 78 5201 → 65345 [ACK] Seq=1 Ack=26827018 Win=3140864 Len=0 TSval=262033567 TSecr=3065010446 SLE=26939630 SRE=26941050
1716 0.566161 10.244.0.105 104.131.44.237 TCP 1486 [TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26827018 Ack=1 Win=524288 Len=1420 TSval=3065010447 TSecr=262033567
1717 0.566355 104.131.44.237 10.244.0.105 TCP 78 5201 → 65345 [ACK] Seq=1 Ack=26828438 Win=3140864 Len=0 TSval=262033567 TSecr=3065010447 SLE=26939630 SRE=26941050
^^^ We see the same pattern repeat where we are now again waiting for an RTO
1718 0.766919 10.244.0.105 104.131.44.237 TCP 1486 [TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26828438 Ack=1 Win=524288 Len=1420 TSval=3065010648 TSecr=262033567
1719 0.767589 104.131.44.237 10.244.0.105 TCP 78 5201 → 65345 [ACK] Seq=1 Ack=26829858 Win=3140864 Len=0 TSval=262033768 TSecr=3065010648 SLE=26939630 SRE=26941050
1720 0.767786 10.244.0.105 104.131.44.237 TCP 1486 [TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26829858 Ack=1 Win=524288 Len=1420 TSval=3065010649 TSecr=262033768
1721 0.768025 104.131.44.237 10.244.0.105 TCP 78 5201 → 65345 [ACK] Seq=1 Ack=26831278 Win=3140864 Len=0 TSval=262033769 TSecr=3065010649 SLE=26939630 SRE=26941050
^^^ Ditto here.
1722 0.968751 10.244.0.105 104.131.44.237 TCP 1486 [TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26831278 Ack=1 Win=524288 Len=1420 TSval=3065010849 TSecr=262033769
1723 0.969404 104.131.44.237 10.244.0.105 TCP 78 5201 → 65345 [ACK] Seq=1 Ack=26832698 Win=3140864 Len=0 TSval=262033970 TSecr=3065010849 SLE=26939630 SRE=26941050
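For context, the behaviour the inline comments describe as missing — promptly marking the SACK hole lost once a later-sent segment is known to be delivered, instead of pacing one retransmission per RTO — can be sketched roughly as follows. This is a simplified illustration after RFC 8985, not gVisor's actual code; the segment records and the 10 ms reordering window are made-up values for the example.

```python
# Simplified RACK-style loss marking (after RFC 8985); NOT gVisor's code.
# A segment is deemed lost when a segment transmitted later than it has
# been delivered (SACKed/ACKed) and the reordering window has elapsed.

from dataclasses import dataclass

@dataclass
class Seg:
    seq: int
    xmit_time: float   # when this segment was (last) sent
    sacked: bool = False

def rack_detect_losses(segs, newest_delivered_xmit_time, now, reo_wnd):
    """Return seqs that should be retransmitted immediately, not on RTO."""
    lost = []
    for s in segs:
        if s.sacked:
            continue
        # Lost if sent sufficiently before the most recently delivered
        # segment, and the reordering window has already passed.
        sent_before_delivered = s.xmit_time + reo_wnd <= newest_delivered_xmit_time
        if sent_before_delivered and s.xmit_time + reo_wnd <= now:
            lost.append(s.seq)
    return lost

# Mirroring the trace: segments 26822758/26824178 were sent before the TLP
# probe at seq 26939630 that the peer SACKed, so both should be marked lost
# at once (example times are illustrative, not taken from the pcap).
segs = [Seg(26822758, xmit_time=0.100),
        Seg(26824178, xmit_time=0.100),
        Seg(26939630, xmit_time=0.163, sacked=True)]
print(rack_detect_losses(segs, newest_delivered_xmit_time=0.163,
                         now=0.200, reo_wnd=0.010))
# -> [26822758, 26824178]
```

In the trace above, however, netstack retransmits one segment per ACK and then waits ~200 ms, which matches the RTO-driven stall the comments point out.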
Would it make sense to add a network iperf test that has iptables rules similar to those in the issue description to ensure you can recreate the issue, verify changes, and catch regressions?
Actually, let me ask for one more thing. It's possible that netstack is behaving mostly as expected: if the iptables rules were applied after the capture point of the pcap, then some of those inbound packets might have been dropped before they hit netstack. Could you use tcpdump with libpcap 1.9 or below inside gVisor to get a pcap, so we can see exactly which packets gVisor is seeing?
As for adding a regression test: sure, once we confirm and understand the issue. A pcap from only one side can sometimes be misleading.
NOTE: libpcap 1.10+ removed support for recvmsg with AF_PACKET and exclusively uses AF_PACKET_RING, which is not yet implemented in gVisor.
I'll need to figure out permissions. I have libpcap-1.9.1-r2 and tcpdump-4.9.3-r2 installed, but I can't run tcpdump as root on any interface:
# tcpdump -i eth0 -s 90 -w norack.pcap port 5201
tcpdump: eth0: You don't have permission to capture on that device
(socket: Operation not permitted)
I suspect we're just missing the --net-raw flag.
You also need to pass -p to disable promiscuous mode as we don't support that in gvisor.
rack-port-5201.pcap.gz was created from inside the container as requested; it has the same characteristics as the prior runs.
Thanks for the pcap. I wonder if you see the same behaviour when you use runc to run the container. The throughput graph seems to match your limit before it craters to near zero for a few seconds and then spikes again. It probably works better without SACK because NewReno is more aggressive and recovers faster. Could you run your workload with the same hashlimits but with runc instead of runsc and see what you get?
Running with runc does not recreate the condition we see here. With runc the network throughput is a steady ~195Mb/s (the target of our iptables rules). Testing runc is something I did before filing this issue, but I failed to give a full rundown of my efforts. Here are some of my other notes:
- Minimized other factors: altered the test environment to ensure dedicated CPUs were used for the iperf server and the iperf client, though no performance changes were observed since the iptables rules keep us well below the interface capabilities. All tests were performed in a dedicated cluster to minimize issues with noisy neighbors. Versions of all other components of the environment were kept constant throughout testing.
- Verified we weren't CPU constrained via CPU utilization and CPU throttling stats (cadvisor's container_cpu_cfs_throttled_*). CPU usage and throttling were much lower when we had performance issues, so that does not appear to be a factor.
- Ensured we could saturate the interface by removing the iptables rules and testing with gVisor versions before and after RACK was enabled by default (20210720 & 20210726). Both versions showed sustained throughput reaching the peak for the interface.
- Testing was performed with SACK enabled until I discovered that RACK might be the issue. I turned off SACK as a means of disabling RACK for gVisor versions after 20210726 and as a workaround for the issue. With SACK enabled we have nominal performance with runc and with gvisor release-20210720 and prior. We also have nominal performance with the iptables rules removed and SACK enabled on gvisor release-20210726 and later.
Thanks for confirming. I will work with Nayana to see what is going haywire with our RACK implementation. I will keep you posted once we find something.
@rcj4747 Nayana and I spent some time trying to repro this, but I have been unable to reproduce the drop in throughput. I tried a few variations, including an internal benchmarking tool we have, but I am not seeing the same behaviour you are seeing.
Is it possible to set up a test environment that you can give me/nybidari@ access to, where we can debug and figure out what's going on?
I tried rate limiting the streams with iptables and even tc, but I see a steady 200mbits/s. I even tried with varying latencies and it stays steady.
@hbhasker Thank you for trying to reproduce this. I will see if I can recreate in an accessible environment and provide access.