google / gvisor

Application Kernel for Containers

Home Page: https://gvisor.dev

network perf regression with RACK when shaping traffic

rcj4747 opened this issue

Description

We have a deployment of gvisor where egress traffic throughput is limited using iptables rules on the host that drop outbound packets until the container again has budget for transmission. The overall throughput in our testing dropped significantly, and we bisected this to gvisor PR #6334 (Enable RACK by default in netstack), which changed gvisor's built-in TCP stack to always enable "Recent Acknowledgement" (RACK). This change first appeared in release-20210726.

The root cause, whether in gvisor's RACK implementation or in our iptables rules, is not yet clear, so we can't fully explain what is happening with our form of egress throughput control.

RACK depends on Selective Acknowledgment (SACK) being enabled on the connection; disabling tcp_sack (sysctl net.ipv4.tcp_sack=0) is an effective workaround, but it is a blunt tool. Preferably we could get to the root cause and address it, possibly with a config option to disable RACK in the interim so we don't lose the benefits of tcp_sack.
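
For reference, the workaround looks roughly like this (a sketch only; where the sysctl has to be applied, e.g. inside the sandbox or via the pod spec, depends on the deployment):

# Disable SACK so netstack never enters RACK-based recovery.
sysctl -w net.ipv4.tcp_sack=0
# Confirm the setting took effect.
sysctl net.ipv4.tcp_sack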

The associated iptables rules look like this:

# This limit applies per-pod to traffic egressing to the internet.
# Each pod starts with a 600Mbit burst (75MB). Once the burst is consumed traffic is
# limited to 200Mbit (190mbit/s or 23750kbyte/s base + 10mbit/s recharge of the
# burst). If no packets are seen for 60s, the burst buffer should be fully recharged
# and the entry is expired since this is equivalent to the uninitialized state.
iptables -A "${CHAIN_NAME}" -o eth+ \
  --match hashlimit \
  --hashlimit-mode srcip \
  --hashlimit-above 23750kb/s \
  --hashlimit-name public_egress_rate_limit \
  --hashlimit-burst 75m \
  --hashlimit-htable-expire 60000 \
  --jump DROP
# This limit (125mbyte/s = 1Gbit/s) applies per-pod to the sum of all traffic
# egressing to the ingress gateway, other services, and the public internet.
# We don't offer a burst because this already approaches the performance
# limits (2gbit/s egress) of the host.
iptables -A "${CHAIN_NAME}" \
  --match hashlimit \
  --hashlimit-mode srcip \
  --hashlimit-above 125mb/s \
  --hashlimit-name internal_egress_rate_limit \
  --jump DROP
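
For anyone cross-checking the comments in the rules above, the unit conversions work out as follows (rough arithmetic only):

# 23750 kbyte/s x 8 bits/byte = 190,000 kbit/s = 190 Mbit/s sustained rate
# 75 Mbyte burst x 8          = 600 Mbit burst
# 125 Mbyte/s x 8             = 1,000 Mbit/s = 1 Gbit/s internal limit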

With release-20210720 the performance is steady and averages 196Mb/s. With release-20210726 (RACK enabled) and our iptables rules in place on the host, iperf from a container to an external server is uneven and averages ~15-20Mb/s:

Connecting to host ..., port 5201
[  5] local ... port 30984 connected to ... port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  37.0 MBytes   311 Mbits/sec    0   0.00 Bytes       
[  5]   1.00-2.00   sec  16.8 MBytes   141 Mbits/sec    0   0.00 Bytes       
[  5]   2.00-3.00   sec   263 KBytes  2.15 Mbits/sec    0   0.00 Bytes       
[  5]   3.00-4.00   sec  15.5 KBytes   127 Kbits/sec    0   0.00 Bytes       
[  5]   4.00-5.00   sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]   5.00-6.00   sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]   6.00-7.00   sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]   7.00-8.00   sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]   8.00-9.00   sec  12.6 KBytes   104 Kbits/sec    0   0.00 Bytes       
[  5]   9.00-10.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  10.00-11.00  sec  13.9 KBytes   113 Kbits/sec    0   0.00 Bytes       
[  5]  11.00-12.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  12.00-13.00  sec  9.98 KBytes  81.8 Kbits/sec    0   0.00 Bytes       
[  5]  13.00-14.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  14.00-15.00  sec  13.9 KBytes   113 Kbits/sec    0   0.00 Bytes       
[  5]  15.00-16.00  sec  11.1 KBytes  91.0 Kbits/sec    0   0.00 Bytes       
[  5]  16.00-17.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  17.00-18.00  sec  12.6 KBytes   103 Kbits/sec    0   0.00 Bytes       
[  5]  18.00-19.00  sec  11.1 KBytes  90.9 Kbits/sec    0   0.00 Bytes       
[  5]  19.00-20.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  20.00-21.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  21.00-22.00  sec  32.8 MBytes   275 Mbits/sec    0   0.00 Bytes       
[  5]  22.00-23.00  sec  4.16 KBytes  34.1 Kbits/sec    0   0.00 Bytes       
[  5]  23.00-24.00  sec  12.5 KBytes   102 Kbits/sec    0   0.00 Bytes       
[  5]  24.00-25.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  25.00-26.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  26.00-27.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  27.00-28.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  28.00-29.00  sec  14.3 KBytes   117 Kbits/sec    0   0.00 Bytes       
[  5]  29.00-30.00  sec  32.3 MBytes   271 Mbits/sec    0   0.00 Bytes       
[  5]  30.00-31.00  sec  5.55 KBytes  45.4 Kbits/sec    0   0.00 Bytes       
[  5]  31.00-32.00  sec  16.9 KBytes   138 Kbits/sec    0   0.00 Bytes       
[  5]  32.00-33.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  33.00-34.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  34.00-35.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  35.00-36.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  36.00-37.00  sec  18.0 KBytes   148 Kbits/sec    0   0.00 Bytes       
[  5]  37.00-38.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  38.00-39.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  39.00-40.00  sec   103 KBytes   844 Kbits/sec    0   0.00 Bytes       
[  5]  40.00-41.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  41.00-42.00  sec  11.1 KBytes  90.8 Kbits/sec    0   0.00 Bytes       
[  5]  42.00-43.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  43.00-44.00  sec  19.8 KBytes   162 Kbits/sec    0   0.00 Bytes       
[  5]  44.00-45.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  45.00-46.00  sec  11.1 KBytes  90.9 Kbits/sec    0   0.00 Bytes       
[  5]  46.00-47.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  47.00-48.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  48.00-49.00  sec  12.5 KBytes   102 Kbits/sec    0   0.00 Bytes       
[  5]  49.00-50.00  sec  12.5 KBytes   102 Kbits/sec    0   0.00 Bytes       
[  5]  50.00-51.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  51.00-52.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  52.00-53.00  sec  33.7 MBytes   283 Mbits/sec    0   0.00 Bytes       
[  5]  53.00-54.00  sec  93.3 KBytes   765 Kbits/sec    0   0.00 Bytes       
[  5]  54.00-55.00  sec  35.6 MBytes   299 Mbits/sec    0   0.00 Bytes       
[  5]  55.00-56.00  sec   762 KBytes  6.24 Mbits/sec    0   0.00 Bytes       
[  5]  56.00-57.00  sec   296 KBytes  2.43 Mbits/sec    0   0.00 Bytes       
[  5]  57.00-58.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  58.00-59.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  59.00-60.00  sec  12.9 KBytes   106 Kbits/sec    0   0.00 Bytes       
[  5]  60.00-61.00  sec  31.8 MBytes   267 Mbits/sec    0   0.00 Bytes       
[  5]  61.00-62.00  sec   825 KBytes  6.76 Mbits/sec    0   0.00 Bytes       
[  5]  62.00-63.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  63.00-64.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  64.00-65.00  sec  11.5 KBytes  94.3 Kbits/sec    0   0.00 Bytes       
[  5]  65.00-66.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  66.00-67.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  67.00-68.00  sec  31.4 MBytes   264 Mbits/sec    0   0.00 Bytes       
[  5]  68.00-69.00  sec   592 KBytes  4.85 Mbits/sec    0   0.00 Bytes       
[  5]  69.00-70.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  70.00-71.00  sec  22.6 KBytes   185 Kbits/sec    0   0.00 Bytes       
[  5]  71.00-72.00  sec  18.0 KBytes   148 Kbits/sec    0   0.00 Bytes       
[  5]  72.00-73.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  73.00-74.00  sec  11.1 KBytes  90.9 Kbits/sec    0   0.00 Bytes       
[  5]  74.00-75.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  75.00-76.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  76.00-77.00  sec  11.1 KBytes  90.9 Kbits/sec    0   0.00 Bytes       
[  5]  77.00-78.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  78.00-79.00  sec  12.5 KBytes   102 Kbits/sec    0   0.00 Bytes       
[  5]  79.00-80.00  sec  12.5 KBytes   102 Kbits/sec    0   0.00 Bytes       
[  5]  80.00-81.00  sec  12.9 KBytes   106 Kbits/sec    0   0.00 Bytes       
[  5]  81.00-82.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  82.00-83.00  sec  11.1 KBytes  91.0 Kbits/sec    0   0.00 Bytes       
[  5]  83.00-84.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  84.00-85.00  sec  17.6 MBytes   148 Mbits/sec    0   0.00 Bytes       
[  5]  85.00-86.00  sec  15.6 MBytes   131 Mbits/sec    0   0.00 Bytes       
[  5]  86.00-87.00  sec   295 KBytes  2.42 Mbits/sec    0   0.00 Bytes       
[  5]  87.00-88.00  sec  13.9 KBytes   114 Kbits/sec    0   0.00 Bytes       
[  5]  88.00-89.00  sec  24.7 MBytes   207 Mbits/sec    0   0.00 Bytes       
[  5]  89.00-90.00  sec  5.38 MBytes  45.1 Mbits/sec    0   0.00 Bytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-90.00  sec   319 MBytes  29.7 Mbits/sec    0             sender
[  5]   0.00-90.00  sec   316 MBytes  29.4 Mbits/sec                  receiver

Steps to reproduce

Add iptables rules based on those in the description to the host where the iperf test is run, then observe the throughput with RACK enabled.
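
A minimal sketch of the reproduction, assuming an external iperf3 server (SERVER is a placeholder) and the hashlimit DROP rule from the description installed on the host chain that handles the container's egress:

# From inside a runsc container; port 5201 and the 90 second duration match
# the run shown above.
iperf3 -c "${SERVER}" -p 5201 -t 90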

runsc version

release-20210726 was the first version impacted

docker version (if using docker)

No response

uname

4.19.0-17 kernel from Debian 10 Buster

kubectl (if using Kubernetes)

1.21.11

repo state (if built from source)

Not built from source

runsc debug logs (if available)

None available

Here's a visualization of our iperf data before and after the change to enable RACK. (20210510 in red, 20211005 in blue)
[chart: iperf throughput over time, release 20210510 in red vs 20211005 in blue]
At the time I created this chart we hadn't yet bisected to the RACK commit in 20210726, but the results for the two releases closest to the change, 20210702 & 20210726, look the same as these two.

@rcj4747 thanks for the detailed report. I will take a look; most likely we got something wrong in our RACK implementation. Would it be possible to attach a pcap of the traffic regression? It would help us understand what's going on a bit better.

I'll try to get a pcap shortly. Thanks for the quick reply.

Here is a pcap (rack.pcap.gz) for a 90-second iperf run and its output (rack.iperf.txt).

Thanks I will take a look and see if I can figure out what's going on.

Okay, I looked at the pcap and I think there is something clearly wrong with our RACK implementation, e.g. after the peer advertised a SACK block we end up retransmitting, but with a long delay (most likely the default probe duration) in between.
There are more instances later in the pcap, but here's a small one (some comments inlined). Basically, RACK recovery is not working as expected and we are taking RTOs during the recovery episode, causing the transfer to stall almost completely. This suggests RACK is pretty much broken; I am surprised we didn't see this before.

No.	Time	Source	Destination	Protocol	Length	Info
1705	0.160355	104.131.44.237	10.244.0.105	TCP	66	5201 → 65345 [ACK] Seq=1 Ack=26822758 Win=3143296 Len=0 TSval=262033161 TSecr=3065010041
1706	0.160355	10.244.0.105	104.131.44.237	TCP	10006	65345 → 5201 [ACK] Seq=26929690 Ack=1 Win=524288 Len=9940 TSval=3065010041 TSecr=262033161
1707	0.160418	10.244.0.105	104.131.44.237	TCP	10006	65345 → 5201 [ACK] Seq=26939630 Ack=1 Win=524288 Len=9940 TSval=3065010041 TSecr=262033161
1708	0.163067	10.244.0.105	104.131.44.237	TCP	1486	[TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26939630 Ack=1 Win=524288 Len=1420 TSval=3065010044 TSecr=262033161 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is probably the RACK TLP timer firing as the last packet wasn't ACKed for 30ms.

1709	0.163287	104.131.44.237	10.244.0.105	TCP	78	[TCP Dup ACK 1705#1] 5201 → 65345 [ACK] Seq=1 Ack=26822758 Win=3143296 Len=0 TSval=262033164 TSecr=3065010041 SLE=26939630 SRE=26941050

^^^ We see the peer sending a SACK block acking the receipt of the retransmitted segment

1710	0.363789	10.244.0.105	104.131.44.237	TCP	1486	[TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26822758 Ack=1 Win=524288 Len=1420 TSval=3065010244 TSecr=262033164

But it takes us almost 200ms before we retransmit the data at the sequence number requested by the Dup ACK above.

1711	0.364456	104.131.44.237	10.244.0.105	TCP	78	5201 → 65345 [ACK] Seq=1 Ack=26824178 Win=3141888 Len=0 TSval=262033365 TSecr=3065010244 SLE=26939630 SRE=26941050
1712	0.364623	10.244.0.105	104.131.44.237	TCP	1486	[TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26824178 Ack=1 Win=524288 Len=1420 TSval=3065010245 TSecr=262033365
1713	0.364836	104.131.44.237	10.244.0.105	TCP	78	5201 → 65345 [ACK] Seq=1 Ack=26825598 Win=3140864 Len=0 TSval=262033366 TSecr=3065010245 SLE=26939630 SRE=26941050

^^^^ Here we see an ACK from the peer, but then we don't transmit anything for 200ms, which to me looks like an RTO due to outstanding data. Ideally we should have retransmitted the next packet at this point rather than waiting for an RTO.

1714	0.565358	10.244.0.105	104.131.44.237	TCP	1486	[TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26825598 Ack=1 Win=524288 Len=1420 TSval=3065010446 TSecr=262033366
1715	0.566006	104.131.44.237	10.244.0.105	TCP	78	5201 → 65345 [ACK] Seq=1 Ack=26827018 Win=3140864 Len=0 TSval=262033567 TSecr=3065010446 SLE=26939630 SRE=26941050
1716	0.566161	10.244.0.105	104.131.44.237	TCP	1486	[TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26827018 Ack=1 Win=524288 Len=1420 TSval=3065010447 TSecr=262033567
1717	0.566355	104.131.44.237	10.244.0.105	TCP	78	5201 → 65345 [ACK] Seq=1 Ack=26828438 Win=3140864 Len=0 TSval=262033567 TSecr=3065010447 SLE=26939630 SRE=26941050

^^^ We see the same pattern repeat where we are now again waiting for an RTO

1718	0.766919	10.244.0.105	104.131.44.237	TCP	1486	[TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26828438 Ack=1 Win=524288 Len=1420 TSval=3065010648 TSecr=262033567
1719	0.767589	104.131.44.237	10.244.0.105	TCP	78	5201 → 65345 [ACK] Seq=1 Ack=26829858 Win=3140864 Len=0 TSval=262033768 TSecr=3065010648 SLE=26939630 SRE=26941050
1720	0.767786	10.244.0.105	104.131.44.237	TCP	1486	[TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26829858 Ack=1 Win=524288 Len=1420 TSval=3065010649 TSecr=262033768
1721	0.768025	104.131.44.237	10.244.0.105	TCP	78	5201 → 65345 [ACK] Seq=1 Ack=26831278 Win=3140864 Len=0 TSval=262033769 TSecr=3065010649 SLE=26939630 SRE=26941050

^^^ Ditto here.

1722	0.968751	10.244.0.105	104.131.44.237	TCP	1486	[TCP Retransmission] 65345 → 5201 [PSH, ACK] Seq=26831278 Ack=1 Win=524288 Len=1420 TSval=3065010849 TSecr=262033769
1723	0.969404	104.131.44.237	10.244.0.105	TCP	78	5201 → 65345 [ACK] Seq=1 Ack=26832698 Win=3140864 Len=0 TSval=262033970 TSecr=3065010849 SLE=26939630 SRE=26941050
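
As a rough back-of-the-envelope check, this pattern also explains the iperf numbers: each ~200ms cycle above delivers two 1420-byte segments (the RTO retransmission plus the one triggered by the resulting ACK), so the effective rate during these stalls is roughly

  2 x 1420 bytes x 8 bits / 0.2 s ≈ 114 kbit/s

which matches the ~114 Kbit/s intervals in the iperf output, i.e. during the stalls the connection is reduced to one RTO-driven exchange at a time.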

Would it make sense to add a network iperf test that has iptables rules similar to those in the issue description to ensure you can recreate the issue, verify changes, and catch regressions?

Actually, let me ask for one more thing. It's possible that Netstack is behaving mostly as expected: if the iptables rules were applied after the point where the pcap was captured, some of those inbound packets might have been dropped before they hit netstack. Could you use tcpdump with libpcap 1.9 or below inside gvisor to get a pcap, so that we can see exactly which packets gVisor is seeing?

As for adding a regression test: sure, once we confirm and understand the issue. Seeing a pcap from only one side can sometimes be misleading.

NOTE: libpcap 1.10+ removed support for recvmsg w/ AF_PACKET and exclusively uses AF_PACKET_RING, which is not yet implemented in gVisor.

I'll need to figure out permissions. I have libpcap-1.9.1-r2 and tcpdump-4.9.3-r2 installed but I can't run tcpdump as root on any interfaces:

# tcpdump -i eth0 -s 90 -w norack.pcap port 5201
tcpdump: eth0: You don't have permission to capture on that device
(socket: Operation not permitted)

I suspect we're just missing the --net-raw flag

You also need to pass -p to disable promiscuous mode as we don't support that in gvisor.
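
Putting the two suggestions together, the capture from inside the sandbox would look roughly like this (sketch only; it assumes raw socket support is enabled for the runtime, e.g. by adding --net-raw to the runsc runtimeArgs):

# -p disables promiscuous mode (not supported by gVisor); the snap length and
# port filter match the earlier capture attempt.
tcpdump -p -i eth0 -s 90 -w rack-port-5201.pcap port 5201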

rack-port-5201.pcap.gz was created from inside the container as requested; it has the same characteristics as the prior runs.

Thanks for the pcap. I wonder if you see the same behaviour when you use runc to run the container. The throughput graph seems to match your limit, then craters to near zero for a few seconds before spiking again. It probably works better without SACK because NewReno is more aggressive and does faster recovery. Could you run your workload with the same hashlimits but with runc instead of runsc and see what you get?
[chart: throughput over time computed from the attached pcap]
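
For the comparison, something along these lines should be enough (sketch only; the image name and SERVER are placeholders, and the same hashlimit rules stay in place on the host):

# Same workload and host rules; only the container runtime changes.
docker run --rm --runtime=runc  <iperf-image> iperf3 -c "${SERVER}" -t 90
docker run --rm --runtime=runsc <iperf-image> iperf3 -c "${SERVER}" -t 90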

Running with runc does not recreate the condition we see here. With runc the network throughput is a steady ~195Mb/s (the target of our iptables rules). Testing runc is something I did before filing this issue but I failed to give a full rundown of my efforts. Here are some of my other notes:

  • Minimized other factors: Altered the test environment to ensure dedicated CPUs were used for the iperf server and the iperf client, though no performance changes were observed since the iptables rules keep us well below the interface capabilities. All tests were performed in a dedicated cluster to minimize issues with noisy neighbors. Versions of all other components of the environment were kept constant throughout testing.
  • Verified we weren't CPU-constrained via CPU utilization and CPU throttling (cadvisor's container_cpu_cfs_throttled_*) stats. CPU usage and throttling were much lower when we had performance issues, so that does not appear to be a factor.
  • Ensured we could saturate the interface by removing the iptables rules and testing with gVisor versions before and after RACK was enabled by default (20210720 & 20210726). Both versions showed sustained throughput reaching the peak for the interface.
  • Testing was performed with SACK enabled until I discovered that RACK might be an issue. I turned off SACK as a means of disabling RACK for gVisor versions after 20210726 and as a workaround for the issue. With SACK enabled we have nominal performance with runc and with gvisor release-20210720 and prior. We also have nominal performance with the iptables rules removed and SACK enabled on gvisor release-20210726 and later.

Thanks for confirming. I will work with Nayana to see what is going haywire with our RACK implementation. I will keep you posted once we find something.

@rcj4747 Nayana and I spent some time trying to repro this, but I have been unable to reproduce the drop in throughput. I tried a few variations, including an internal benchmarking tool we have, but I am not seeing the same behaviour you are seeing.

Is it possible to set up a test environment that you could give me/nybidari@ access to, so we can debug and figure out what's going on?

I tried iptables and even tc to rate-limit the streams, but I see a steady 200 Mbit/s. I also tried varying latencies and it stays steady.
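
For reference, a tc-based shaper roughly equivalent to the hashlimit rule might look like the sketch below (not necessarily the exact commands used; note that tbf queues packets rather than dropping them, so it is not a perfect stand-in for the DROP-based rule):

# Token bucket filter on the host egress interface: ~200 Mbit/s sustained with
# a 75 MB burst, loosely mirroring the first rule in the description.
tc qdisc add dev eth0 root tbf rate 200mbit burst 75mb latency 400ms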

@hbhasker Thank you for trying to reproduce this. I will see if I can recreate in an accessible environment and provide access.