Gonet hangs under load
ghjm opened this issue · comments
Description
I am trying to set up a pair of gVisor tcpip stacks joined by fdbased endpoints running over SOCK_SEQPACKET sockets produced by socketpair. This works for me under light load, but hangs under heavy load. I have a reproducer here: https://github.com/ghjm/gvisortest.
I have tried this with Go 1.16, 1.17 and 1.18, and various older and newer gVisor releases. I have also tried many combinations of SOCK_SEQPACKET, SOCK_DGRAM, various socket options, etc., file-backed sockets not using socketpair, and various options to stack.New and fdbased.New, all of which either break it completely or make no difference. I have also observed that the magic number for nConns seems to be 20: if I run it with 19 it works fine, but with 20 it hangs.
I am a novice with this codebase, and am perfectly prepared to believe the problem is in my code, but the reproducer is as pared down as I can get it.
Steps to reproduce
- git clone https://github.com/ghjm/gvisortest
- go build gvisortest.go
- ./gvisortest
My results are:
Starting runNet 100
Finished runNet 100
Starting runGonet 10
Finished runGonet 10
Starting runGonet 100
No errors are returned at any point, it just hangs.
runsc version
n/a
docker version (if using docker)
n/a
uname
Linux graham-5520 5.16.18-200.fc35.x86_64 #1 SMP PREEMPT Mon Mar 28 14:10:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
kubectl (if using Kubernetes)
n/a
repo state (if built from source)
n/a
runsc debug logs (if available)
n/a
Thanks for the report. I will take a look at your repro and see if we can figure out what's going on.
I am investigating and am able to reproduce the issue. I will post an update once I understand what's going on here.
Okay, I think I understand the issue. The problem is the listen backlog, which gonet defaults to 10. That is really small for a server that receives bursts of 100 connections. Many of the connections take a very long time to complete because the clients retry in bursts with a deterministic backoff, so packets are dropped repeatedly whenever the accept queue is full.
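To illustrate the mechanism described above, here is a minimal toy model, not gVisor code. It assumes discrete time ticks, a server that accepts a fixed number of queued connections per tick, and clients that retry after a deterministic exponential backoff. The names `simulate`, `result`, and `acceptsPerTick` are hypothetical and exist only for this sketch.

```go
package main

import "fmt"

// result summarizes one run of the toy model.
type result struct {
	drops      int // attempts rejected because the accept queue was full
	finishTick int // tick at which the last client entered the queue
}

// simulate models nClients that all try to connect at tick 0 to a listener
// whose accept queue holds at most backlog entries. The server removes up to
// acceptsPerTick queued connections per tick. A rejected client retries after
// a deterministic backoff of 2^failures ticks, so rejected clients all come
// back at the same moment. Both backlog and acceptsPerTick must be >= 1.
func simulate(nClients, backlog, acceptsPerTick int) result {
	nextTry := make([]int, nClients)  // tick of each client's next attempt
	failures := make([]int, nClients) // failed attempts so far, per client
	queued := make([]bool, nClients)  // true once the client is in the queue
	var res result
	queueLen, remaining := 0, nClients

	for tick := 0; remaining > 0; tick++ {
		// The server drains the accept queue first.
		queueLen -= acceptsPerTick
		if queueLen < 0 {
			queueLen = 0
		}
		// Then every client scheduled for this tick attempts to connect.
		for c := 0; c < nClients; c++ {
			if queued[c] || nextTry[c] != tick {
				continue
			}
			if queueLen < backlog {
				queueLen++
				queued[c] = true
				remaining--
				res.finishTick = tick
				continue
			}
			// Queue full: the attempt is dropped and the client backs off.
			res.drops++
			failures[c]++
			nextTry[c] = tick + (1 << failures[c])
		}
	}
	return res
}

func main() {
	for _, backlog := range []int{10, 100, 1024} {
		r := simulate(100, backlog, 1)
		fmt.Printf("backlog=%d drops=%d finishTick=%d\n", backlog, r.drops, r.finishTick)
	}
}
```

In this model, a backlog at least as large as the burst admits every client on the first attempt, while a small backlog forces repeated synchronized retries with ever longer waits. Real TCP stacks differ in many details, so treat this only as intuition for why a backlog of 10 stalls under a burst of 100.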
gvisor/pkg/tcpip/adapters/gonet/gonet.go
Line 85 in 1084106
Bumping that number to, say, 1024 makes your test case pass. I will send a PR to bump this. Go's net package defaults to a really high backlog value, which on my system is 4096.
Bumping the backlog argument from 10 to, say, 100 makes the 100-connection case pass just fine.
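For readers wondering where a figure like 4096 comes from: on Linux, Go's net package derives its listen backlog from the kernel's `somaxconn` setting. The sketch below approximates that lookup using only the standard library. The helper names `parseBacklog` and `systemBacklog` and the fallback of 128 are assumptions made for this example, and the real runtime logic has additional details not reproduced here.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// fallbackBacklog is an assumed default used when the kernel setting cannot
// be read or parsed (for example, on a non-Linux system).
const fallbackBacklog = 128

// parseBacklog converts the contents of the somaxconn file to an integer,
// returning fallbackBacklog if the value is missing or not positive.
func parseBacklog(contents string) int {
	n, err := strconv.Atoi(strings.TrimSpace(contents))
	if err != nil || n <= 0 {
		return fallbackBacklog
	}
	return n
}

// systemBacklog reads the backlog limit from the given file path.
func systemBacklog(path string) int {
	data, err := os.ReadFile(path)
	if err != nil {
		return fallbackBacklog
	}
	return parseBacklog(string(data))
}

func main() {
	// On Linux this file holds the system-wide cap on listen backlogs.
	fmt.Println("listen backlog limit:", systemBacklog("/proc/sys/net/core/somaxconn"))
}
```

Running this on the same machine as the gonet listener gives a rough point of comparison between the system's backlog limit and the gonet default of 10.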
This solved my problem. Thanks for your help!