google / gvisor

Application Kernel for Containers

Home Page: https://gvisor.dev

Gonet hangs under load

ghjm opened this issue

Description

I am trying to set up a pair of gVisor tcpip stacks joined by fdbased endpoints running over SOCK_SEQPACKET sockets produced by socketpair. This works for me under light load, but hangs under heavy load. I have a reproducer here: https://github.com/ghjm/gvisortest.

I have tried this with Go 1.16, 1.17 and 1.18, and with various older and newer gVisor releases. I have also tried many combinations of SOCK_SEQPACKET, SOCK_DGRAM and various socket options, file-backed sockets not created by socketpair, and various options to stack.New and fdbased.New. All of these either break it completely or make no difference. I have also observed that the magic number for nConns seems to be 20: if I run it with 19 it works fine, but with 20 it hangs.

I am a novice with this codebase, and am perfectly prepared to believe the problem is in my code, but the reproducer is as pared down as I can get it.

Steps to reproduce

My results are:

Starting runNet 100
Finished runNet 100
Starting runGonet 10
Finished runGonet 10
Starting runGonet 100

No errors are returned at any point; it just hangs.

runsc version

n/a

docker version (if using docker)

n/a

uname

Linux graham-5520 5.16.18-200.fc35.x86_64 #1 SMP PREEMPT Mon Mar 28 14:10:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

kubectl (if using Kubernetes)

n/a

repo state (if built from source)

n/a

runsc debug logs (if available)

n/a

Thanks for the report. I will take a look at your repro and see if we can figure out what's going on.

I am investigating and am able to reproduce the issue. I will post an update once I understand what's going on here.

Okay, I think I understand the issue. The problem is the listen backlog, which gonet defaults to 10. That is very small for a server that receives bursts of 100 connections. When the accept queue is full, incoming connection packets are dropped, and because the clients retry in bursts with a deterministic backoff, the drops repeat and many of the connections take a very long time to complete. This is the call in gonet:

if err := ep.Listen(10); err != nil {

Bumping that number to, say, 1024 makes your test case pass. I will send a PR to bump this. Go's net package defaults to a much higher backlog value:

https://cs.opensource.google/go/go/+/refs/tags/go1.18.1:src/net/sock_linux.go;drc=refs%2Ftags%2Fgo1.18.1;l=66

On my system, that value is 4096.
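For reference, on Linux Go's net package derives its listen backlog from /proc/sys/net/core/somaxconn. The following is a simplified sketch of that idea, not a copy of the Go source. The names parseBacklog and listenBacklog and the fallback value of 128 are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// fallbackBacklog is an assumed default, used when the sysctl cannot be read.
const fallbackBacklog = 128

// parseBacklog returns the first whitespace-separated field of contents as
// an int, or fallback if the field is missing, non-numeric, or not positive.
func parseBacklog(contents string, fallback int) int {
	fields := strings.Fields(contents)
	if len(fields) == 0 {
		return fallback
	}
	n, err := strconv.Atoi(fields[0])
	if err != nil || n <= 0 {
		return fallback
	}
	return n
}

// listenBacklog reads the system-wide accept queue limit on Linux and falls
// back to fallbackBacklog on any error (for example, on other platforms).
func listenBacklog() int {
	data, err := os.ReadFile("/proc/sys/net/core/somaxconn")
	if err != nil {
		return fallbackBacklog
	}
	return parseBacklog(string(data), fallbackBacklog)
}

func main() {
	fmt.Println("listen backlog:", listenBacklog())
}
```

Running this on the machine in question should print the same value that `sysctl net.core.somaxconn` reports.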

Bumping the backlog argument from 10 to, say, 100 makes the 100-connection run pass just fine.
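To illustrate why the backlog matters for a burst, here is a toy model that uses a buffered channel as the accept queue. It is only a sketch and not how gVisor implements its accept queue: it assumes nothing calls Accept while the burst arrives, and it ignores retries and backoff. The function name burst is hypothetical.

```go
package main

import "fmt"

// burst models n simultaneous connection attempts against an accept queue
// that holds at most backlog pending connections, with no Accept calls
// draining it. It returns how many attempts were queued and how many were
// dropped.
func burst(n, backlog int) (queued, dropped int) {
	acceptQueue := make(chan int, backlog)
	for i := 0; i < n; i++ {
		select {
		case acceptQueue <- i:
			queued++
		default:
			// Queue is full: in a real stack the client has to retry later.
			dropped++
		}
	}
	return queued, dropped
}

func main() {
	for _, backlog := range []int{10, 100, 1024} {
		q, d := burst(100, backlog)
		fmt.Printf("backlog=%d queued=%d dropped=%d\n", backlog, q, d)
	}
}
```

In this model a backlog of 10 drops 90 of the 100 attempts, while a backlog of 100 or 1024 drops none. In a real server the accept loop drains the queue concurrently, so the actual number of drops depends on timing.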

This solved my problem. Thanks for your help!