google / gvisor

Application Kernel for Containers

Home Page:https://gvisor.dev

Strange behavior between PIDs limit and OOM reaper on KVM platform

jseba opened this issue

Description

While testing the KVM platform, I noticed a strange interaction between the PIDs limit and the OOM reaper. The OOM reaper kills the sandbox for exceeding memory constraints instead of fork() failing due to the PID cgroup constraint.

I was using a bash fork bomb to test this at first, but I get the same behavior with a simple program that forks endlessly until it gets EAGAIN. I set the PID limit pretty low (1000), so I would expect to hit it pretty quickly, but instead the program runs until the kernel OOM reaper kills it for using over 127TB(!) of virtual memory. If I set a memory limit for the container, then it's the cgroup reaper that kills the sandbox; if I set the memory limit to unlimited, it's the global reaper that kills it.
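For readers who want something tamer than a fork bomb, here is a minimal sketch of a "fork until EAGAIN" test. It is not the program used in this report: the function name `fork_until_eagain` and the `max_children` safety cap are made up for illustration. Children just sleep until the parent kills them, so the only thing that grows is the process count.

```python
import os
import signal


def fork_until_eagain(max_children=2000):
    """Fork idle children until fork() fails or the safety cap is reached.

    Returns (children_created, errno_from_fork_or_None).
    max_children is a hypothetical guard so the test cannot run away.
    """
    children = []
    err = None
    try:
        while len(children) < max_children:
            try:
                pid = os.fork()
            except OSError as e:
                # With a working PID limit this is expected to be EAGAIN.
                err = e.errno
                break
            if pid == 0:
                # Child: sleep until the parent sends SIGKILL.
                try:
                    signal.pause()
                finally:
                    os._exit(0)
            children.append(pid)
    finally:
        # Parent: always clean up every child that was created.
        for pid in children:
            os.kill(pid, signal.SIGKILL)
            os.waitpid(pid, 0)
    return len(children), err

# Example call (not executed here):
#   count, err = fork_until_eagain()
#   print(count, err == errno.EAGAIN)   # needs "import errno"
```

With `--pids-limit 1000` and the default cap of 2000, an enforced limit should make this return an EAGAIN errno somewhere below 1000 children; hitting the cap with `None` would suggest the limit is not being applied to the workload.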

Using the ptrace platform, I get the expected "cgroup: fork rejected by pids controller" message in the kernel logs.

It almost seems like the KVM platform doesn't honor the PID limit set in the container config? I'm not sure if I'm doing something wrong, so if I'm misconfiguring something, I would definitely appreciate knowing what I need to do to make this work as expected.

Steps to reproduce

Using an ubuntu image:

$ sudo docker run --rm --runtime=runsc-kvm --pids-limit 1000 --memory 0 ubuntu /bin/bash -c ":(){ :|:&}; :; sleep 30;"

Results in the following kernel message:
Oct 15 16:15:29 jseba-laptop kernel: Out of memory: Killed process 2358291 (exe) total-vm:137192174044kB, anon-rss:5916992kB, file-rss:1740kB, shmem-rss:9011620kB, UID:65534 pgtables:30680kB oom_score_adj:0
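As a sanity check on the 127TB figure, the kB fields in that message can be pulled out and converted. This is only a sketch: the helper names `parse_kb_fields` and `kb_to_tib` are made up here, and the regex is assumed to fit this particular message format rather than every kernel version's OOM output.

```python
import re

# The kernel message from this report, without the syslog prefix.
OOM_LINE = (
    "Out of memory: Killed process 2358291 (exe) total-vm:137192174044kB, "
    "anon-rss:5916992kB, file-rss:1740kB, shmem-rss:9011620kB, UID:65534 "
    "pgtables:30680kB oom_score_adj:0"
)


def parse_kb_fields(line):
    """Return a dict of every 'name:<number>kB' field found in the line."""
    return {name: int(value) for name, value in re.findall(r"([\w-]+):(\d+)kB", line)}


def kb_to_tib(kb):
    """Convert kB (treated as KiB, as the kernel reports them) to TiB."""
    return kb / 1024**3


fields = parse_kb_fields(OOM_LINE)
resident_kb = fields["anon-rss"] + fields["file-rss"] + fields["shmem-rss"]
print(f"total-vm: {kb_to_tib(fields['total-vm']):.2f} TiB")
print(f"resident: {resident_kb / 1024**2:.1f} GiB")
```

For the line above this gives a total-vm of about 127.77 TiB against roughly 14.2 GiB actually resident.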

I have the runtime "runsc-kvm" configured in /etc/docker/daemon.json as:

{
    "runtimes": {
        "runsc": {
            "path": "/usr/local/bin/runsc"
        },
        "runsc-kvm": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": [
                "--platform=kvm"
            ]
        }
    }
}

runsc version

runsc version release-20210823.0

docker version (if using docker)

Client: Docker Engine - Community
 Version:           20.10.9
 API version:       1.41
 Go version:        go1.16.8
 Git commit:        c2ea9bc
 Built:             Mon Oct  4 16:08:55 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.9
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.8
  Git commit:       79ea9d3
  Built:            Mon Oct  4 16:07:01 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.11
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

uname

Linux jseba-laptop 5.10.0-8-amd64 #1 SMP Debian 5.10.46-4 (2021-08-03) x86_64 GNU/Linux

kubectl (if using Kubernetes)

No response

repo state (if built from source)

No response

runsc debug logs (if available)

No response

> It almost seems like the KVM platform doesn't honor the PID limit set in the container config?

I would rephrase this: "gVisor doesn't honor the PID limit set in the container config". With the ptrace platform, we fork one system process for each guest process, which is why you see the expected behavior there. The KVM platform uses hardware virtualization to manage guest address spaces, so the number of guest processes doesn't affect the number of system threads.
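One way to observe this from the host is to compare the sandbox cgroup's `pids.current` with `pids.max` while the workload runs. The sketch below assumes the pids controller files are reachable under some cgroup directory; the helper name `read_pids_cgroup` is made up, and the directory path is not something stated in this thread (it depends on the cgroup version and the Docker cgroup driver).

```python
from pathlib import Path


def read_pids_cgroup(cgroup_dir):
    """Read the pids controller counters from a cgroup directory.

    Returns (current, limit); limit is None when pids.max is "max".
    cgroup_dir is an assumption: pass the host directory of the
    container's cgroup, wherever that lives on your system.
    """
    d = Path(cgroup_dir)
    current = int((d / "pids.current").read_text().strip())
    raw_max = (d / "pids.max").read_text().strip()
    limit = None if raw_max == "max" else int(raw_max)
    return current, limit
```

If the explanation above holds, `current` should climb toward the limit on ptrace as guest processes are created, and stay well below it on KVM no matter how many guest processes exist.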

@mrahatm is working on cgroup support in gVisor. Rahat, do we support the pids controller? Can we propagate limits from the container config?

We currently don't support the PID controller in gVisor's cgroupfs; no one has asked for it yet. I'll take a look at how the container config maps to control files and see if we can implement a basic PID controller.

A friendly reminder that this issue had no activity for 120 days.