OOM OCI Events Broken for Kubernetes + CgroupsV2
jcodybaker opened this issue
Description
When gVisor runs under Kubernetes with cgroups v2 enabled, guest OOMs are reported as either exit code 128 or 143 (SIGTERM + 128), and the OCI OOM event is not published.
In this configuration, gVisor runs in a child cgroup (e.g. /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice/cri-containerd-5de40e072687280064e120f4b9134cc308d956c7ca22545a36b81898d1e7d719.scope) of the pod's cgroup (e.g. /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice). The child cgroup doesn't specify limits itself; enforcement is inherited from the pod's cgroup. gVisor watches for OOMs with an inotify watch on the child cgroup's memory.events file, and it seems these events do not propagate down to the child.
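For reference, here is a minimal sketch of that watch mechanism as I understand it (my own illustration, not gVisor's actual code; the cgroup path is passed in as a placeholder):

// watch_oom.go - illustrative only; not gVisor's implementation.
package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

func main() {
	// The container's child cgroup directory; in this issue that is the
	// cri-containerd-*.scope directory under the pod slice.
	path := os.Args[1] + "/memory.events"

	fd, err := unix.InotifyInit1(0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	// memory.events is updated in place by the kernel, so IN_MODIFY fires
	// whenever any of its counters change.
	if _, err := unix.InotifyAddWatch(fd, path, unix.IN_MODIFY); err != nil {
		panic(err)
	}

	buf := make([]byte, 4096)
	for {
		// Blocks until the kernel touches this file. If the OOM kill is
		// accounted to an ancestor cgroup (the one whose limit was hit),
		// this file never changes and we never wake up.
		if _, err := unix.Read(fd, buf); err != nil {
			panic(err)
		}
		data, err := os.ReadFile(path)
		if err != nil {
			panic(err)
		}
		for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
			if strings.HasPrefix(line, "oom_kill ") {
				fmt.Println("memory.events:", line)
			}
		}
	}
}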
I've been able to illustrate this with tail -n +1 memory.events cgroup.procs cri-containerd-*/memory.events cri-containerd-*/cgroup.procs, which displays the cgroup memberships and memory.events counters. Because the child cgroup is torn down immediately after gVisor exits, it's possible that memory.events is updated but the update is missed or mishandled by gVisor. That said, the child shows 0 for all counters, including max, which makes me suspect the events are never accounted there, since the child has no limits of its own.
$ pwd
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode015d02d_42fa_4cac_bbb0_5c23c946423e.slice
$ tail -n +1 memory.events cgroup.procs cri-containerd-*/memory.events cri-containerd-*/cgroup.procs
==> memory.events <==
low 0
high 0
max 566
oom 23
oom_kill 23
oom_group_kill 0
==> cgroup.procs <==
==> cri-containerd-5de40e072687280064e120f4b9134cc308d956c7ca22545a36b81898d1e7d719.scope/memory.events <==
low 0
high 0
max 0
oom 0
oom_kill 0
oom_group_kill 0
==> cri-containerd-5de40e072687280064e120f4b9134cc308d956c7ca22545a36b81898d1e7d719.scope/cgroup.procs <==
139422
139423
139452
139473
139540
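If these counters are accurate, the oom/oom_kill events are accounted only to the pod cgroup where the limit is actually enforced, never to the limitless child. One possible direction (my speculation, not gVisor's current behavior) would be to walk up from the container cgroup to the nearest ancestor that sets a real memory.max, and watch that cgroup's memory.events instead:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

const cgroupRoot = "/sys/fs/cgroup"

// nearestLimited returns the deepest cgroup at or above dir whose memory.max
// is set to something other than "max", i.e. where a limit is enforced.
func nearestLimited(dir string) (string, error) {
	for {
		data, err := os.ReadFile(filepath.Join(dir, "memory.max"))
		if err == nil && strings.TrimSpace(string(data)) != "max" {
			return dir, nil
		}
		if dir == cgroupRoot {
			return "", fmt.Errorf("no memory.max limit found below %s", cgroupRoot)
		}
		dir = filepath.Dir(dir)
	}
}

func main() {
	// Container cgroup path, e.g. the cri-containerd-*.scope dir above.
	limited, err := nearestLimited(os.Args[1])
	if err != nil {
		panic(err)
	}
	// Watch this file (e.g. with the inotify loop sketched earlier) instead
	// of the child's; in this issue's layout it would resolve to the pod
	// slice, whose memory.events does record the OOM kills.
	fmt.Println("watch:", filepath.Join(limited, "memory.events"))
}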
Steps to reproduce
Kubernetes + gVisor + a cgroups v2-based OS (Debian Bookworm):
https://gist.github.com/jcodybaker/dda983722831263536be04538e5eb7de
Create a pod whose workload exceeds its memory limit:
cat << 'EOF' | kubectl create -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: example
  name: example
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - command:
        - bash
        - -c
        - big_var=data; while true; do big_var="$big_var$big_var"; done
        image: ubuntu:jammy
        name: ubuntu
        resources:
          limits:
            cpu: "1"
            memory: 512Mi
          requests:
            cpu: 200m
            ephemeral-storage: 200M
            memory: "214748364"
      dnsPolicy: Default
      hostNetwork: true
      restartPolicy: Always
      runtimeClassName: gvisor-ptrace
      tolerations:
      - operator: Exists
EOF
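While the pod runs, a rough helper like the following (my own sketch; the path and glob pattern assume the cgroup layout shown above) can poll the pod cgroup and its cri-containerd-* children for non-zero oom_kill counters, to rule out the possibility that the child's memory.events is updated and then lost in the teardown race:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

func main() {
	// Pod cgroup dir, e.g. .../kubepods-burstable-pod<uid>.slice.
	pod := os.Args[1]
	for {
		dirs, _ := filepath.Glob(filepath.Join(pod, "cri-containerd-*"))
		for _, dir := range append([]string{pod}, dirs...) {
			data, err := os.ReadFile(filepath.Join(dir, "memory.events"))
			if err != nil {
				continue // the scope may already be torn down
			}
			for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
				if strings.HasPrefix(line, "oom_kill ") && !strings.HasSuffix(line, " 0") {
					fmt.Printf("%s: %s\n", dir, line)
				}
			}
		}
		time.Sleep(100 * time.Millisecond)
	}
}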
Wait for the pod to crash, then inspect its status. An OOM kill would normally surface as Reason: OOMKilled with exit code 137 (SIGKILL + 128); instead the container reports:
Last State:     Terminated
  Reason:       Error
  Exit Code:    128
  Started:      Thu, 16 Nov 2023 10:12:13 -0500
  Finished:     Thu, 16 Nov 2023 10:12:18 -0500
runsc version
runsc version release-20231106.0
spec: 1.1.0-rc.1
docker version (if using docker)
No response
uname
Linux node 6.1.0-12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07) x86_64 GNU/Linux
kubectl (if using Kubernetes)
v1.28.2
repo state (if built from source)
No response
runsc debug logs (if available)
No response