[RFE] Prevent the OOM killer from hitting critical services

pothos opened this issue 3 months ago

Current situation

When running low on memory, Flatcar currently relies on the kernel's OOM killer to kill processes. Flatcar does not make use of systemd-oomd yet. When the kernel kills processes, it can hit critical system services.

Impact

Hitting critical system services can render the system unresponsive as observed by @jepio.

Ideal future situation

Instead of killing processes as a last resort, we can use systemd-oomd to evaluate cgroup memory usage and terminate whole cgroups rather than single processes, and do this earlier than the kernel would, so that the system stays responsive. Terminating whole cgroups makes the action more coordinated and impactful than killing random child or parent processes. Relying on cgroup memory accounting also makes the termination more likely to hit something that is actually responsible for the OOM than the kernel OOM killer's choice would be.
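
As a rough illustration of what "earlier than the kernel would" could mean in practice: systemd-oomd reads its thresholds from oomd.conf. The sketch below assumes a drop-in at a hypothetical path, and the values are only illustrative, not a tested Flatcar default. systemd-oomd also needs cgroup v2 and kernel pressure stall information (PSI) to work.

```ini
# /etc/systemd/oomd.conf.d/10-flatcar.conf (hypothetical path, illustrative values)
[OOM]
# Swap usage threshold that triggers action on units with ManagedOOMSwap=kill
SwapUsedLimit=90%
# Memory pressure threshold for units with ManagedOOMMemoryPressure=kill ...
DefaultMemoryPressureLimit=60%
# ... which must be exceeded for at least this long before anything is killed
DefaultMemoryPressureDurationSec=30s
```

Lower limits make systemd-oomd act sooner, at the cost of killing workloads that might have recovered on their own.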

To prevent both the kernel OOM killer and systemd-oomd from hitting critical services, one can set OOMScoreAdjust= and MemoryMin=.
To steer systemd-oomd towards killing a certain unit, one can set ManagedOOMSwap=kill and ManagedOOMMemoryPressure=kill.
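
A minimal sketch of the protection side, assuming a critical unit called example-critical.service (a made-up name) and arbitrary values that would need tuning per service:

```ini
# /etc/systemd/system/example-critical.service.d/10-oom-protect.conf
# (hypothetical unit name and values)
[Service]
# Range is -1000..1000; lower values make the kernel OOM killer less likely
# to pick processes of this service
OOMScoreAdjust=-900
# Memory below this amount is protected from reclaim (cgroup v2 only)
MemoryMin=64M
```

After adding such a drop-in, a `systemctl daemon-reload` and a restart of the unit would be needed for the settings to apply.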

Implementation options

Enable systemd-oomd by default on Flatcar.
Set OOMScoreAdjust= and MemoryMin= for critical service units.
Set a drop-in for docker .scope units to have ManagedOOMSwap=kill and ManagedOOMMemoryPressure=kill.
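
The docker scope drop-in from the last item could look roughly like this. It relies on systemd's prefix-based drop-in directories, where a directory named docker-.scope.d applies to every unit whose name starts with docker- and ends in .scope. The file name is hypothetical, and whether the transient scopes created for containers pick this up on Flatcar is an assumption that would need testing.

```ini
# /etc/systemd/system/docker-.scope.d/10-managed-oom.conf (hypothetical file name)
[Scope]
# Let systemd-oomd act on these cgroups when swap usage is too high
ManagedOOMSwap=kill
# Let systemd-oomd act on these cgroups when memory pressure is too high
ManagedOOMMemoryPressure=kill
```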

Additional information

Docker containers run under docker-….scope, which is part of system.slice. The same is true for other user-defined workloads that don't spawn new cgroups directly under the root slice. Therefore, setting protections for the whole system slice is probably too broad. We would instead have to identify which units we need to keep running, and maintain this "allow list" for as long as the upstream units don't already set OOMScoreAdjust= and MemoryMin=.

Till! · Answer 1 · Tue Apr 16 2024 17:27:51 GMT+0800 (China Standard Time)

We move workloads into a slice to avoid them breaking the system.

Been doing it for a couple years atp, never got to having crashes of Flatcar/OS components.

Jeremi Piotrowski · Answer 2 · Wed Apr 17 2024 16:31:17 GMT+0800 (China Standard Time)

@till can you share the details of your config? we might draw inspiration from that

Till! · Answer 3 · Wed Apr 17 2024 18:16:48 GMT+0800 (China Standard Time)

@jepio We do this for docker currently, so we configure cgroup-parent in /etc/docker/daemon.json.

The slice itself looks similar to this:

# https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html
[Slice]
CPUAccounting=yes
CPUQuota={{ cpu_quota_percent }}%
MemoryAccounting=yes
# Systemd > 231 (ignored for older versions)
MemoryHigh={{ memory_high_percent }}%
MemoryMax={{ memory_max_percent }}%
MemorySwapMax=0
# Systemd 219, as on CoreOS7
MemoryLimit={{ memory_limit_mb }}M

# Before= is a [Unit] option; systemd ignores it under [Install]
[Unit]
Before=docker.service
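
For illustration, a filled-in version of such a slice on a cgroup v2 system might look like the sketch below. The slice name workload.slice and all values are assumptions, not the configuration described above. With Docker's systemd cgroup driver, the cgroup-parent key in /etc/docker/daemon.json would then be set to the slice name.

```ini
# /etc/systemd/system/workload.slice (hypothetical name and values)
[Unit]
Description=Slice for container workloads
Before=docker.service

[Slice]
# Cap the slice at four CPUs' worth of time
CPUQuota=400%
# Throttle and reclaim aggressively above 80% of RAM
MemoryHigh=80%
# Hard limit; the kernel OOM killer acts inside the slice beyond this
MemoryMax=90%
# Keep workloads out of swap
MemorySwapMax=0
```

The legacy MemoryLimit= line is left out here because it only applies to cgroup v1.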