[RFE] Prevent the OOM killer from hitting critical services
pothos opened this issue · comments
Current situation
When running low on memory, Flatcar currently relies on the kernel's OOM killer to kill processes. Flatcar does not make use of systemd-oomd yet. When the kernel kills processes, it can hit critical system services.
Impact
Hitting critical system services can render the system unresponsive as observed by @jepio.
Ideal future situation
Instead of killing processes as a last resort, we can use systemd-oomd to evaluate cgroup memory usage and terminate whole cgroups rather than single processes, and do so earlier than the kernel would, so that the system stays responsive. Terminating whole cgroups makes the action more coordinated and impactful than killing random child or parent processes. Using cgroup memory accounting also means the termination is more likely to hit something that is actually responsible for the OOM than the kernel OOM killer's choice would be.
To prevent both the kernel OOM killer and systemd-oomd from hitting critical services, one can set OOMScoreAdjust= and MemoryMin=.
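As a rough illustration of these two settings, a drop-in for a critical unit might look like the following. The unit name (sshd.service), the file path, and both values are assumptions chosen for the example, not recommendations from this issue.

```ini
# /etc/systemd/system/sshd.service.d/10-oom-protect.conf  (hypothetical path and unit)
[Service]
# Range is -1000..1000; lower values make the kernel OOM killer less likely to pick this service.
OOMScoreAdjust=-900
# Memory below this amount is protected from reclaim (cgroup v2 only).
# Illustrative value; the effective protection is also capped by the ancestors' memory.min.
MemoryMin=64M
```

After `systemctl daemon-reload` and a restart of the unit, `systemctl show sshd.service -p OOMScoreAdjust -p MemoryMin` should report the resulting values.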
To steer systemd-oomd towards killing a certain unit, one can set ManagedOOMSwap=kill and ManagedOOMMemoryPressure=kill.
Implementation options
Enable systemd-oomd by default on Flatcar.
Set OOMScoreAdjust= and MemoryMin= for critical service units.
Set a drop-in for docker .scope units to have ManagedOOMSwap=kill and ManagedOOMMemoryPressure=kill.
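The last option could be sketched as a drop-in like the one below. It assumes that systemd's prefix-based drop-in directories (a directory named after the unit name truncated at a dash) also match the transient docker-….scope units. The file name is made up, and this has not been tested on Flatcar.

```ini
# /etc/systemd/system/docker-.scope.d/10-oomd.conf  (hypothetical file name)
[Scope]
# Opt these cgroups in to systemd-oomd's swap-based kill handling.
ManagedOOMSwap=kill
# Opt these cgroups in to systemd-oomd's memory-pressure-based kill handling.
ManagedOOMMemoryPressure=kill
```

Whether systemd-oomd is running can be checked with `systemctl status systemd-oomd`, and `oomctl` dumps the cgroups it currently monitors.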
Additional information
Docker containers run under docker-….scope, which is part of system.slice. The same is true for other user-defined workloads that don't spawn new cgroups directly under the root slice. Therefore, setting protections for the whole system slice is probably too broad; we would have to identify which units need to keep running and maintain this "allow list" for as long as the upstream units don't set OOMScoreAdjust= and MemoryMin= themselves.
We move workloads into a slice to avoid them breaking the system.
We've been doing it for a couple of years at this point and have never had crashes of Flatcar/OS components.
@till can you share the details of your config? we might draw inspiration from that
@jepio We do this for docker currently, so we configure cgroup-parent in /etc/docker/daemon.json.
The slice itself looks similar to this:
# https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html
[Slice]
CPUAccounting=yes
CPUQuota={{ cpu_quota_percent }}%
MemoryAccounting=yes
# Systemd > 231 (ignored for older versions)
MemoryHigh={{ memory_high_percent }}%
MemoryMax={{ memory_max_percent }}%
MemorySwapMax=0
# Systemd 219, as on CoreOS7
MemoryLimit={{ memory_limit_mb }}M
# Ordering dependencies belong in [Unit]; systemd ignores Before= under [Install]
[Unit]
Before=docker.service