rfjakob / earlyoom

earlyoom - Early OOM Daemon for Linux

Kills multiple processes at once

haarp opened this issue

Pretty much every time earlyoom triggers, it tends to kill multiple processes within the same second.

Example (running with -p -m 5,2 -s 100,100 -N logger):

Feb 27 09:34:06 host root[7509]: dialog-warning earlyoom Low memory! Killing process 32194 vmware_vmx
Feb 27 09:34:06 host root[7511]: dialog-warning earlyoom Low memory! Killing process 26773 firefox

Each of these processes would've been enough to free plenty of memory on their own. Wouldn't it be more sensible to minimize damage by killing one process first and waiting just a little to see if the situation improved, before moving on to the next process?

@haarp Earlyoom sleeps 1 sec after each corrective action (main.c:375):

            // With swap enabled, the kernel seems to need more than 100ms to free the memory
            // of the killed process. This means that earlyoom would immediately kill another
            // process. Sleep a little extra to give the kernel time to free the memory.
            // (Yes, this will sleep even if the kill has failed. Does no harm and keeps the
            // code simple.)
            if (m.SwapTotalMiB > 0) {
                usleep(cooldown_ms * 1000);
            }

You can patch the source to increase the sleep duration, or try nohang, whose config file has configurable sleep durations.

@rfjakob

Same problem:

Thanks. I've been running earlyoom but when I overcommit memory a lot, it will start to be too trigger happy.

https://unix.stackexchange.com/questions/423261/how-to-avoid-high-latency-near-oom-situation#comment936770_506974

Maybe increasing the sleep time after corrective actions is a good idea.

killdelay.sh

#!/bin/bash
# Kill the given PID, then print MemAvailable every 0.1 s for 2 s.
set -eu
PID=$1
kill "$PID"
for i in $(seq 1 20); do
    grep MemAvailable /proc/meminfo
    sleep 0.1
done

Looks like it takes > 1s to return the 18GB to MemAvailable:

$ sudo ./killdelay.sh 10030
(interval 0.1s)
MemAvailable:    3466280 kB
MemAvailable:    3466280 kB
MemAvailable:    3466280 kB
MemAvailable:    3466280 kB
MemAvailable:    3466036 kB
MemAvailable:    3466036 kB
MemAvailable:    3466036 kB
MemAvailable:    3461248 kB
MemAvailable:    3461248 kB
MemAvailable:    3461248 kB
MemAvailable:    5322496 kB
MemAvailable:    8029712 kB
MemAvailable:   10724496 kB
MemAvailable:   13429832 kB
MemAvailable:   16145336 kB
MemAvailable:   18859064 kB
MemAvailable:   21544860 kB
MemAvailable:   22372484 kB
MemAvailable:   22372596 kB
MemAvailable:   22372632 kB

It looks like we can detect this situation. Sampling again at 0.1-second intervals, we see that /proc/[pid]/statm returns all zeros while the kernel is cleaning up the process. When /proc/[pid] disappears, all the memory has been returned to the system.

1 0 0 MemAvailable:    3257276 kB VmRSS:	18875824 kB statm: 4719170 4718956 351 4 0 4718670 0
1 0 0 MemAvailable:    3257276 kB VmRSS:	18875824 kB statm: 4719170 4718956 351 4 0 4718670 0
1 0 0 MemAvailable:    3256400 kB VmRSS:	18875824 kB statm: 4719170 4718956 351 4 0 4718670 0
***sigkill***
0 0 0 MemAvailable:    3256852 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:    3256836 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:    3257136 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:    3263624 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:    3263832 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:    3264032 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:    3264032 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:    3264032 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:    3264020 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:    5299464 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:    8073872 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:   10852200 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:   13666400 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:   16456496 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:   19241276 kB  statm: 0 0 0 0 0 0 0
0 0 0 MemAvailable:   21993804 kB  statm: 0 0 0 0 0 0 0
***/proc/pid disappears***
0 1 1 MemAvailable:   22172808 kB  statm: 
0 1 1 MemAvailable:   22172920 kB  statm: 
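
The script that produced the trace above is not shown in the thread, and the meaning of its three leading columns is not explained. A minimal sketch of a similar sampler, assuming a Linux /proc filesystem, might look like this. The names sample_pid and watch_pid are hypothetical, and the three unexplained columns are left out.

```shell
#!/bin/sh
# Hypothetical helper: print one sample line for PID $1.
# Fields: MemAvailable from /proc/meminfo, VmRSS from /proc/PID/status,
# and the raw contents of /proc/PID/statm (empty once the PID is gone).
sample_pid() {
    pid=$1
    avail=$(grep '^MemAvailable:' /proc/meminfo || true)
    rss=$(grep '^VmRSS:' "/proc/$pid/status" 2>/dev/null || true)
    statm=$(cat "/proc/$pid/statm" 2>/dev/null || true)
    echo "$avail $rss statm: $statm"
}

# Hypothetical helper: sample PID $1 every 0.1 s, $2 times (default 20).
watch_pid() {
    pid=$1
    count=${2:-20}
    i=0
    while [ "$i" -lt "$count" ]; do
        sample_pid "$pid"
        sleep 0.1
        i=$((i + 1))
    done
}

# Example (not run here): start sampling, then kill the process from
# another terminal and watch statm drop to zeros.
#   watch_pid 10030 40
```

Once the process is gone, the VmRSS and statm fields are simply empty, which matches the last two lines of the trace.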

Fixed in earlyoom v1.3-rc1. We now wait for the process to release all its memory before checking whether we should kill another one.
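
The thread does not show how v1.3-rc1 implements this wait. As a rough illustration of the idea only, and not earlyoom's actual C code, a shell sketch could poll until /proc/[pid] disappears or a timeout expires. The name wait_for_exit, the 0.1 s polling interval, and the default timeout are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: wait until /proc/PID disappears.
# $1 = PID, $2 = timeout in tenths of a second (default 100, i.e. 10 s).
# Returns 0 if the process is gone, 1 on timeout.
wait_for_exit() {
    pid=$1
    limit=${2:-100}
    waited=0
    while [ -e "/proc/$pid" ]; do
        if [ "$waited" -ge "$limit" ]; then
            return 1
        fi
        sleep 0.1
        waited=$((waited + 1))
    done
    return 0
}

# Example (not run here):
#   kill -9 "$PID"
#   wait_for_exit "$PID" 100 || echo "timeout waiting for pid $PID"
```

Note that a killed child that its parent has not yet reaped stays visible in /proc as a zombie, so this check can wait longer than the memory release itself takes.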

Nice, thank you!

@rfjakob

mem avail: 1863 of 13855 MiB (13 %), swap free: 0 of 0 MiB ( 0 %)
mem avail: 1676 of 13855 MiB (12 %), swap free: 0 of 0 MiB ( 0 %)
mem avail: 1523 of 13855 MiB (10 %), swap free: 0 of 0 MiB ( 0 %)
low memory! at or below SIGTERM limits: mem 10 %, swap 10 %
killing process 27366 "bash" with signal 15: badness 539, VmRSS 7464 MiB
timeout waiting for pid 27366 after signal 15
kill failed: Success. Sleeping 1 second
mem avail: 8 of 13855 MiB ( 0 %), swap free: 0 of 0 MiB ( 0 %)
low memory! at or below SIGKILL limits: mem 5 %, swap 5 %
killing process 27366 "bash" with signal 9: badness 678, VmRSS 9380 MiB
process 27366 exited after 0.3 seconds
mem avail: 9402 of 13855 MiB (67 %), swap free: 0 of 0 MiB ( 0 %)
mem avail: 9399 of 13855 MiB (67 %), swap free: 0 of 0 MiB ( 0 %)
mem avail: 9392 of 13855 MiB (67 %), swap free: 0 of 0 MiB ( 0 %)
mem avail: 9387 of 13855 MiB (67 %), swap free: 0 of 0 MiB ( 0 %)
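
The log above shows two thresholds at work: SIGTERM limits of 10 % and SIGKILL limits of 5 %. A small sketch of that decision follows, assuming that both available memory and free swap must be at or below a limit before it triggers, as the "mem ..., swap ..." wording of the log suggests. choose_signal is a hypothetical name, not an earlyoom function, and for brevity it uses one limit for both memory and swap.

```shell
#!/bin/sh
# Hypothetical helper: pick a signal from integer percentages.
# $1 = available memory %, $2 = free swap %
# $3 = SIGTERM limit % (default 10), $4 = SIGKILL limit % (default 5)
# Prints 9, 15, or nothing if neither limit is reached.
choose_signal() {
    mem=$1
    swap=$2
    term_limit=${3:-10}
    kill_limit=${4:-5}
    if [ "$mem" -le "$kill_limit" ] && [ "$swap" -le "$kill_limit" ]; then
        echo 9
    elif [ "$mem" -le "$term_limit" ] && [ "$swap" -le "$term_limit" ]; then
        echo 15
    fi
}
```

With the values from the log, choose_signal 10 0 prints 15 and choose_signal 0 0 prints 9, while choose_signal 13 0 prints nothing.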

mem avail: 8 of 13855 MiB ( 0 %), swap free: 0 of 0 MiB ( 0 %)

How do you like it?

kill failed: Success.

What does it mean?

mem avail:  1860 of 13855 MiB (13 %), swap free:    0 of    0 MiB ( 0 %)
mem avail:  1662 of 13855 MiB (11 %), swap free:    0 of    0 MiB ( 0 %)
mem avail:  1524 of 13855 MiB (10 %), swap free:    0 of    0 MiB ( 0 %)
low memory! at or below SIGTERM limits: mem 10 %, swap 10 %
killing process 27541 "bash" with signal 15: badness 550, VmRSS 7609 MiB
process 27541 exited after 1.9 seconds
mem avail:  4626 of 13855 MiB (33 %), swap free:    0 of    0 MiB ( 0 %)
mem avail:  1686 of 13855 MiB (12 %), swap free:    0 of    0 MiB ( 0 %)
Apr 28 01:23:30 PC kernel: Out of memory: Kill process 27541 (bash) score 572 or sacrifice child
Apr 28 01:23:30 PC kernel: Killed process 27541 (bash) total-vm:8118236kB, anon-rss:8098448kB, file-rss:3264kB, shmem-rss:0kB

process 27541 exited after 1.9 seconds

Because it was killed by the kernel OOM killer, not by earlyoom.

Nice. I see you are blocking SIGTERM. How did you get bash to 7609 MiB of memory?

for i in {1..55555555}; do true; done

You can change the number (55555555) to your liking.

Both the broken error message "kill failed: Success" and the problem that programs blocking SIGTERM can delay earlyoom for 10 seconds should be fixed now. Thanks.