walinuxagent OOM killed
deepak7340 opened this issue
We are getting the oom-killer alert every hour.
- The version we are using is 2.2.46.
- There is no swap configured on the server, and we don't want to create any.
- There is plenty of free memory available, but the OOM error still shows up.
- Can we increase the memory limit via cgroups? (See the quick check below.)
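For context, the limit in question comes from the agent's systemd slice unit, not from system-wide memory. A quick way to see the effective value (a hedged example; systemctl show is standard on systemd distros, and the slice name is taken from the kernel logs below):
# systemctl show azure-walinuxagent-logcollector.slice -p MemoryLimit
MemoryLimit=31457280
(31457280 bytes is 30M, matching the default unit file shown in the reply further down.)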
Distro
# waagent --version
import imp
WALinuxAgent-2.2.46 running on ubuntu 20.04
Python: 3.8.10
Goal state agent: 2.9.0.4
The kernel logs are:
# journalctl --list-boots | awk '{ print $1 }' | xargs -I{} journalctl --no-pager -b {} -kqg 'killed process' -o verbose --output-fields=MESSAGE
Thu 2023-02-02 11:49:13.015489 CET [s=1d8f87cf60064ba690c44490f14bb3e3;i=10823659;b=796d12a513054cf1bbde201dd1bca9a1;m=1f6726955fe3;t=5f3b550c118c1;x=e5347f6b24873f0c]
MESSAGE=Memory cgroup out of memory: Killed process 2855014 (tail) total-vm:4296kB, anon-rss:60kB, file-rss:516kB, shmem-rss:0kB, UID:0 pgtables:40kB oom_score_adj:0
Thu 2023-02-02 12:49:25.123048 CET [s=1d8f87cf60064ba690c44490f14bb3e3;i=10832491;b=796d12a513054cf1bbde201dd1bca9a1;m=1f67fde1c30a;t=5f3b6280d7be8;x=54f065dd48ba7646]
MESSAGE=Memory cgroup out of memory: Killed process 2867622 (python3) total-vm:101920kB, anon-rss:14452kB, file-rss:10080kB, shmem-rss:0kB, UID:0 pgtables:100kB oom_score_adj:0
Thu 2023-02-02 13:49:38.815836 CET [s=1d8f87cf60064ba690c44490f14bb3e3;i=108412dd;b=796d12a513054cf1bbde201dd1bca9a1;m=1f68d546567e;t=5f3b6ff720f5c;x=c78f55781b0a39b3]
MESSAGE=Memory cgroup out of memory: Killed process 2889539 (python3) total-vm:101920kB, anon-rss:14464kB, file-rss:10184kB, shmem-rss:0kB, UID:0 pgtables:96kB oom_score_adj:0
# journalctl -t kernel | grep kill
Feb 02 12:49:25 kernel: tail invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=0
Feb 02 12:49:25 kernel: oom_kill_process.cold+0xb/0x10
Feb 02 12:49:25 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice,task_memcg=/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice,task=python3,pid=2867622,uid=0
Feb 02 13:49:38 kernel: tail invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=0
Feb 02 13:49:38 kernel: oom_kill_process.cold+0xb/0x10
Feb 02 13:49:38 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice,task_memcg=/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice,task=python3,pid=2889539,uid=0
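The CONSTRAINT_MEMCG and oom_memcg fields above show that the kills are enforced by the log collector slice's own cgroup limit, not by system-wide memory pressure, which is why plenty of free host memory doesn't help. Under cgroup v1 (the hierarchy used in the paths above), the slice's limit and peak usage can be read directly; the path comes from oom_memcg, and the file names are the standard v1 memory controller interface:
# CG=/sys/fs/cgroup/memory/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice
# cat $CG/memory.limit_in_bytes      # configured limit
# cat $CG/memory.max_usage_in_bytes  # peak usage since the cgroup was created
# cat $CG/memory.failcnt             # number of times the limit was hit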
waagent.log
2023-02-02T12:14:22.623563Z INFO ExtHandler ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.9.0.4 is running as the goal state agent [DEBUG HeartbeatCounter: 5;HeartbeatId: C244CC71-FB11-491E-B33B-4E7E34222592;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpdate: 1]
2023-02-02T12:43:56.673953Z INFO ExtHandler ExtHandler No requested version specified, checking for all versions for agent update (family: Prod)
2023-02-02T12:44:27.166812Z INFO ExtHandler ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.9.0.4 is running as the goal state agent [DEBUG HeartbeatCounter: 6;HeartbeatId: C244CC71-FB11-491E-B33B-4E7E34222592;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpdate: 1]
2023-02-02T12:49:25.857366Z INFO CollectLogsHandler ExtHandler Starting log collection...
2023-02-02T12:49:34.599873Z INFO MainThread LogCollector Running log collector mode normal
2023-02-02T12:49:34.801452Z INFO MainThread LogCollector WireServer endpoint 168.63.129.16 read from file
2023-02-02T12:49:34.803517Z INFO MainThread LogCollector Wire server endpoint:168.63.129.16
2023-02-02T12:49:34.899812Z INFO MainThread LogCollector Forcing an update of the goal state.
2023-02-02T12:49:35.003941Z INFO MainThread Fetched a new incarnation for the WireServer goal state [incarnation 41]
2023-02-02T12:49:35.200625Z INFO MainThread LogCollector HostGAPlugin version: 1.0.8.136
2023-02-02T12:49:35.203481Z INFO MainThread
2023-02-02T12:49:35.299911Z INFO MainThread Fetched new vmSettings [HostGAPlugin correlation ID: 864e43ae-1cda-4754-ae1b-c4200d9a30c2 eTag: 10629567585258432043 source: Fabric]
2023-02-02T12:49:35.303669Z INFO MainThread The vmSettings originated via Fabric; will ignore them.
2023-02-02T12:49:35.404418Z INFO MainThread
2023-02-02T12:49:35.406345Z INFO MainThread Fetching full goal state from the WireServer [incarnation 41]
2023-02-02T12:49:36.505448Z INFO MainThread Downloaded certificate {'thumbprint': 'B848D8F49034A3BD567DB4795F1D2A185250BC42', 'hasPrivateKey': True}
2023-02-02T12:49:36.604358Z INFO MainThread Fetch goal state completed
2023-02-02T12:49:36.804138Z INFO LogCollectorMonitorHandler LogCollector Could not find swap counter from "memory.stat" file in the cgroup: /sys/fs/cgroup/memory/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice. Internal error: Cannot find counter: swap
2023-02-02T12:49:39.299757Z INFO CollectLogsHandler ExtHandler Log Collector exited with code -9
2023-02-02T13:14:31.429988Z INFO ExtHandler ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.9.0.4 is running as the goal state agent [DEBUG HeartbeatCounter: 7;HeartbeatId: C244CC71-FB11-491E-B33B-4E7E34222592;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpdate: 1]
2023-02-02T13:44:01.372356Z INFO ExtHandler ExtHandler No requested version specified, checking for all versions for agent update (family: Prod)
2023-02-02T13:44:31.598334Z INFO ExtHandler ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.9.0.4 is running as the goal state agent [DEBUG HeartbeatCounter: 8;HeartbeatId: C244CC71-FB11-491E-B33B-4E7E34222592;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpdate: 1]
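A note on "Log Collector exited with code -9" above: a negative exit code means the process was terminated by that signal, and signal 9 is SIGKILL, which is what the memcg OOM killer delivers; this matches the "Killed process ... (python3)" kernel messages. The signal name can be confirmed with:
# kill -l 9
KILL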
@deepak7340 Thanks for reporting the issue; we will work on a fix. As a mitigation, you can change the following settings:
-> Edit the file /lib/systemd/system/azure-walinuxagent-logcollector.slice; its contents will look something like this:
# cat /lib/systemd/system/azure-walinuxagent-logcollector.slice
[Unit]
Description=Slice for Azure VM Agent Periodic Log Collector
DefaultDependencies=no
Before=slices.target
[Slice]
CPUAccounting=yes
CPUQuota=5%
MemoryAccounting=yes
MemoryLimit=30M
-> Remove the MemoryLimit=30M entry from the file and save it.
-> Run systemctl daemon-reload
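After the reload, you can verify that the limit was lifted (a quick check; systemd reports the value as "infinity" when no MemoryLimit is set):
# systemctl show azure-walinuxagent-logcollector.slice -p MemoryLimit
MemoryLimit=infinity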
Hi guys
Thanks @deepak7340 for raising this issue; I am currently working with a customer who is also facing this error in their environment when running version 2.9.0.4.
@nagworld9 In my case, the agent is part of an AKS cluster, so we can't (and really shouldn't) adjust the slice manually; as such, it would be highly beneficial to have this issue resolved upstream.
Would you be able to provide an ETA for the fix?
We made a fix in #2757; it will take a while to reach production.
closing, since this was fixed by #2757