Azure / WALinuxAgent

Microsoft Azure Linux Guest Agent

Home Page:http://azure.microsoft.com/

walinuxagent OOM killed

deepak7340 opened this issue

We are getting an oom-killer alert every hour.

  1. We are using version 2.2.46.
  2. No swap is configured on the server, and we don't want to create any.
  3. Plenty of free memory is available, but the OOM error still appears.
  4. Can we increase the memory limit set by the cgroup?

Distro

 waagent --version
import imp
WALinuxAgent-2.2.46 running on ubuntu 20.04
Python: 3.8.10
Goal state agent: 2.9.0.4

The kernel logs are:

# journalctl --list-boots |     awk '{ print $1 }' |     xargs -I{} journalctl --no-pager -b {} -kqg 'killed process' -o verbose --output-fields=MESSAGE

Thu 2023-02-02 11:49:13.015489 CET [s=1d8f87cf60064ba690c44490f14bb3e3;i=10823659;b=796d12a513054cf1bbde201dd1bca9a1;m=1f6726955fe3;t=5f3b550c118c1;x=e5347f6b24873f0c]
    MESSAGE=Memory cgroup out of memory: Killed process 2855014 (tail) total-vm:4296kB, anon-rss:60kB, file-rss:516kB, shmem-rss:0kB, UID:0 pgtables:40kB oom_score_adj:0
Thu 2023-02-02 12:49:25.123048 CET [s=1d8f87cf60064ba690c44490f14bb3e3;i=10832491;b=796d12a513054cf1bbde201dd1bca9a1;m=1f67fde1c30a;t=5f3b6280d7be8;x=54f065dd48ba7646]
    MESSAGE=Memory cgroup out of memory: Killed process 2867622 (python3) total-vm:101920kB, anon-rss:14452kB, file-rss:10080kB, shmem-rss:0kB, UID:0 pgtables:100kB oom_score_adj:0
Thu 2023-02-02 13:49:38.815836 CET [s=1d8f87cf60064ba690c44490f14bb3e3;i=108412dd;b=796d12a513054cf1bbde201dd1bca9a1;m=1f68d546567e;t=5f3b6ff720f5c;x=c78f55781b0a39b3]
    MESSAGE=Memory cgroup out of memory: Killed process 2889539 (python3) total-vm:101920kB, anon-rss:14464kB, file-rss:10184kB, shmem-rss:0kB, UID:0 pgtables:96kB oom_score_adj:0


# journalctl -t kernel | grep kill

Feb 02 12:49:25  kernel: tail invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=0
Feb 02 12:49:25 kernel:  oom_kill_process.cold+0xb/0x10
Feb 02 12:49:25 kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice,task_memcg=/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice,task=python3,pid=2867622,uid=0
Feb 02 13:49:38  kernel: tail invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=0
Feb 02 13:49:38  kernel:  oom_kill_process.cold+0xb/0x10
Feb 02 13:49:38  kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice,task_memcg=/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice,task=python3,pid=2889539,uid=0
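The `oom-kill:` lines above carry `constraint=CONSTRAINT_MEMCG` and name the log collector slice as `oom_memcg`, which points at a cgroup memory limit rather than a shortage of memory on the host. A minimal sketch for pulling those fields out of a journal line follows; the function name `parse_oom_kill` is hypothetical, and the comma split assumes no field value contains a comma (a multi-node `mems_allowed` or an unusual task name could break it).

```python
import re

# Matches the comma-separated key=value summary on kernel "oom-kill:" lines.
OOM_KILL_RE = re.compile(r"oom-kill:(?P<fields>\S+)")


def parse_oom_kill(line):
    """Return the key=value fields of an 'oom-kill:' line as a dict, or None."""
    match = OOM_KILL_RE.search(line)
    if match is None:
        return None
    fields = {}
    for pair in match.group("fields").split(","):
        # partition keeps everything after the first "=" as the value
        key, _, value = pair.partition("=")
        fields[key] = value
    return fields
```

Feeding it the output of `journalctl -t kernel` line by line and printing `fields["constraint"]` and `fields["oom_memcg"]` for each hit shows which cgroup limit triggered each kill.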

waagent.log

2023-02-02T12:14:22.623563Z INFO ExtHandler ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.9.0.4 is running as the goal state agent [DEBUG HeartbeatCounter: 5;HeartbeatId: C244CC71-FB11-491E-B33B-4E7E34222592;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpdate: 1]
2023-02-02T12:43:56.673953Z INFO ExtHandler ExtHandler No requested version specified, checking for all versions for agent update (family: Prod)
2023-02-02T12:44:27.166812Z INFO ExtHandler ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.9.0.4 is running as the goal state agent [DEBUG HeartbeatCounter: 6;HeartbeatId: C244CC71-FB11-491E-B33B-4E7E34222592;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpdate: 1]
2023-02-02T12:49:25.857366Z INFO CollectLogsHandler ExtHandler Starting log collection...
2023-02-02T12:49:34.599873Z INFO MainThread LogCollector Running log collector mode normal
2023-02-02T12:49:34.801452Z INFO MainThread LogCollector WireServer endpoint 168.63.129.16 read from file
2023-02-02T12:49:34.803517Z INFO MainThread LogCollector Wire server endpoint:168.63.129.16
2023-02-02T12:49:34.899812Z INFO MainThread LogCollector Forcing an update of the goal state.
2023-02-02T12:49:35.003941Z INFO MainThread Fetched a new incarnation for the WireServer goal state [incarnation 41]
2023-02-02T12:49:35.200625Z INFO MainThread LogCollector HostGAPlugin version: 1.0.8.136
2023-02-02T12:49:35.203481Z INFO MainThread
2023-02-02T12:49:35.299911Z INFO MainThread Fetched new vmSettings [HostGAPlugin correlation ID: 864e43ae-1cda-4754-ae1b-c4200d9a30c2 eTag: 10629567585258432043 source: Fabric]
2023-02-02T12:49:35.303669Z INFO MainThread The vmSettings originated via Fabric; will ignore them.
2023-02-02T12:49:35.404418Z INFO MainThread
2023-02-02T12:49:35.406345Z INFO MainThread Fetching full goal state from the WireServer [incarnation 41]
2023-02-02T12:49:36.505448Z INFO MainThread Downloaded certificate {'thumbprint': 'B848D8F49034A3BD567DB4795F1D2A185250BC42', 'hasPrivateKey': True}
2023-02-02T12:49:36.604358Z INFO MainThread Fetch goal state completed
2023-02-02T12:49:36.804138Z INFO LogCollectorMonitorHandler LogCollector Could not find swap counter from "memory.stat" file in the cgroup: /sys/fs/cgroup/memory/azure.slice/azure-walinuxagent.slice/azure-walinuxagent-logcollector.slice. Internal error: Cannot find counter: swap
2023-02-02T12:49:39.299757Z INFO CollectLogsHandler ExtHandler Log Collector exited with code -9
2023-02-02T13:14:31.429988Z INFO ExtHandler ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.9.0.4 is running as the goal state agent [DEBUG HeartbeatCounter: 7;HeartbeatId: C244CC71-FB11-491E-B33B-4E7E34222592;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpdate: 1]
2023-02-02T13:44:01.372356Z INFO ExtHandler ExtHandler No requested version specified, checking for all versions for agent update (family: Prod)
2023-02-02T13:44:31.598334Z INFO ExtHandler ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.9.0.4 is running as the goal state agent [DEBUG HeartbeatCounter: 8;HeartbeatId: C244CC71-FB11-491E-B33B-4E7E34222592;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpdate: 1]
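The `LogCollectorMonitorHandler` line above gives the log collector's cgroup directory under `/sys/fs/cgroup/memory`, which suggests a cgroup v1 memory hierarchy. As a rough sketch under that assumption, the limit currently in force and the peak usage can be read from the standard v1 memory controller files in that directory. The helper name `read_memcg_v1` is hypothetical, and the file names would differ on cgroup v2.

```python
from pathlib import Path

# cgroup v1 memory controller files; cgroup v2 uses different names.
V1_FILES = ("memory.limit_in_bytes", "memory.max_usage_in_bytes", "memory.failcnt")


def read_memcg_v1(cgroup_dir):
    """Return {file name: int value} for the v1 memory files present in cgroup_dir."""
    values = {}
    for name in V1_FILES:
        path = Path(cgroup_dir) / name
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values
```

Calling it with the directory from the log line and comparing `memory.max_usage_in_bytes` against `memory.limit_in_bytes` shows how close the collector gets to its limit; a growing `memory.failcnt` means the limit has been hit.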

@deepak7340 Thanks for reporting the issue; we will work on a fix. As a mitigation, you can change the following settings.

-> Edit the file /lib/systemd/system/azure-walinuxagent-logcollector.slice; its content will look something like this:

# cat /lib/systemd/system/azure-walinuxagent-logcollector.slice
[Unit]
Description=Slice for Azure VM Agent Periodic Log Collector
DefaultDependencies=no
Before=slices.target
[Slice]
CPUAccounting=yes
CPUQuota=5%
MemoryAccounting=yes
MemoryLimit=30M

-> Remove the MemoryLimit=30M entry from the file and save it.
-> Run systemctl daemon-reload
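For anyone applying the steps above on several machines, they could be scripted roughly as follows. This is only a sketch of the manual edit, not an official tool: the function names are hypothetical, it assumes the slice file lives at the path shown above and contains a plain `MemoryLimit=` line, and it must run as root.

```python
import subprocess
from pathlib import Path

# Path taken from the mitigation steps above.
SLICE_PATH = "/lib/systemd/system/azure-walinuxagent-logcollector.slice"


def drop_memory_limit(unit_text):
    """Return unit_text without any MemoryLimit= line; other lines are untouched."""
    kept = [
        line
        for line in unit_text.splitlines(keepends=True)
        if not line.strip().startswith("MemoryLimit=")
    ]
    return "".join(kept)


def apply_mitigation(path=SLICE_PATH):
    """Rewrite the slice file without MemoryLimit= and reload systemd (needs root)."""
    unit = Path(path)
    unit.write_text(drop_memory_limit(unit.read_text()))
    subprocess.run(["systemctl", "daemon-reload"], check=True)
```

Keeping the text transformation in `drop_memory_limit` separate from the file write makes it easy to check the result on a copy of the file before calling `apply_mitigation()`.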

Hi guys

Thanks @deepak7340 for raising this issue. I am currently working with a customer who is also facing this error in their environment when running version 2.9.0.4.

@nagworld9 In my case, the agent is part of an AKS cluster so we can't (shouldn't really) adjust the slice manually - as such, it would be highly beneficial to have this issue resolved upstream.

Would you be able to provide an ETA for the fix?

We made a fix in #2757; it will take a while to reach production.

Closing, since this was fixed by #2757.