Azure / WALinuxAgent

Microsoft Azure Linux Guest Agent

Home Page: http://azure.microsoft.com/

[BUG][RHEL7.9] walinuxagent hang

Klaas- opened this issue

Describe the bug:

  1. There are no logs; the process simply hangs in an uninterruptible state. I tried to gcore it, but even that hangs (see the sketch after this list).
  2. Last sign of life in journal/waagent.log: INFO ExtHandler ExtHandler All extensions in the goal state have reached a terminal state: [(u'Microsoft.Azure.ActiveDirectory.AADSSHLoginForLinux', u'Ready'), (u'Microsoft.Azure.Monitoring.DependencyAgent.DependencyAgentLinux', u'Ready'), (u'Microsoft.Azure.RecoveryServices.VMSnapshotLinux', u'success'), (u'Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux', u'success')]
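
A minimal sketch of how the uninterruptible state can be confirmed (the PIDs are placeholders; substitute the daemon and extension-handler PIDs from ps): a process stuck in uninterruptible sleep shows state "D" and a kernel wait channel, which would also explain why ptrace-based tools such as gcore hang when they try to attach.

    # STAT "D" = uninterruptible sleep; WCHAN = kernel function the process is blocked in
    ps -o pid,stat,wchan:32,cmd -p <daemon_pid>,<exthandler_pid>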

Distro and WALinuxAgent details:

  • Distro and Version: Red Hat Enterprise Linux Server release 7.9 (Maipo)
  • WALinuxAgent version:
WALinuxAgent-2.3.0.2 running on redhat 7.9
Python: 2.7.5
Goal state agent: 2.8.0.11

Additional context
I have opened a support case: 2209290050001897

Log file attached
Last signs of life:

2022-09-25T18:29:12.141284Z INFO ExtHandler ExtHandler Started extension in unit 'enable_id.scope'
2022-09-25T18:29:12.142212Z INFO ExtHandler ExtHandler Started tracking cgroup Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.14.19 [/sys/fs/cgroup/cpu,cpuacct/azure.slice/azure-vmextensions.slice/azure-vmextensions-Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux_1.14.19.slice]
2022/09/25 20:29:12 OmsAgentForLinux started to handle.
2022/09/25 20:29:12 [Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.14.19] cwd is /var/lib/waagent/Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.14.19
2022/09/25 20:29:12 [Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.14.19] Change log file to /var/log/azure/Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux/extension.log
2022-09-25T18:29:20.150412Z INFO ExtHandler [Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.14.19] Command: omsagent_shim.sh -enable
[stdout]
2022/09/25 20:29:18 [Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.14.19] Enable,success,0,Enable succeeded
[stderr]
Running scope as unit enable_id.scope.
2022-09-25T18:29:20.219074Z INFO ExtHandler [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9188.0] Target handler state: enabled [incarnation_177]
2022-09-25T18:29:20.219954Z INFO ExtHandler [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9188.0] [Enable] current handler state is: enabled
2022-09-25T18:29:20.220087Z INFO ExtHandler [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9188.0] Update settings file: 586.settings
2022-09-25T18:29:20.222457Z INFO ExtHandler [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9188.0] Requested extension state: enabled
2022-09-25T18:29:20.222728Z INFO ExtHandler [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9188.0] Enable extension: [main/handle.sh enable]
2022-09-25T18:29:20.223275Z INFO ExtHandler [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9188.0] Executing command: /var/lib/waagent/Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9188.0/main/handle.sh enable with environment variables: {"AZURE_GUEST_AGENT_UNINSTALL_CMD_EXIT_CODE": "NOT_RUN", "AZURE_GUEST_AGENT_EXTENSION_VERSION": "1.0.9188.0", "AZURE_GUEST_AGENT_EXTENSION_SUPPORTED_FEATURES": "[{\"Value\": \"1.0\", \"Key\": \"ExtensionTelemetryPipeline\"}]", "AZURE_GUEST_AGENT_EXTENSION_PATH": "/var/lib/waagent/Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9188.0", "ConfigSequenceNumber": "586", "AZURE_GUEST_AGENT_WIRE_PROTOCOL_ADDRESS": "168.63.129.16"}
2022-09-25T18:29:20.231953Z INFO ExtHandler ExtHandler Started extension in unit 'enable_id2.scope'
2022-09-25T18:29:20.233279Z INFO ExtHandler ExtHandler Started tracking cgroup Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9188.0 [/sys/fs/cgroup/cpu,cpuacct/azure.slice/azure-vmextensions.slice/azure-vmextensions-Microsoft.Azure.RecoveryServices.VMSnapshotLinux_1.0.9188.0.slice]
2022-09-25T18:29:26.239037Z INFO ExtHandler [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0.9188.0] Command: main/handle.sh enable
[stdout]
2022/09/25 20:29:24 [Microsoft.Azure.RecoveryServices.VMSnapshotLinux-1.0][{"status": {"status": "success", "code": "4", "snapshotInfo": null, "name": "Microsoft.Azure.RecoveryServices.VMSnapshotLinux", .... ]
[stderr]
Running scope as unit enable_id2.scope.
2022-09-25T18:29:26.250384Z INFO ExtHandler ExtHandler ProcessExtensionsGoalState completed [incarnation_177 18460 ms]

2022-09-25T18:29:26.272345Z INFO ExtHandler ExtHandler Extension status: [(u'Microsoft.Azure.ActiveDirectory.AADSSHLoginForLinux', u'Ready'), (u'Microsoft.Azure.Monitoring.DependencyAgent.DependencyAgentLinux', u'Ready'), (u'Microsoft.Azure.RecoveryServices.VMSnapshotLinux', u'success'), (u'Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux', u'success')]
2022-09-25T18:29:26.273226Z INFO ExtHandler ExtHandler All extensions in the goal state have reached a terminal state: [(u'Microsoft.Azure.ActiveDirectory.AADSSHLoginForLinux', u'Ready'), (u'Microsoft.Azure.Monitoring.DependencyAgent.DependencyAgentLinux', u'Ready'), (u'Microsoft.Azure.RecoveryServices.VMSnapshotLinux', u'success'), (u'Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux', u'success')]

Full log attached to Microsoft Support case

@Klaas- What is the URN for the Azure Marketplace image you are using? Thanks.

Hi @narrieta ,

I have one VM that has seen this behaviour at least twice and one where it happened once.

            "imageReference": {
                "id": "",
                "offer": "RHEL",
                "publisher": "RedHat",
                "sku": "7-LVM",
                "version": "latest"

is what metadata tells me.

I can tell you they were installed on 20210218 and on 20210412, so it should be the then-current RHEL LVM image.

Greetings
Klaas

And this is what happens when I restart it via systemctl:

Oct 06 09:30:43 hostname2 systemd[1]: Stopping Azure Linux Agent...
Oct 06 09:32:14 hostname2 systemd[1]: waagent.service stop-sigterm timed out. Killing.
Oct 06 09:33:44 hostname2 systemd[1]: waagent.service still around after SIGKILL. Ignoring.
Oct 06 09:35:14 hostname2 systemd[1]: waagent.service stop-final-sigterm timed out. Killing.
Oct 06 09:36:44 hostname2 systemd[1]: waagent.service still around after final SIGKILL. Entering failed mode.
Oct 06 09:36:44 hostname2 systemd[1]: Stopped Azure Linux Agent.
Oct 06 09:36:44 hostname2 systemd[1]: Unit waagent.service entered failed state.
Oct 06 09:36:44 hostname2 systemd[1]: waagent.service failed.
Oct 06 09:36:44 hostname2 systemd[1]: Started Azure Linux Agent.
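
For context on this sequence: systemd escalates from SIGTERM to SIGKILL, and then repeats a final SIGTERM/SIGKILL pass, waiting TimeoutStopSec between steps; a process that is still around after SIGKILL is stuck in uninterruptible kernel code, which matches the hang described above. A small sketch for inspecting the unit's stop/kill settings with plain systemctl (nothing agent-specific):

    # how long systemd waits before escalating, and how it kills the service
    systemctl show waagent.service -p KillMode -p TimeoutStopUSec -p SendSIGKILL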

@Klaas- Thanks for the info. I created a VM using RedHat:RHEL:7-LVM:latest and I did not see the hang. The service status is running, and when I tried to execute an extension everything worked just fine.

● waagent.service - Azure Linux Agent
   Loaded: loaded (/usr/lib/systemd/system/waagent.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/waagent.service.d

           └─10-Slice.conf, 11-CPUAccounting.conf, 12-CPUQuota.conf
   Active: active (running) since Fri 2022-10-07 04:00:20 UTC; 11min ago
 Main PID: 1521 (python)
   CGroup: /azure.slice/waagent.service
           ├─1521 /usr/bin/python -u /usr/sbin/waagent -daemon
           └─2693 python -u bin/WALinuxAgent-2.8.0.11-py2.7.egg -run-exthandlers

If you can get a core dump or stack trace of the hang, we may be able to help you.

@narrieta I tried to take a core dump of the processes, but gdb itself hangs, so I am unsure how to proceed with generating a core dump:

root     20334  0.0  0.2 160828  9944 ?        S    Sep29   0:00 gdb --nx --batch -ex set pagination off -ex set height 0 -ex set width 0 -ex attach 2125 -ex gcore daemon.core.2125 -ex detach -ex quit
root     21167  0.0  0.2 160828  9944 ?        S    Sep29   0:00 gdb --nx --batch -ex set pagination off -ex set height 0 -ex set width 0 -ex attach 2508 -ex gcore 2.8.0.11.core.2508 -ex detach -ex quit

So I can only provide the stacks from /proc:

# cat /proc/2125/stack
[<ffffffff8e6a1366>] do_wait+0x1f6/0x260
[<ffffffff8e6a2580>] SyS_wait4+0x80/0x110
[<ffffffff8ed99f92>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
# cat /proc/2508/stack
[<ffffffff8ed90516>] retint_careful+0x14/0x32
[<ffffffffffffffff>] 0xffffffffffffffff
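
One possible way to still get a Python-level stack when a ptrace attach hangs (a suggestion, not something the agent ships with): py-spy can read the interpreter state of a running CPython process, and its --nonblocking mode samples the process without stopping it. A minimal sketch, assuming py-spy can be installed on the VM, using the extension-handler PID from above:

    # install py-spy; its own Python version does not need to match the target's
    pip install py-spy
    # dump the Python stack of PID 2508 without pausing or attaching to it
    py-spy dump --pid 2508 --nonblocking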

Also, I can't reproduce this behavior: I have 100+ RHEL7 VMs and only two have shown this, but one of them twice. So I am not surprised you didn't get an error with the VM you tried.

Workaround:
echo "-1" > /sys/fs/cgroup/cpu,cpuacct/azure.slice/waagent.service/cpu.cfs_quota_us
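
For context on the workaround: writing -1 to cpu.cfs_quota_us removes the CFS CPU-bandwidth limit on the waagent.service cgroup (the quota presumably applied by the 12-CPUQuota.conf drop-in shown in the status output above), so the agent is no longer throttled. The echo does not persist across a reboot; a sketch of a persistent alternative using a standard systemd drop-in, assuming you simply want to clear the quota (the file name 99-no-cpuquota.conf is arbitrary; it only needs to sort after 12-CPUQuota.conf). This is a mitigation sketch, not a fix for the underlying hang:

    # /etc/systemd/system/waagent.service.d/99-no-cpuquota.conf
    [Service]
    # an empty assignment unsets any CPUQuota applied by earlier drop-ins
    CPUQuota=

    systemctl daemon-reload
    systemctl restart waagent.service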