Azure / WALinuxAgent

Microsoft Azure Linux Guest Agent

Home Page:http://azure.microsoft.com/

[BUG] Run command script being killed during bicep deployment

dazinator opened this issue · comments

Describe the bug: A clear and concise description of what the bug is.

I am using Azure Bicep to deploy several "Run Commands" across a cluster of VMs, which are created in a prior step of the same template.

I am being careful to ensure that only a single Run Command is deployed to the same VM at a time.
I see issues where scripts fail to finish. I have added logging and checked the stdout and stderr logs for each run command, and I can see that the script terminates before executing the next line, even when that line is something as simple as an echo statement.

So I checked the handler log and I see entries like this for the impacted run commands:

time=2023-01-17T15:55:35Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=glusterRunCreateVolume seq=1 message="Execute with TimeoutInSeconds=1200"
time=2023-01-17T15:55:36Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=glusterRunCreateVolume seq=1 message="Timeout:signal: killed"
time=2023-01-17T15:55:36Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=glusterRunCreateVolume seq=1 event="failed to execute command" error="command terminated with exit status=-1" output=/var/lib/waagent/run-command-handler/download/glusterRunCreateVolume/1
time=2023-01-17T15:55:36Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=glusterRunCreateVolume seq=1 event="enable script failed"
time=2023-01-17T15:55:36Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=glusterRunCreateVolume seq=1 event="failed to handle" error="failed to execute command: command terminated with exit status=-1"
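
For anyone triaging a similar handler log, here is a minimal Python sketch that pulls out the entries reporting a kill signal. It is not part of the agent or the extension. It assumes each log entry sits on a single line in the `key=value` layout shown above, and the names `parse_logfmt` and `find_killed` are made up for illustration.

```python
import shlex

def parse_logfmt(line):
    """Split a key=value log line into a dict; quoted values may contain spaces."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields

def find_killed(lines):
    """Return (time, extensionName) for entries whose message reports a kill signal."""
    hits = []
    for line in lines:
        fields = parse_logfmt(line)
        if "signal: killed" in fields.get("message", ""):
            hits.append((fields.get("time"), fields.get("extensionName")))
    return hits

# Two entries copied from the handler log above, each joined onto one line.
sample = [
    'time=2023-01-17T15:55:35Z version=v1.3.2/git@6efb77e-clean operation=enable '
    'extensionName=glusterRunCreateVolume seq=1 message="Execute with TimeoutInSeconds=1200"',
    'time=2023-01-17T15:55:36Z version=v1.3.2/git@6efb77e-clean operation=enable '
    'extensionName=glusterRunCreateVolume seq=1 message="Timeout:signal: killed"',
]
print(find_killed(sample))
```

Running this over every run command's handler log gives a quick list of which commands were killed and when, which can then be compared against the timestamps in the system log.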

Note: this run command actually completed successfully the last time I ran this deployment, but a subsequent one failed. So it doesn't appear to be the script itself that causes the issue. It's more like the waagent is killing scripts, and exactly which scripts are affected varies from one deployment to the next.

The timeout for these run commands is 600 seconds, yet they are seemingly killed approximately 1 second into execution.

Note: Please add some context which would help us understand the problem better

  1. Section of the log where the error occurs: See above

  2. Serial console output: I checked stdout and stderr for the specific run command, but I can see only that the script is killed at a certain point rather than executing the next expected line. The exact point usually coincides with the script running something like a sudo docker pull command or another command. There is no error in stderr.

  3. Steps to reproduce the behavior.

Distro and WALinuxAgent details (please complete the following information):

  • Distro and Version: Ubuntu 20.04.5 LTS (GNU/Linux 5.15.0-1031-azure x86_64)
  • WALinuxAgent version:

/usr/sbin/waagent:27: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
WALinuxAgent-2.2.46 running on ubuntu 20.04
Python: 3.8.10
Goal state agent: 2.8.0.11

Additional context
I have a bicep deployment that deploys 3 VMs.
All 3 VMs have:

  • Custom Script Extension
  • Azure AD Extension

One of the VMs also has:

  • Network Watcher Extension

These extensions are installed by the template after the VMs are created, and the Run Commands being deployed depend on these resources completing first.

After the VMs are deployed (complete with VM extensions), I have about 5 different bicep modules, each of which runs a series of run commands on those VMs. For example, the first module installs docker on each VM. The second module creates a docker swarm cluster on the first VM, then joins that cluster from the other VMs. The modules run in sequence (not concurrently), and I am careful to ensure that only one run command runs at a time on the target VM before the next one starts.

Log file attached

It looks like there could be sensitive info in the /var/log/waagent.log file, so as a first pass I have pulled what I thought might be the relevant sections of it. Here, devopsDeploymentAgent was a run command, shown in the handler log above, which was killed:

2023-01-17T16:01:22.298092Z INFO ExtHandler ExtHandler Started extension in unit 'enable_69cfbef9-ec33-4b4c-85e8-0cd737a29cf7.scope'
2023-01-17T16:01:24.300601Z INFO ExtHandler [Microsoft.CPlat.Core.RunCommandHandlerLinux.devopsDeploymentAgent-1.3.2] Command: bin/run-command-shim enable
[stdout]
y: run-command-handler
Writing a placeholder status file indicating progress before forking: /devopsDeploymentAgent.1.status
+ nohup /var/lib/waagent/Microsoft.CPlat.Core.RunCommandHandlerLinux-1.3.2/bin/run-command-handler enable
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event=start
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event=pre-check
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="comparing seqnum" path=devopsDeploymentAgent.mrseq
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="seqnum saved" path=devopsDeploymentAgent.mrseq
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="reading configuration from /var/lib/waagent/Microsoft.CPlat.Core.RunCommandHandlerLinux-1.3.2/config/devopsDeploymentAgent.1.settings"
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="read configuration"
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="validating json schema"
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="json schema valid"
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="parsing configuration json"
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="parsed configuration json"
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="validating configuration logically"
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="validated configuration"
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="creating output directory" path=/var/lib/waagent/run-command-handler/download/devopsDeploymentAgent/1
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="created output directory"
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 scriptUri=
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="executing command" output=/var/lib/waagent/run-command-handler/download/devopsDeploymentAgent/1
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="prepare command" scriptFile=/var/lib/waagent/run-command-handler/download/devopsDeploymentAgent/1
[stderr]
Running scope as unit: enable_69cfbef9-ec33-4b4c-85e8-0cd737a29cf7.scope
2023-01-17T16:01:24.336409Z INFO ExtHandler [Microsoft.CPlat.Core.RunCommandHandlerLinux.dockerLogin-1.3.2] Target handler state: enabled [incarnation_31]


and this section at the end shows it thinks all the things finished successfully:

Running scope as unit: enable_1bdcea52-97e8-41b5-babc-629d51f04b0a.scope
2023-01-17T16:01:44.744694Z INFO ExtHandler ExtHandler ProcessExtensionsGoalState completed [incarnation_31 24534 ms]

2023-01-17T16:01:44.764356Z INFO ExtHandler ExtHandler Extension status: [('Microsoft.Azure.ActiveDirectory.AADSSHLoginForLinux', 'Ready'), ('Microsoft.Azure.Extensions.CustomScript', 'success'), ('Microsoft.Azure.NetworkWatcher.NetworkWatcherAgentLinux', 'Ready'), ('addHostsRunCommand', 'success'), ('devopsDeploymentAgent', 'success'), ('dockerLogin', 'success'), ('glusterRunBrickPrepareDirectory', 'success'), ('glusterRunCreateVolume', 'success'), ('glusterRunMountVolume', 'success'), ('glusterRunPeerProbe', 'success'), ('installDockerPluginGlusterFs', 'success'), ('swarmRunClusterImit', 'success')]
2023-01-17T16:01:44.764941Z INFO ExtHandler ExtHandler All extensions in the goal state have reached a terminal state: [('Microsoft.Azure.ActiveDirectory.AADSSHLoginForLinux', 'Ready'), ('Microsoft.Azure.Extensions.CustomScript', 'success'), ('Microsoft.Azure.NetworkWatcher.NetworkWatcherAgentLinux', 'Ready'), ('addHostsRunCommand', 'success'), ('devopsDeploymentAgent', 'success'), ('dockerLogin', 'success'), ('glusterRunBrickPrepareDirectory', 'success'), ('glusterRunCreateVolume', 'success'), ('glusterRunMountVolume', 'success'), ('glusterRunPeerProbe', 'success'), ('installDockerPluginGlusterFs', 'success'), ('swarmRunClusterImit', 'success')]
2023-01-17T16:03:26.678695Z INFO MonitorHandler ExtHandler Stopped tracking cgroup Microsoft.CPlat.Core.RunCommandHandlerLinux.addHostsRunCommand-1.3.2 [/sys/fs/cgroup/cpu,cpuacct/azure.slice/azure-vmextensions.slice/azure-vmextensions-Microsoft.CPlat.Core.RunCommandHandlerLinux.addHostsRunCommand_1.3.2.slice]
2023-01-17T16:03:26.678914Z INFO MonitorHandler ExtHandler Stopped tracking cgroup Microsoft.CPlat.Core.RunCommandHandlerLinux.devopsDeploymentAgent-1.3.2 [/sys/fs/cgroup/cpu,cpuacct/azure.slice/azure-vmextensions.slice/azure-vmextensions-Microsoft.CPlat.Core.RunCommandHandlerLinux.devopsDeploymentAgent_1.3.2.slice]
2023-01-17T16:03:26.678996Z INFO MonitorHandler ExtHandler Stopped tracking cgroup Microsoft.CPlat.Core.RunCommandHandlerLinux.dockerLogin-1.3.2 [/sys/fs/cgroup/cpu,cpuacct/azure.slice/azure-vmextensions.slice/azure-vmextensions-Microsoft.CPlat.Core.RunCommandHandlerLinux.dockerLogin_1.3.2.slice]
2023-01-17T16:03:26.679058Z INFO MonitorHandler ExtHandler Stopped tracking cgroup Microsoft.CPlat.Core.RunCommandHandlerLinux.glusterRunBrickPrepareDirectory-1.3.2 [/sys/fs/cgroup/cpu,cpuacct/azure.slice/azure-vmextensions.slice/azure-vmextensions-Microsoft.CPlat.Core.RunCommandHandlerLinux.glusterRunBrickPrepareDirectory_1.3.2.slice]
2023-01-17T16:03:26.679121Z INFO MonitorHandler ExtHandler Stopped tracking cgroup Microsoft.CPlat.Core.RunCommandHandlerLinux.glusterRunCreateVolume-1.3.2 [/sys/fs/cgroup/cpu,cpuacct/azure.slice/azure-vmextensions.slice/azure-vmextensions-Microsoft.CPlat.Core.RunCommandHandlerLinux.glusterRunCreateVolume_1.3.2.slice]
2023-01-17T16:03:26.679180Z INFO MonitorHandler ExtHandler Stopped tracking cgroup Microsoft.CPlat.Core.RunCommandHandlerLinux.glusterRunMountVolume-1.3.2 [/sys/fs/cgroup/cpu,cpuacct/azure.slice/azure-vmextensions.slice/azure-vmextensions-Microsoft.CPlat.Core.RunCommandHandlerLinux.glusterRunMountVolume_1.3.2.slice]
2023-01-17T16:03:26.679238Z INFO MonitorHandler ExtHandler Stopped tracking cgroup Microsoft.CPlat.Core.RunCommandHandlerLinux.glusterRunPeerProbe-1.3.2 [/sys/fs/cgroup/cpu,cpuacct/azure.slice/azure-vmextensions.slice/azure-vmextensions-Microsoft.CPlat.Core.RunCommandHandlerLinux.glusterRunPeerProbe_1.3.2.slice]
2023-01-17T16:03:26.679293Z INFO MonitorHandler ExtHandler Stopped tracking cgroup Microsoft.CPlat.Core.RunCommandHandlerLinux.installDockerPluginGlusterFs-1.3.2 [/sys/fs/cgroup/cpu,cpuacct/azure.slice/azure-vmextensions.slice/azure-vmextensions-Microsoft.CPlat.Core.RunCommandHandlerLinux.installDockerPluginGlusterFs_1.3.2.slice]
2023-01-17T16:03:26.679350Z INFO MonitorHandler ExtHandler Stopped tracking cgroup Microsoft.CPlat.Core.RunCommandHandlerLinux.swarmRunClusterImit-1.3.2 [/sys/fs/cgroup/cpu,cpuacct/azure.slice/azure-vmextensions.slice/azure-vmextensions-Microsoft.CPlat.Core.RunCommandHandlerLinux.swarmRunClusterImit_1.3.2.slice]
2023-01-17T16:03:26.679405Z INFO MonitorHandler ExtHandler Stopped tracking cgroup Microsoft.Azure.ActiveDirectory.AADSSHLoginForLinux-1.0.2081.1 [/sys/fs/cgroup/cpu,cpuacct/azure.slice/azure-vmextensions.slice/azure-vmextensions-Microsoft.Azure.ActiveDirectory.AADSSHLoginForLinux_1.0.2081.1.slice]
2023-01-17T16:03:26.679461Z INFO MonitorHandler ExtHandler Stopped tracking cgroup Microsoft.Azure.Extensions.CustomScript-2.1.7 [/sys/fs/cgroup/cpu,cpuacct/azure.slice/azure-vmextensions.slice/azure-vmextensions-Microsoft.Azure.Extensions.CustomScript_2.1.7.slice]
2023-01-17T16:03:33.257874Z INFO ExtHandler ExtHandler [HEARTBEAT] Agent WALinuxAgent-2.8.0.11 is running as the goal state agent [DEBUG HeartbeatCounter: 1;HeartbeatId: 52DDC11F-D9FB-4ABB-B019-60DE20A7997C;DroppedPackets: 0;UpdateGSErrors: 0;AutoUpdate: 1]
foo-i-replaced-this@swarm-mgr-01:~$ 
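
To compare what the agent reported with what the handler log says, a small Python sketch like the one below can turn an "Extension status:" line into a dictionary. This is only an illustration that assumes the list is printed as a Python-style literal on a single line, as it appears in this excerpt. `parse_extension_status` is a hypothetical helper, not an agent API.

```python
import ast

def parse_extension_status(line):
    """Return {extension name: status} from an 'Extension status: [...]' log line."""
    marker = "Extension status: "
    _, found, payload = line.partition(marker)
    if not found:
        return {}
    # The payload is a list of (name, status) tuples written as a Python literal.
    return dict(ast.literal_eval(payload.strip()))

# Shortened copy of the status line from the agent log above.
sample = (
    "2023-01-17T16:01:44.764356Z INFO ExtHandler ExtHandler Extension status: "
    "[('Microsoft.Azure.Extensions.CustomScript', 'success'), "
    "('devopsDeploymentAgent', 'success'), ('glusterRunCreateVolume', 'success')]"
)
status = parse_extension_status(sample)
not_ok = {name: s for name, s in status.items() if s not in ("success", "Ready")}
print(status["glusterRunCreateVolume"], not_ok)
```

In this excerpt every extension comes back as 'success' or 'Ready', including glusterRunCreateVolume, whose handler log recorded a kill.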

@dazinator The only instance in which the agent terminates extensions is the timeout that you mentioned, but this is not the case here. Your log shows that all extensions ran to completion and reported 'success' or 'Ready'.

The system log may have some information if your script was terminated, for example, by the OOM killer.
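
A rough Python sketch of that check, assuming the kernel logs OOM kills with a "Killed process PID (name)" message. The exact wording differs between kernel versions, so the pattern is an assumption, and the sample lines are invented rather than taken from the VM in this issue.

```python
import re

# Matches kernel lines such as "Out of memory: Killed process 1234 (name) ...".
# The wording varies across kernel versions; adjust the pattern if needed.
OOM_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")

def find_oom_kills(lines):
    """Return (pid, process name) for every line that looks like an OOM kill."""
    hits = []
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits

# Hypothetical sample lines for illustration only.
sample = [
    "Jan 17 15:55:36 host kernel: Out of memory: Killed process 4242 (example-proc) total-vm:1024kB",
    "Jan 17 15:55:37 host systemd[1]: Started Session 12 of user foo.",
]
print(find_oom_kills(sample))
```

On Ubuntu the input could be the lines of /var/log/syslog or the output of `journalctl -k`. An empty result only means this pattern did not match, not that nothing killed the script.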

@narrieta thanks. I have worked around the issue by splitting my script, which was doing a docker login, then a docker pull, then a docker run, into three separate run commands / scripts. Either the termination is no longer happening, or it's happening at the end of each smaller script, which at least allows the script to run.

I will check sys logs.

Is it odd that this "[stderr]" shows in the waagent log, yet it still reports the run command as completed successfully at the end? Or am I misinterpreting that, i.e. it's not encountering an error from my run command here, and perhaps this is just the start of the stderr output stream in the log in general?

time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="executing command" output=/var/lib/waagent/run-command-handler/download/devopsDeploymentAgent/1
time=2023-01-17T16:01:22Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=devopsDeploymentAgent seq=1 event="prepare command" scriptFile=/var/lib/waagent/run-command-handler/download/devopsDeploymentAgent/1
[stderr]

@dazinator "[stderr]" is just part of the formatting, but I see how that can be confusing. Would "[stderr]\n--no stderr--" help?

@narrieta
The logs indicate that waagent is detecting the kill, irrespective of how the script is being killed. Is it right that it then goes on to treat this as the script completing successfully? I would have thought that it should detect this case as a failure.

time=2023-01-17T15:55:36Z version=v1.3.2/git@6efb77e-clean operation=enable extensionName=glusterRunCreateVolume seq=1 message="Timeout:signal: killed"

@dazinator That message is coming from the extension. The agent log should show the exit code of the extension process. Yes, I would also expect the extension to report a timed-out command as an error. You could post your question in the extension's GitHub repo.
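
To illustrate the distinction being discussed, here is a small Python sketch of how a wrapper process can tell a killed child apart from a clean exit. It is not how the Run Command extension is implemented. It relies on POSIX behaviour, where Python's `subprocess` reports a child terminated by signal N as return code -N, and `describe_exit` is a made-up helper name.

```python
import signal
import subprocess
import sys

def describe_exit(returncode):
    """Classify a child's return code as success, signal kill, or ordinary failure."""
    if returncode == 0:
        return "success"
    if returncode < 0:
        # On POSIX, a negative return code means the child died from that signal.
        return "killed by " + signal.Signals(-returncode).name
    return "failed with exit code %d" % returncode

# Start a child that sends SIGKILL to itself, standing in for a script
# that gets killed partway through.
child = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
print(describe_exit(child.returncode))
```

A wrapper that checks the return code this way would report the killed script as a failure rather than a success, which is the behaviour expected above.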