Azure / WALinuxAgent

Microsoft Azure Linux Guest Agent

Home Page: http://azure.microsoft.com/

[BUG] Agent breaks when deploying template with VM and runCommand

Hardstl opened this issue.

When using a Bicep template to deploy an Ubuntu VM that also runs a runCommand in the same template after the VM has been deployed, the agent breaks completely and doesn't seem able to repair itself. It's as if the agent never finishes setting up in the first place. If the runCommand resource is removed from the template, the agent is installed properly, and deploying the same template with runCommand after the VM has been provisioned works as it should. Restarting the walinuxagent service or the VM doesn't help either.

waagent --version
WALinuxAgent-2.2.45 running on ubuntu 18.04
Python: 3.6.9
Goal state agent: 2.2.45

The log file shows the agent retrying the connection, the same sequence over and over:

2022/02/28 21:14:02.183944 INFO Daemon WireServer endpoint is not found. Rerun dhcp handler
2022/02/28 21:14:02.184409 INFO Daemon Test for route to 168.63.129.16
2022/02/28 21:14:02.187106 INFO Daemon Route to 168.63.129.16 exists
2022/02/28 21:14:02.187485 INFO Daemon Wire server endpoint:168.63.129.16
2022/02/28 21:14:02.200153 INFO Daemon Fabric preferred wire protocol version:2015-04-05
2022/02/28 21:14:02.200873 INFO Daemon Wire protocol version:2012-11-30
2022/02/28 21:14:02.201970 INFO Daemon Server preferred version:2015-04-05
2022/02/28 21:14:02.362426 INFO Daemon Found private key matching thumbprint 93D0A3DD1C418BEC2457ACF4412128541581F3E5
2022/02/28 21:14:02.445277 INFO Daemon Found private key matching thumbprint 93D0A3DD1C418BEC2457ACF4412128541581F3E5
2022/02/28 21:14:02.535893 INFO Daemon Found private key matching thumbprint 93D0A3DD1C418BEC2457ACF4412128541581F3E5
2022/02/28 21:14:02.549331 ERROR Daemon Exception processing goal state, giving up: [the JSON object must be str, bytes or bytearray, not 'NoneType']
2022/02/28 21:14:02.556208 INFO Daemon WireServer is not responding. Reset endpoint
2022/02/28 21:14:02.560111 INFO Daemon Protocol endpoint not found: WireProtocol, [ProtocolError] Exceeded max retry updating goal state
2022/02/28 21:14:02.569141 INFO Daemon Protocol endpoint not found: MetadataProtocol, [ProtocolError] 404 - GET: http://169.254.169.254/Microsoft.Compute/identity?api-version=2015-05-01-preview
2022/02/28 21:14:02.577300 INFO Daemon Retry detect protocols: retry=62

Steps to reproduce.

  1. Deploy VM and runCommand in the same template.
  2. The deployment gets stuck forever, runCommand is never executed, and the agent breaks.

Template used.

param location string = resourceGroup().location
param vmName string = 'VmTestRunCmd2'
param adminUsername string = 'hadmin'
@secure()
param adminPassword string
param vnetRGName string = 'a-natfw-rg'
param vnetName string = 'a-natfw-vnet'
param subnetBackend string = 'snet-pls'

resource vnet 'Microsoft.Network/virtualNetworks@2021-05-01' existing = {
  name: vnetName
  scope: resourceGroup(vnetRGName)
}

resource networkInterface 'Microsoft.Network/networkInterfaces@2020-11-01' = {
  name: '${vmName}-nic'
  location: location
  properties: {
    ipConfigurations: [
      {
        name: 'ipconfig1'
        properties: {
          privateIPAllocationMethod: 'Dynamic'
          subnet: {
            id: '${vnet.id}/subnets/${subnetBackend}'
          }
        }
      }
    ]
  }
}

resource virtualMachine 'Microsoft.Compute/virtualMachines@2020-12-01' = {
  name: vmName
  location: location
  properties: {
    hardwareProfile: {
      vmSize: 'Standard_A2_v2'
    }
    osProfile: {
      computerName: vmName
      adminUsername: adminUsername
      adminPassword: adminPassword
    }
    storageProfile: {
      imageReference: {
        publisher: 'Canonical'
        offer: 'UbuntuServer'
        sku: '18.04-LTS'
        version: 'latest'
      }
      osDisk: {
        name: '${vmName}-OSDisk'
        caching: 'ReadWrite'
        createOption: 'FromImage'
      }
    }
    networkProfile: {
      networkInterfaces: [
        {
          id: networkInterface.id
        }
      ]
    }
    diagnosticsProfile: {
      bootDiagnostics: {
        enabled: true
      }
    }
  }
}

resource runCommand 'Microsoft.Compute/virtualMachines/runCommands@2021-07-01' = {
  name: '${virtualMachine.name}/runCommandnow'
  location: location
  properties: {
    asyncExecution: false
    errorBlobUri: 'https://anatfwlogs.blob.core.windows.net/error/error.txt'
    outputBlobUri: 'https://anatfwlogs.blob.core.windows.net/error/output.txt'
    source: {
      script: 'touch /home/hadmin/test.txt'
    }
    timeoutInSeconds: 120
  }
}

(Screenshot: VM agent not ready)

(Screenshot: VM properties)

@Hardstl This was fixed some time ago. The agent version on your VM (2.2.45) is pretty old. I'd recommend enabling agent auto-update to get the latest version (see AutoUpdate.Enabled at https://github.com/Azure/WALinuxAgent).
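To check what a VM is actually configured to do, the setting can be read from /etc/waagent.conf. The sketch below assumes the file uses simple Key=Value lines with # comments and that a value of y means enabled; it approximates the agent's own parsing rather than reproducing it, and it leaves the fallback for a missing key to the caller instead of assuming what default a given agent version uses.

```python
from pathlib import Path
from typing import Dict


def parse_waagent_conf(text: str) -> Dict[str, str]:
    """Parse waagent.conf-style text into a dict of Key -> Value strings.

    Approximation (assumption): blank lines and lines starting with '#'
    are ignored, each remaining line is split on the first '=', and
    surrounding spaces and double quotes are stripped from the value.
    """
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a Key=Value line
        values[key.strip()] = value.strip().strip('"').strip()
    return values


def autoupdate_enabled(conf: Dict[str, str], default: bool) -> bool:
    """True if AutoUpdate.Enabled is 'y' (case-insensitive); `default` if the key is absent."""
    value = conf.get("AutoUpdate.Enabled")
    if value is None:
        return default
    return value.lower() == "y"


def read_conf_file(path: str = "/etc/waagent.conf") -> Dict[str, str]:
    """Read and parse the agent configuration file at the path used in this thread."""
    return parse_waagent_conf(Path(path).read_text())
```

On the VM, autoupdate_enabled(read_conf_file(), default=...) reports the configured value; pick the default you want to assume when the key is missing.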

@narrieta thanks for the information. Is it possible to upgrade the agent to the latest version for the image on the marketplace?

I will try using the Ubuntu 20.04 image for some tests.

@Hardstl The images in the marketplace are updated by their publishers, and they normally do it regularly. In the case of the Ubuntu base image, the update would need to come from Canonical. If you enable AutoUpdate, you would get the latest agent Azure publishes.

@narrieta I believe AutoUpdate.Enabled is set to true by default, but please correct me if I'm wrong. However, the VM is never able to update the agent, as it breaks down before that happens.

@Hardstl Your output of waagent --version indicates the agent is not auto-updating, but regardless of that, you are right: this message is coming from the agent in the image. This makes the issue harder to debug. One thing you could do is create a custom image with verbose logging enabled, or add more information (a stack trace) to that error message. If this is something you want to try, I can post some info on how to do that.

@Hardstl I can try some queries on the backend that may provide more info about the issue. For that I'd need you to post the resource group name, vm name, and time of the failure. Do not post those if they are considered confidential.

@Hardstl Another thing you could try, if you enable the serial console on your VM, is to log in to the console, enable verbose logging, and restart the agent. If the issue persists, we could get more info from the log.
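Verbose logging is controlled by the Logs.Verbose setting in /etc/waagent.conf. The sketch below is a hypothetical helper, not part of the agent: it rewrites an existing uncommented Logs.Verbose line or appends one, and leaves commented-out lines alone. Editing the file by hand works just as well.

```python
import re


def set_conf_value(text: str, key: str, value: str) -> str:
    """Return waagent.conf text with `key` set to `value`.

    The first uncommented `key=...` line is rewritten in place; if there is
    none, the setting is appended. Lines starting with '#' are not matched.
    """
    # [ \t]* rather than \s* so the match never swallows preceding newlines.
    pattern = re.compile(r"^[ \t]*" + re.escape(key) + r"[ \t]*=.*$", re.MULTILINE)
    new_line = f"{key}={value}"
    if pattern.search(text):
        # A function replacement avoids backslash escapes in `value` being interpreted.
        return pattern.sub(lambda _match: new_line, text, count=1)
    if text and not text.endswith("\n"):
        text += "\n"
    return text + new_line + "\n"
```

After writing the result back to /etc/waagent.conf as root, restart the agent; on Ubuntu the service is walinuxagent, as the distro line in a later log in this thread shows.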

@narrieta I did some additional tests using 20.04 and got the same issue. I enabled verbose logging in /etc/waagent.conf and restarted the service. I've attached the log; the verbose output starts at around line 586.

@Hardstl Thank you, I downloaded your log. I see the request that is causing the issue. Can you hold this machine? I can post a command line to replicate that request in this VM tomorrow morning. There is something in the response that is causing the error.

@Hardstl Feel free to stop the agent on this VM; the verbose log may grow too long.

@Hardstl This would be the request (you need to issue it as root)

wget \
    --header='x-ms-agent-name: WALinuxAgent' \
    --header='x-ms-version: 2012-11-30' \
    --header='Connection: close' \
    --header='User-Agent: WALinuxAgent/2.2.46' \
    'http://168.63.129.16/machine/xxx-xxx._VmTestRun2004?comp=config&type=extensionsConfig&incarnation=3'
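For anyone without wget, roughly the same request can be built with Python's standard library. This sketch only copies the headers from the command above; build_extensions_config_request and fetch_extensions_config are hypothetical helper names. It can only reach 168.63.129.16 from inside the VM, and like the wget call it should be run as root.

```python
import urllib.request

# Headers copied from the wget command above.
WIRESERVER_HEADERS = {
    "x-ms-agent-name": "WALinuxAgent",
    "x-ms-version": "2012-11-30",
    "Connection": "close",
    "User-Agent": "WALinuxAgent/2.2.46",
}


def build_extensions_config_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a GET request equivalent to the wget call."""
    return urllib.request.Request(url, headers=WIRESERVER_HEADERS, method="GET")


def fetch_extensions_config(url: str, timeout: float = 30.0) -> bytes:
    """Send the request and return the raw XML body.

    Raises urllib.error.HTTPError for non-2xx responses such as 410 Gone.
    """
    request = build_extensions_config_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```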

You can get the full URL from your log, just 3 lines before the error message ("[the JSON object must be str, bytes or bytearray, not NoneType]"); it's the one that contains "extensionsConfig". Just be sure to decode the URL from the log.
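Finding and decoding that URL can also be scripted. The sketch below assumes the verbose log prints WireServer URLs starting with http://168.63.129.16/ and separated from surrounding text by whitespace, quotes, or brackets (the exact log line format is an assumption). It returns the extensionsConfig URL seen most recently before the last JSON parsing error, with percent-escapes such as %2D and %5F decoded by urllib.parse.unquote.

```python
import re
from typing import Optional
from urllib.parse import unquote

# WireServer URL as it might appear in a log line (assumed delimiters).
_URL_RE = re.compile(r"http://168\.63\.129\.16/[^\s'\"\]\)]+")
# Start of the error message from the 2.2.45 log.
_ERROR_MARKER = "the JSON object must be str, bytes or bytearray"


def last_extensions_config_url(log_text: str) -> Optional[str]:
    """Decoded extensionsConfig URL logged before the last JSON error, or None."""
    last_url = None  # most recent extensionsConfig URL seen so far
    result = None    # URL that preceded the most recent error
    for line in log_text.splitlines():
        for match in _URL_RE.finditer(line):
            if "extensionsConfig" in match.group(0):
                last_url = match.group(0)
        if _ERROR_MARKER in line and last_url is not None:
            result = last_url
    return unquote(result) if result is not None else None
```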

This returns a piece of XML with some embedded JSON. Do not post that here, since it may contain some information about your VM. Something in that JSON is confusing the parsing code in the agent. If you can't spot the issue, you could compare it against a successful run of the extension.

@Hardstl BTW, that JSON is generated from the extension settings in your template, so there may be a subtle issue in your template that I can't spot from the text you posted above.

@Hardstl Looking at the code for 2.2.45, likely the RuntimeSettings element in the XML document is empty. Could you verify that?

@narrieta this is the response from running the wget command. The VM is just a test VM in a MSDN subscription, so nothing really important.

hadmin@VmTestRun2004:~$ sudo wget \
>     --header='x-ms-agent-name: WALinuxAgent' \
>     --header='x-ms-version: 2012-11-30' \
>     --header='Connection: close' \
>     --header='User-Agent: WALinuxAgent/2.2.46' \
>  'http://168.63.129.16/machine/b056350c-03a1-4efc-9ef8-584053087d59/b6324691%2D6ee6%2D45f0%2Da84c%2D7002f3fc6459.%5FVmTestRun2004?comp=config&type=extensionsConfig&incarnation=3'
--2022-03-02 15:28:45--  http://168.63.129.16/machine/b056350c-03a1-4efc-9ef8-584053087d59/b6324691%2D6ee6%2D45f0%2Da84c%2D7002f3fc6459.%5FVmTestRun2004?comp=config&type=extensionsConfig&incarnation=3
Connecting to 168.63.129.16:80... connected.
HTTP request sent, awaiting response... 410 Gone
2022-03-02 15:28:45 ERROR 410: Gone.

How do I check whether RuntimeSettings is empty, and in which document?

@Hardstl Actually the request failed. 410 means the parameters to the request are no longer valid, likely the state of your VM has changed after the original failure. I think I saw "incarnation=3" in the request that failed on your VM. You would need to either try with the most recent error in your log, or try to reproduce the issue again and retry the request with the URL in the last error. If wget succeeds, it'll save the response to a file and it'll output a message saying "Saving to ...".

In any case, looking at the code for 2.2.45, the only json the agent tries to parse is given by the RuntimeSettings in the XML. From the error message it seems like it is missing altogether from the response. The current agent handles an empty value in a different way, and does not error out, which may explain why you don't see the error when you run the extension after the VM has been created (and the agent has autoupdated).
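The failure described above is easy to reproduce with the standard library alone. This is an illustration, not the agent's code: an empty XML element has .text set to None, and json.loads(None) raises a TypeError with essentially the message seen in the 2.2.45 log. The tolerant variant shows one hypothetical way of treating an empty value as "no settings" instead of failing.

```python
import json
import xml.etree.ElementTree as ET
from typing import Any, Dict, List


def runtime_settings_strict(xml_text: str) -> List[Dict[str, Any]]:
    """json.loads every ExtensionRuntimeSettings element's text.

    For an empty element, el.text is None and json.loads(None) raises
    TypeError("the JSON object must be str, bytes or bytearray, not ...").
    """
    root = ET.fromstring(xml_text)
    return [json.loads(el.text) for el in root.iter("ExtensionRuntimeSettings")]


def runtime_settings_tolerant(xml_text: str) -> List[Dict[str, Any]]:
    """Same traversal, but a missing or blank value becomes an empty dict."""
    root = ET.fromstring(xml_text)
    settings: List[Dict[str, Any]] = []
    for el in root.iter("ExtensionRuntimeSettings"):
        text = (el.text or "").strip()
        settings.append(json.loads(text) if text else {})
    return settings
```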

If indeed the RuntimeSettings are empty, most likely this would indicate an issue in your template/deployment. They should never be empty for RunCommand.

If you have a successful run, the current agent saves the response to that query in a file named /var/lib/waagent/ExtensionsConfig.*.xml; you could try checking the value of RuntimeSettings there.
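Checking saved responses can be automated. The sketch below reads every ExtensionsConfig.*.xml in /var/lib/waagent and lists the plugins under PluginSettings whose settings element is missing or empty. It looks for both RuntimeSettings (the name used in the comment above) and ExtensionRuntimeSettings (the element in the response posted later in this thread), and assumes the document uses no XML namespace; the helper names are hypothetical.

```python
import glob
import os
import xml.etree.ElementTree as ET
from typing import Dict, List, Union

# Element names that may carry the settings JSON (see lead-in).
SETTINGS_TAGS = ("RuntimeSettings", "ExtensionRuntimeSettings")


def plugins_without_settings(xml_text: Union[str, bytes]) -> List[str]:
    """Names of plugins under <PluginSettings> with no non-empty settings element."""
    root = ET.fromstring(xml_text)
    missing: List[str] = []
    # iter() includes the root itself, so a bare <PluginSettings> document works too.
    for plugin_settings in root.iter("PluginSettings"):
        for plugin in plugin_settings.findall("Plugin"):
            has_settings = any(
                (el.text or "").strip()
                for tag in SETTINGS_TAGS
                for el in plugin.findall(tag)
            )
            if not has_settings:
                missing.append(plugin.get("name", "<unnamed>"))
    return missing


def scan_saved_configs(directory: str = "/var/lib/waagent") -> Dict[str, List[str]]:
    """Apply plugins_without_settings to every ExtensionsConfig.*.xml in `directory`."""
    report: Dict[str, List[str]] = {}
    for path in sorted(glob.glob(os.path.join(directory, "ExtensionsConfig.*.xml"))):
        with open(path, "rb") as f:  # bytes, so any XML encoding declaration is honored
            report[path] = plugins_without_settings(f.read())
    return report
```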

@Hardstl I just noticed the URL you used is not fully decoded. I am not sure whether wget will decode it for you, but if it doesn't, that may explain the 410 error.

Instead of

http://168.63.129.16/machine/b056350c-03a1-4efc-9ef8-584053087d59/b6324691%2D6ee6%2D45f0%2Da84c%2D7002f3fc6459.%5FVmTestRun2004?comp=config&type=extensionsConfig&incarnation=3

you can try

http://168.63.129.16/machine/b056350c-03a1-4efc-9ef8-584053087d59/b6324691-6ee6-45f0-a84c-7002f3fc6459._VmTestRun2004?comp=config&type=extensionsConfig&incarnation=3

I just tried with the URL from the most recent error in the log.

sudo wget \
>     --header='x-ms-agent-name: WALinuxAgent' \
>     --header='x-ms-version: 2012-11-30' \
>     --header='Connection: close' \
>     --header='User-Agent: WALinuxAgent/2.2.46' \
>  'http://168.63.129.16/machine/0ec58157-97ef-4fa7-a6fb-f4202e2e5603/5e9212ff%2D09c2%2D4da4%2D8dbe%2Dcb54d99d2e00.%5FVmTestRun2004?comp=config&type=extensionsConfig&incarnation=1'
--2022-03-02 16:12:50--  http://168.63.129.16/machine/0ec58157-97ef-4fa7-a6fb-f4202e2e5603/5e9212ff%2D09c2%2D4da4%2D8dbe%2Dcb54d99d2e00.%5FVmTestRun2004?comp=config&type=extensionsConfig&incarnation=1
Connecting to 168.63.129.16:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6791 (6.6K) [text/xml]
Saving to: ‘5e9212ff-09c2-4da4-8dbe-cb54d99d2e00._VmTestRun2004?comp=config&type=extensionsConfig&incarnation=1’

5e9212ff-09c2-4da4-8dbe-cb54d99d2e00._VmTestRun2004?comp 100%[================================================================================================================================>]   6.63K  --.-KB/s    in 0s      

2022-03-02 16:12:50 (912 MB/s) - ‘5e9212ff-09c2-4da4-8dbe-cb54d99d2e00._VmTestRun2004?comp=config&type=extensionsConfig&incarnation=1’ saved [6791/6791]

The template is based on the latest Ubuntu 20.04 image from the marketplace in this case. It's just the most basic standard VM. If, however, I don't execute runCommand in the same template, the agent is installed as it should be. Could the runCommand be interfering with the initial setup of the agent?

@Hardstl The response was saved to
Saving to: ‘5e9212ff-09c2-4da4-8dbe-cb54d99d2e00._VmTestRun2004?comp=config&type=extensionsConfig&incarnation=1’
What is the value of RuntimeSettings? (Don't post any confidential info.)

@narrieta here's the content.

<PluginSettings>
  <Plugin name="Microsoft.CPlat.Core.RunCommandHandlerLinux" version="1.3.1">
    <ExtensionRuntimeSettings seqNo="0" name="runCommand1" state="enabled">{
  "runtimeSettings": [
    {
      "handlerSettings": {
        "protectedSettingsCertThumbprint": "removed thumbprint",
        "protectedSettings": "removed protected settings",
        "publicSettings": {"source":{"script":"touch /home/hadmin/test.txt"},"outputBlobUri":"https://anatfwlogs.blob.core.windows.net/error/output.txt","errorBlobUri":"https://anatfwlogs.blob.core.windows.net/error/error.txt","timeoutInSeconds":120}
      }
    }
  ]
}</ExtensionRuntimeSettings>
  </Plugin>
</PluginSettings>

@Hardstl - that looks good to me. This will require more debugging. From code inspection it seems to me that this is the same issue we already fixed in the current agent, the entire parsing of that json is now done differently. I'll leave this issue open and add it to our bug backlog, but it may take some time before we can get to it.

Sounds good. Thanks for your help!

Hi @narrieta Any update on this bug?

@Hardstl @JvDrunen

We have identified the issue and are working on a mitigation and fix.

The issue is in the interaction of the Run Command extension (version 2) and older agents in the base VM images. To mitigate, you can remove Run Command from the VM.

For the time being, do not use Run Command from the CLI (which uses V2). Using it from the portal (which uses V1) is OK.

I'm hitting the same problem with the Custom Script Extension (V2):

2022-10-27T05:53:48.263783Z WARNING Daemon Agent WALinuxAgent-2.2.46 launched with command 'python3 -u /usr/sbin/waagent -run-exthandlers' failed with return code: 1
2022-10-27T05:53:48.271214Z ERROR Daemon Event: name=WALinuxAgent, op=Enable, message=eJw1zLsOgzAMQNFf8cYUokLVgS07e2eXuE2kYCrHFvTveUgdr3R1wodY4RnGzLaFM1zXdu39AQWNp0QR1qwJpmWekSM035+mhXtwBt6q+PrK7FfEy3Fi7GjTdKyFpDbwxlz+hpCa8EFFGuC2AyUXKrQ=, duration=0
2022-10-27T05:53:48.280208Z WARNING Daemon Agent WALinuxAgent-2.2.46 launched with command 'python3 -u /usr/sbin/waagent -run-exthandlers' returned code: 1
2022-10-27T05:53:48.286191Z INFO Daemon Installed Agent WALinuxAgent-2.2.46 is the most current agent
2022-10-27T05:53:49.220645Z INFO ExtHandler Agent WALinuxAgent-2.2.46 is running as the goal state agent
2022-10-27T05:53:49.240505Z INFO ExtHandler Distro info: ubuntu 22.04, osutil class being used: UbuntuOSUtil, agent service name: walinuxagent
2022-10-27T05:53:49.271294Z INFO ExtHandler WireServer endpoint 168.63.129.16 read from file
2022-10-27T05:53:49.291756Z INFO ExtHandler Wire server endpoint:168.63.129.16
2022-10-27T05:53:49.331166Z WARNING ExtHandler Agent WALinuxAgent-2.2.46 failed with exception: str expected, not NoneType
2022-10-27T05:53:49.361480Z WARNING ExtHandler Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/azurelinuxagent/ga/update.py", line 268, in run
    monitor_thread.run()
  File "/usr/lib/python3/dist-packages/azurelinuxagent/ga/monitor.py", line 146, in run
    self.init_sysinfo()
  File "/usr/lib/python3/dist-packages/azurelinuxagent/ga/monitor.py", line 185, in init_sysinfo
    vminfo = self.protocol.get_vminfo()
  File "/usr/lib/python3/dist-packages/azurelinuxagent/common/protocol/wire.py", line 113, in get_vminfo
    goal_state = self.client.get_goal_state()
  File "/usr/lib/python3/dist-packages/azurelinuxagent/common/protocol/wire.py", line 856, in get_goal_state
    self.goal_state = GoalState(xml_text)
  File "/usr/lib/python3/dist-packages/azurelinuxagent/common/protocol/wire.py", line 1381, in __init__
    self._parse(xml_text)
  File "/usr/lib/python3/dist-packages/azurelinuxagent/common/protocol/wire.py", line 1398, in _parse
    os.environ[CONTAINER_ID_ENV_VARIABLE] = self.container_id
  File "/usr/lib/python3.10/os.py", line 684, in __setitem__
    value = self.encodevalue(value)
  File "/usr/lib/python3.10/os.py", line 756, in encode
    raise TypeError("str expected, not %s" % type(value).__name__)
TypeError: str expected, not NoneType

Update: I used the Ubuntu 18_04-lts-gen2 image and it worked fine. The problem seems to be with Ubuntu 20.04 and 22.04.

@miqm The issue you see is different. Custom Script is not affected by the issue in Run Command. If you can give us some repro steps we can take a look. Please open a new issue for that.

Seeing this issue as well. Just a repeated mess of these:

2022/10/27 20:04:34.807587 ERROR Daemon Event: name=WALinuxAgent, op=Enable, message=eJw1zLsOgzAMQNFf8cYUokK7sGVn7+wSt4kUTOXYgv49D6njla5O+BArPMOY2bZwhuvarr0/oKDxlCjCmjXBtMwzcoTm+9O0cA/OwFsVX1+Z/Yp4OU6MHW2ajrWQ1AbemMvfEFITPqhIA9x2JLUqsw==, duration=0
2022/10/27 20:04:34.816182 WARNING Daemon Agent WALinuxAgent-2.2.45 launched with command 'python3 -u /usr/sbin/waagent -run-exthandlers' returned code: 1
2022/10/27 20:04:34.822386 INFO Daemon Installed Agent WALinuxAgent-2.2.45 is the most current agent

Ubuntu 18.04, gen 2 image.

I tried downgrading to the 2.2.32-0ubuntu1~18.04.2 version of the package via the serial console, but had the same issues.
Note: this renders the machine unusable, since the service then fails to launch, the boot process stalls, ssh doesn't seem to load, and the serial console gets flooded with these messages.
I had to boot into single-user mode via the serial console and try to execute commands without being able to see what I was typing :/

@narrieta - I tried to reproduce but it seems to be working now. If I experience this error again I'll submit a new ticket.

@bpkroth Was this with RunCommand? This particular issue should not affect SSH. I can post a possible mitigation if you are using RunCommand, but it does require SSH access.

No, we didn't do anything so far - just tried to boot the VM, but somehow the waagent service is getting stuck and preventing other things from making progress.

@miqm The call stack of the initial error is interesting...

os.environ[CONTAINER_ID_ENV_VARIABLE] = self.container_id

If the error happens again, could you capture the contents of GoalState.*.xml under /var/lib/waagent and check that it is a valid XML file with a valid ContainerID element? (It should be a GUID.)
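A small sketch for the check requested above: it extracts the container ID element from a saved GoalState.*.xml and verifies that it parses as a GUID. The traceback shows why the value matters: assigning None to os.environ raises "str expected, not NoneType". The element spelling and path (.//ContainerId, no namespace) are assumptions about the file layout, and the helper names are hypothetical.

```python
import uuid
import xml.etree.ElementTree as ET
from typing import Optional, Union


def container_id_from_goal_state(xml_text: Union[str, bytes]) -> Optional[str]:
    """Text of the first ContainerId element, or None if it is absent or empty."""
    root = ET.fromstring(xml_text)
    el = root.find(".//ContainerId")
    if el is None or not (el.text or "").strip():
        return None
    return el.text.strip()


def is_valid_container_id(value: Optional[str]) -> bool:
    """True if `value` parses as a GUID."""
    if value is None:
        return False
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True
```

For a saved file, open it in binary mode and pass the bytes to container_id_from_goal_state, then check the result with is_valid_container_id.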

@bpkroth if you did not execute RunCommand, then it is a very different issue. The issue in this thread is very particular to executing RunCommand from a template or from the CLI. You should probably open a support case for your machine.

We may have run it during the VM setup process; I was taking this over from someone else. Is there a way to clear pending RunCommands?

You can check whether RunCommand was executed on this VM by fetching the instance view:

az vm get-instance-view --name vmName --resource-group resourceGroupName
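The JSON printed by that command can be searched for RunCommand entries. The field layout assumed here (instanceView.extensions[] and instanceView.vmAgent.extensionHandlers[], each entry with type and/or name, plus instanceView.vmAgent.vmAgentVersion) is based on typical instance-view output and should be checked against your own; the helper names are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def load_instance_view(path: str) -> Dict[str, Any]:
    """Load JSON saved from `az vm get-instance-view`."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def run_command_entries(vm: Dict[str, Any]) -> List[str]:
    """Extension/handler labels mentioning RunCommand (case-insensitive), deduplicated."""
    view = vm.get("instanceView") or {}
    entries = list(view.get("extensions") or [])
    entries += list((view.get("vmAgent") or {}).get("extensionHandlers") or [])
    found: List[str] = []
    for entry in entries:
        for key in ("type", "name"):
            label = entry.get(key) or ""
            if "runcommand" in label.lower():
                if label not in found:
                    found.append(label)
                break  # one label per entry is enough
    return found


def agent_version(vm: Dict[str, Any]) -> Optional[str]:
    """instanceView.vmAgent.vmAgentVersion, if present (assumed field name)."""
    return ((vm.get("instanceView") or {}).get("vmAgent") or {}).get("vmAgentVersion")
```

For example, save the output with az vm get-instance-view --name vmName --resource-group resourceGroupName > view.json, then call run_command_entries(load_instance_view("view.json")).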

The mitigation is a little elaborate and requires SSH access to the VM. I can summarize it here once you confirm.

@bpkroth The issue on your VM is unrelated. The error message is encoded in the log snippet you posted; here it is decoded:

$ echo 'eJw1zLsOgzAMQNFf8cYUokK7sGVn7+wSt4kUTOXYgv49D6njla5O+BArPMOY2bZwhuvarr0/oKDxlCjCmjXBtMwzcoTm+9O0cA/OwFsVX1+Z/Yp4OU6MHW2ajrWQ1AbemMvfEFITPqhIA9x2JLUqsw=='| base64 -d | pigz -zd
Agent WALinuxAgent-2.2.45 launched with command 'python3 -u /usr/sbin/waagent -run-exthandlers' failed with return code: 1
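The same decoding can be done in Python without pigz: the message is base64-encoded zlib data, so base64.b64decode followed by zlib.decompress reverses it. A minimal sketch (decode_event_message is a hypothetical helper name):

```python
import base64
import zlib


def decode_event_message(encoded: str) -> str:
    """Equivalent of `base64 -d | pigz -zd`: base64-decode, inflate the zlib stream, decode UTF-8."""
    return zlib.decompress(base64.b64decode(encoded)).decode("utf-8")
```

Passing the message= value from the log line above to decode_event_message should reproduce the decoded text.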

If you have SSH access, you could try running the agent from the command line, which may give you more information (/var/log/waagent.log may also have more useful info). Command line:

python3 -u /usr/sbin/waagent -run-exthandlers

What am I looking for in that az output?
I have SSH access at the moment, but I had to boot into single-user mode without walinuxagent.service running and start ssh.service manually, so the az output isn't showing much for the agent right now.

@bpkroth You would check whether RunCommand shows up in the instance view, but your issue is unrelated; see my previous post.

Oct 27 20:58:24 vm-name python3[30009]: 2022/10/27 20:58:24.366435 ERROR Daemon Event: name=WALinuxAgent, op=Enable, message=eJw1zLsOgzAMQNFf8cYUokK7sGVn7+wSt4kUTOXYgv49D6njla5O+BArPMOY2bZwhuvarr0/oKDxlCjCmjXBtMwzcoTm+9O0cA/OwFsVX1+Z/Yp4OU6MHW2ajrWQ1AbemMvfEFITPqhIA9x2JLUqsw==, duration=0
Oct 27 20:58:24 vm-name python3[30009]: 2022/10/27 20:58:24.375272 WARNING Daemon Agent WALinuxAgent-2.2.45 launched with command 'python3 -u /usr/sbin/waagent -run-exthandlers' returned code: 1
Oct 27 20:58:24 vm-name python3[30009]: 2022/10/27 20:58:24.381327 INFO Daemon Installed Agent WALinuxAgent-2.2.45 is the most current agent
Oct 27 20:58:24 vm-name python3[30009]: /usr/sbin/waagent:27: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
Oct 27 20:58:24 vm-name python3[30009]:   import imp
Oct 27 20:58:24 vm-name python3[30009]: Traceback (most recent call last):
Oct 27 20:58:24 vm-name python3[30009]:   File "/usr/sbin/waagent", line 31, in <module>
Oct 27 20:58:24 vm-name python3[30009]:     import azurelinuxagent.agent as agent
Oct 27 20:58:24 vm-name python3[30009]: ModuleNotFoundError: No module named 'azurelinuxagent'

OK, it seems the agent is not properly installed on this VM. The agent is installed by the publisher of the image; is yours Canonical's? You could try browsing through the agent logs to spot when this started happening, and then see whether something was done to the agent installation on that VM, or whether a new Python was installed, or something like that. You can also try opening a support case.
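One way to narrow this down is to ask the same interpreter the agent uses (python3, per the command line in the log) where it would import azurelinuxagent from. In the traceback earlier in this thread the package lives under /usr/lib/python3/dist-packages; if a different Python was installed, or the package files are missing, the lookup below returns None, matching the ModuleNotFoundError. A minimal diagnostic sketch, not part of the agent:

```python
import importlib.util
import sys
from typing import Optional


def module_location(name: str = "azurelinuxagent") -> Optional[str]:
    """Where this interpreter would import `name` from, or None if it cannot find it."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # same situation as the ModuleNotFoundError above
    return spec.origin or "<namespace package>"


if __name__ == "__main__":
    print("interpreter:", sys.executable, sys.version.split()[0])
    print("azurelinuxagent:", module_location() or "NOT FOUND")
```

Run it with python3 on the VM; a NOT FOUND result points at the Python installation or the package files rather than at RunCommand.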

It was a fresh VM, set up this morning. I tried reinstalling via apt, but it didn't seem to help. I wonder if the package got corrupted somehow. Anyway, I'm working on creating a new one.

@narrieta Is there any update on this? I'm currently deploying a template using the Azure CLI 2.54.0. The VM is the Canonical Ubuntu 22.04 image. Sometimes the run command works, sometimes it doesn't. Very inconsistent.

The agent in this build is 2.2.46. I've enabled auto-updates, but that doesn't appear to have any effect on the agent version.

Output when I run python3 -u /usr/sbin/waagent -run-exthandlers directly:

2023-11-17T19:35:11.878794Z INFO ExtHandler Agent WALinuxAgent-2.2.46 is running as the goal state agent
2023-11-17T19:35:11.879211Z INFO ExtHandler Distro info: ubuntu 22.04, osutil class being used: UbuntuOSUtil, agent service name: walinuxagent
2023-11-17T19:35:11.881505Z INFO ExtHandler Detect protocol endpoints
2023-11-17T19:35:11.882250Z INFO ExtHandler Clean protocol and wireserver endpoint
2023-11-17T19:35:11.882690Z INFO ExtHandler WireServer endpoint is not found. Rerun dhcp handler
2023-11-17T19:35:11.883072Z INFO ExtHandler Test for route to 168.63.129.16
2023-11-17T19:35:11.883569Z INFO ExtHandler Route to 168.63.129.16 exists
2023-11-17T19:35:11.883900Z INFO ExtHandler Wire server endpoint:168.63.129.16
2023-11-17T19:35:11.901620Z INFO ExtHandler Fabric preferred wire protocol version:2015-04-05
2023-11-17T19:35:11.901932Z INFO ExtHandler Wire protocol version:2012-11-30
2023-11-17T19:35:11.902277Z INFO ExtHandler Server preferred version:2015-04-05
2023-11-17T19:35:12.522123Z INFO ExtHandler Found private key matching thumbprint {thumbprint}
2023-11-17T19:35:12.581723Z INFO ExtHandler Found private key matching thumbprint {thumbprint}
2023-11-17T19:35:12.646169Z INFO ExtHandler Found private key matching thumbprint {thumbprint}
2023-11-17T19:35:12.654090Z INFO ExtHandler WireServer is not responding. Reset dhcp endpoint
2023-11-17T19:35:12.657069Z INFO ExtHandler Protocol endpoint not found: WireProtocol, [ProtocolError] Exceeded max retry updating goal state
2023-11-17T19:35:12.663766Z INFO ExtHandler Protocol endpoint not found: MetadataProtocol, [ProtocolError] 404 - GET: http://169.254.169.254/Microsoft.Compute/identity?api-version=2015-05-01-preview
2023-11-17T19:35:12.669740Z INFO ExtHandler Retry detect protocols: retry=0
2023-11-17T19:35:22.682220Z INFO ExtHandler WireServer endpoint is not found. Rerun dhcp handler
2023-11-17T19:35:22.686006Z INFO ExtHandler Test for route to 168.63.129.16
2023-11-17T19:35:22.688490Z INFO ExtHandler Route to 168.63.129.16 exists
2023-11-17T19:35:22.690886Z INFO ExtHandler Wire server endpoint:168.63.129.16
2023-11-17T19:35:22.696872Z INFO ExtHandler Fabric preferred wire protocol version:2015-04-05
2023-11-17T19:35:22.699991Z INFO ExtHandler Wire protocol version:2012-11-30
2023-11-17T19:35:22.702494Z INFO ExtHandler Server preferred version:2015-04-05
2023-11-17T19:35:22.866212Z INFO ExtHandler Found private key matching thumbprint {thumbprint}
2023-11-17T19:35:22.927016Z INFO ExtHandler Found private key matching thumbprint {thumbprint}
2023-11-17T19:35:27.006651Z INFO ExtHandler Found private key matching thumbprint {thumbprint}
2023-11-17T19:35:27.014972Z INFO ExtHandler WireServer is not responding. Reset dhcp endpoint
2023-11-17T19:35:27.016908Z INFO ExtHandler Protocol endpoint not found: WireProtocol, [ProtocolError] Exceeded max retry updating goal state
2023-11-17T19:35:27.024163Z INFO ExtHandler Protocol endpoint not found: MetadataProtocol, [ProtocolError] 404 - GET: http://169.254.169.254/Microsoft.Compute/identity?api-version=2015-05-01-preview
2023-11-17T19:35:27.024357Z INFO ExtHandler Retry detect protocols: retry=1

If Run Command v2 is in your VM model and, for whatever reason, the agent running is older than 2.4.0.2, the agent may crash while parsing the goal state.
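The version check in that statement can be expressed as a simple comparison. The sketch assumes the version string looks like the ones in this thread (for example WALinuxAgent-2.2.46 or 2.2.45) and compares the dotted numeric components as integer tuples, so 2.2.46 sorts below 2.4.0.2; the function names are hypothetical.

```python
import re
from typing import Tuple

# Threshold from the statement above: older agents may crash on Run Command v2.
MIN_SAFE_VERSION = (2, 4, 0, 2)


def parse_agent_version(text: str) -> Tuple[int, ...]:
    """First dotted number in `text`, e.g. 'WALinuxAgent-2.2.46' -> (2, 2, 46)."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def may_crash_with_run_command_v2(version_text: str) -> bool:
    """True if the agent version is below MIN_SAFE_VERSION (tuples compare element-wise)."""
    return parse_agent_version(version_text) < MIN_SAFE_VERSION
```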

The known scenarios and their mitigation