Azure / WALinuxAgent

Microsoft Azure Linux Guest Agent

Home Page:http://azure.microsoft.com/

WALinuxAgent doesn't download all .crt and .prv files from KeyVault

CMalaquias17 opened this issue

Not sure if this is a bug but I will try to explain as much as I can.

Environment: Virtual Machine Scale Set on Azure (West EU, West US, Japan, Asia, ...; all regions impacted)

We deploy VMSS in Azure using an ARM template and an Ansible playbook with some configurations.
Before running Ansible, we use the following to push certificates from KV:

TEMPLATE
{
  "type": "Microsoft.Compute/virtualMachines",
  "name": "Region1VM",
  ...
  "properties": {
    ...
    "osProfile": {
      "computerName": "Region1VM",
      ...
      "secrets": [
        {
          "sourceVault": {
            "id": "[resourceId('Microsoft.KeyVault/vaults', 'Region1KeyVault')]"
          },
          "vaultCertificates": [
            {
              "certificateUrl": "[reference(resourceId('Microsoft.KeyVault/vaults/secrets', 'Region1KeyVault', 'SampleCertificateAsSecret')).secretUriWithVersion]",
              "certificateStore": "My"
            }
          ]
        }
      ]
    },
    ...
  }
}

LINK
https://devblogs.microsoft.com/premier-developer/centralized-vm-certificate-deployment-across-multiple-regions-with-arm-templates/#part-2-push-certificate-from-the-regional-key-vault-to-the-virtual-machine

After running this part of the ARM template, we copy some certificates from /var/lib/waagent to another location, but the copy fails with the error below:

"could not find or access '/var/lib/waagent/nameexample.prv'"

The problem is that the missing file should be downloaded when the certificates are pushed from Key Vault, but this is not happening, and the Ansible playbook crashes.

If we restart the waagent service, the file "nameexample.prv" is downloaded and Ansible no longer crashes.

The final lines of the Ansible code remove this file from the VM again, so the next deployment crashes again.

We have two workarounds here:
FIRST - if we restart the waagent service after the crash, everything runs as expected
SECOND - if we don't delete the file at the end of the Ansible playbook, the next deployment doesn't crash
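
A pre-flight check before the copy step could make this failure explicit instead of letting the playbook crash partway through. The following is only a minimal sketch in Python. It assumes the files follow a `<name>.crt` / `<name>.prv` pattern under /var/lib/waagent, as the error message suggests; the function name and its parameters are hypothetical and are not part of the agent or of Ansible.

```python
from pathlib import Path


def missing_cert_files(names, cert_dir="/var/lib/waagent"):
    """Return the expected .crt/.prv paths that are absent from cert_dir.

    `names` are the base file names (hypothetical input, e.g. "nameexample").
    """
    missing = []
    for name in names:
        for suffix in (".crt", ".prv"):
            path = Path(cert_dir) / f"{name}{suffix}"
            if not path.is_file():
                missing.append(str(path))
    return missing
```

If the returned list is not empty, the playbook could stop with a clear message or fall back to the first workaround (restarting the waagent service) before retrying the copy.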

MAIN PROBLEM - this has been working like this for the last 8 months, but now we are getting these errors.

We don't understand why the agent doesn't download all the files from KV, and why we always need to restart the service to do a "complete download", let's say.

  • Distro and Version: RedHat|RHEL|7.3|
  • WALinuxAgent version: 2.9.0.4

Additional context
I can give you an agent log for the issue on the 31st of January at 4:30PM:

Log file attached

waagent.log

@CMalaquias17

For the example in the log you posted, at what time did you push the certificate update? The agent log does not show any operation updating certificates around 4:30.

Each operation that updates certificates produces a new sequential ID that we call "incarnation". The last certificate update closest to 4:30 was on incarnation 63, at 1:36:

2023-01-31T01:36:19.614894Z INFO ExtHandler Fetched a new incarnation for the WireServer goal state [incarnation 63]
2023-01-31T01:36:20.505239Z INFO ExtHandler Downloaded certificate {'hasPrivateKey': True, 'thumbprint': u'F6E6FAAC28A865022005BDAB233A1030F9C155CB'}

After that, there are no further operations updating certificates. The reason the certificate is downloaded after you restart the agent is because on service restart the agent repeats the last operation. In this case, it repeated incarnation 63 and downloaded the certificate again:

2023/01/31 04:53:28.065586 INFO Agent WALinuxAgent-2.2.14 forwarding signal 15 to WALinuxAgent-2.9.0.4
2023-01-31T04:53:28.474399Z INFO ExtHandler Fetched a new incarnation for the WireServer goal state [incarnation 63]
2023-01-31T04:53:29.398178Z INFO ExtHandler Downloaded certificate {'hasPrivateKey': True, 'thumbprint': u'F6E6FAAC28A865022005BDAB233A1030F9C155CB'}
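
The mapping between incarnations and certificate downloads can also be read out of waagent.log mechanically. Below is a rough sketch that assumes the log lines look like the excerpts quoted above; the regular expressions and the function name are illustrative and are not agent APIs.

```python
import re

# Patterns modeled on the log excerpts in this thread (an assumption).
INCARNATION_RE = re.compile(r"\[incarnation (\d+)\]")
THUMBPRINT_RE = re.compile(r"Downloaded certificate .*'thumbprint': u?'([0-9A-F]+)'")

SAMPLE_LINES = [
    "2023-01-31T01:36:19.614894Z INFO ExtHandler Fetched a new incarnation "
    "for the WireServer goal state [incarnation 63]",
    "2023-01-31T01:36:20.505239Z INFO ExtHandler Downloaded certificate "
    "{'hasPrivateKey': True, 'thumbprint': u'F6E6FAAC28A865022005BDAB233A1030F9C155CB'}",
]


def certificates_by_incarnation(log_lines):
    """Map each incarnation to the thumbprints logged after it was fetched."""
    result = {}
    current = None
    for line in log_lines:
        match = INCARNATION_RE.search(line)
        if match:
            current = int(match.group(1))
            result.setdefault(current, [])
            continue
        match = THUMBPRINT_RE.search(line)
        if match and current is not None:
            thumbprint = match.group(1)
            # A restart repeats the last incarnation; keep each thumbprint once.
            if thumbprint not in result[current]:
                result[current].append(thumbprint)
    return result
```

Calling `certificates_by_incarnation(SAMPLE_LINES)` returns `{63: ['F6E6FAAC28A865022005BDAB233A1030F9C155CB']}`. A thumbprint that never appears under any incarnation would indicate a certificate the agent did not download.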

Hello @narrieta, thanks for the explanation regarding incarnations, I didn't know that. The certificates are pushed around 10 minutes before the Ansible task that copies them.

As you can see, incarnation 63 points to file F6E6FAACxxxxx and not to the one that is missing, F66D9xxxx. That means that file was not downloaded. We can't understand why the secrets are not downloaded during the push but only after we restart the service. It seems that the push is not forcing the download of the secrets, or something like that.

Let me try to add more information here:

@CMalaquias17 I posted the wrong cert in my previous reply. F66D9 is also being downloaded as part of incarnation 63:

2023-01-31T01:36:19.635993Z INFO ExtHandler Fetching full goal state from the WireServer [incarnation 63]
2023-01-31T01:36:20.509214Z INFO ExtHandler Downloaded certificate {'hasPrivateKey': True, 'thumbprint': u'F66D94981674C46758C67D00238E84515113F843'}

2023/01/31 04:53:28.065586 INFO Agent WALinuxAgent-2.2.14 forwarding signal 15 to WALinuxAgent-2.9.0.4
2023-01-31T04:53:28.491823Z INFO ExtHandler Fetching full goal state from the WireServer [incarnation 63]
2023-01-31T04:53:29.402202Z INFO ExtHandler Downloaded certificate {'hasPrivateKey': True, 'thumbprint': u'F66D94981674C46758C67D00238E84515113F843'}

I see that Custom Script ran at this time

2023-01-31T04:25:28.118386Z INFO ExtHandler Fetched new vmSettings [HostGAPlugin correlation ID: b810f1b1-f05b-46c3-9afb-536ba70e9659 eTag: 15363476384684072639 source: FastTrack]

This operation is coming via "FastTrack", which is a recent optimization to make extensions execute faster. FastTrack operations won't download the keyvault certificates. As a workaround you can consider copying the certificates originally downloaded, instead of moving them to a different location.
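
Because FastTrack operations do not download the Key Vault certificates, it can help to list when they occurred and compare those times with the playbook runs. This sketch assumes the vmSettings log line has the shape of the one quoted above; the helper names are made up for illustration.

```python
import re

# Captures the leading timestamp and the "source:" field of a vmSettings line.
VMSETTINGS_RE = re.compile(r"^(\S+) .*Fetched new vmSettings \[.*source: (\w+)\]")


def goal_state_sources(log_lines):
    """Return (timestamp, source) pairs for each vmSettings fetch in the log."""
    events = []
    for line in log_lines:
        match = VMSETTINGS_RE.search(line)
        if match:
            events.append((match.group(1), match.group(2)))
    return events


def fast_track_timestamps(log_lines):
    """Return the timestamps of goal states that arrived via FastTrack."""
    return [ts for ts, source in goal_state_sources(log_lines) if source == "FastTrack"]
```

Any certificate file deleted before one of these timestamps would not be restored by that operation, which matches the behavior described in this thread.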

@narrieta we are not moving the certificates; we are copying some of them and then deleting only a few of them. Regarding Fast Track, which I was not aware of: when did this optimization start being used? I mean the date and time. This could explain why the ARM template and script worked for a very long time and started to fail a few months back. Is it possible to know when Fast Track started being used?

Is Fast Track something that we should activate, or is it activated in the background without any action needed?

Another thing, if I may ask, and sorry for the very long questions: when you say Fast Track operations "won't download the keyvault certificates", does it mean that if we use that part of the ARM template, it will not download the certificates at that time?

thank you.

@CMalaquias17

Ok, then you may be deleting the cert that the custom script needs. The F6xxx cert was downloaded at 2023-01-31T01:36:19.635993Z.

Fast track was enabled over several months starting from late 2022. No action is needed from users.

The keyvault certificates will be downloaded, although not on every single operation (if the operation is using Fast Track then they won't be downloaded). In your case, the certificates were downloaded on incarnation 63, which is not using Fast Track.

@narrieta I think you are right. The issue started at the end of November in only one region, and since late December it has spread to more than one region. We have been doing this for a very long time with the same lines of code, so maybe Fast Track is the explanation.

Is there a way to avoid Fast Track?

Is there something we can do, like changing the ARM templates or the code, to force the downloads?

Or,
is the only way to not delete the certificates from the VMs? Let's say that some of the certificates need to be rotated: at that point, can we be sure that the new values will be downloaded to the VM?

thank you.

@CMalaquias17

There are some operations that never use FastTrack and force a re-download of the certificates, for example adding a tag (az vm update --set tags.Tag1=Value1), re-apply (az vm reapply), etc.
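
The two operations mentioned here could be scripted between the deployment and the copy step. The sketch below only builds the argument lists for the `az vm` commands quoted in the reply. The `--resource-group` and `--name` arguments are standard Azure CLI arguments that the reply left out, and the resource group, VM name, and function names are placeholders chosen for illustration. The thread concerns a scale set, which this sketch does not cover.

```python
import subprocess


def redownload_commands(resource_group, vm_name, tag="Tag1", value="Value1"):
    """Build the az CLI invocations that, per the reply, never use FastTrack."""
    target = ["--resource-group", resource_group, "--name", vm_name]
    return {
        # Adding or changing a tag on the VM.
        "tag": ["az", "vm", "update", *target, "--set", f"tags.{tag}={value}"],
        # Re-applying the VM's current state.
        "reapply": ["az", "vm", "reapply", *target],
    }


def run(command):
    """Execute one of the commands; requires a logged-in Azure CLI."""
    subprocess.run(command, check=True)
```

For example, `run(redownload_commands("my-rg", "Region1VM")["reapply"])` would trigger a re-apply, where "my-rg" is a hypothetical resource group name.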

If not deleting the certificates is an option, I would recommend that. We can change the agent to download certificates when using FastTrack too, but we don't have another release coming in the next few months.

As far as rotating the certificates, what do you currently do? Any changes in the ARM template involving certificates won't use Fast Track. In your case, running Custom Script used Fast Track because the operation is not related to certificates.

@narrieta by rotating I mean changing the key of one certificate or even adding a new one to the KV. How does the agent know that it has more certificates to download?

So, does it mean that here (image below) we are not downloading anything?
image

@CMalaquias17 Yes, those are being downloaded. You can check incarnation 63

So sometimes, when running the ARM template, it will download the certificates and sometimes it will not?