gardener / machine-controller-manager-provider-azure

This repository is the out-of-tree implementation of the machine driver for the Azure cloud provider.

Avoid updating VM to set delete option for disks and nics when toBeDetached is already true

unmarshall opened this issue

What would you like to be added:

It was highlighted in CanaryIssue #4377 that there are cases where disk detachment can get stuck. The older code first detached the disks, then deleted the VM, and finally deleted the disks. This ordering causes problems when the disk detachment is stuck, and attach/detach disk errors are common in Azure. As a consequence, the corresponding machine object is stuck in the Terminating state. In the issue that we observed, the machine was stuck in the Terminating state for more than 5 days.

This issue does not attempt to resolve the stuck disk detachment, which should ideally be investigated by MSMT, as we do not have insight into how they handle state transitions, retries, etc. on the server side.

JSON snippet of the storageProfile of the VM. As you can see, toBeDetached is set to true for the data disk.

"storageProfile": {
    "dataDisks": [
      {
        "caching": "None",
        "createOption": "Empty",
        "deleteOption": "Detach",
        "detachOption": null,
        "diskIopsReadWrite": null,
        "diskMBpsReadWrite": null,
        "diskSizeGb": 1,
        "image": null,
        "lun": 0,
        "managedDisk": {
          "diskEncryptionSet": null,
          "id": "/subscriptions/--REDACTED--/resourceGroups/--REDACTED--/providers/Microsoft.Compute/disks/--REDACTED--",
          "resourceGroup": "--REDACTED--",
          "securityProfile": null,
          "storageAccountType": "Premium_LRS"
        },
        "name": "--REDACTED--",
        "toBeDetached": true,
        "vhd": null,
        "writeAcceleratorEnabled": null
      }
    ],
    "diskControllerType": "SCSI",
    "imageReference": {
      "communityGalleryImageId": null,
      "exactVersion": "1062800125.0.0",
      "id": null,
      "offer": null,
      "publisher": null,
      "sharedGalleryImageId": "/SharedGalleries/--REDACTED--/Images/vSMP_MemoryONE/Versions/1062800125.0.0",
      "sku": null,
      "version": null
    },
    "osDisk": {
      "caching": "None",
      "createOption": "FromImage",
      "deleteOption": "Detach",
      "diffDiskSettings": null,
      "diskSizeGb": 150,
      "encryptionSettings": null,
      "image": null,
      "managedDisk": {
        "diskEncryptionSet": null,
        "id": "/subscriptions/--REDACTED--/resourceGroups/--REDACTED--/providers/Microsoft.Compute/disks/--REDACTED--",
        "resourceGroup": "--REDACTED--",
        "securityProfile": null,
        "storageAccountType": "Premium_LRS"
      },
      "name": "--REDACTED---os-disk",
      "osType": "Linux",
      "vhd": null,
      "writeAcceleratorEnabled": null
    }
  },

An attempt to update this VM to change the delete options for its disks fails with the following error:

Failed to trigger update of VM [ResourceGroup: --REDACTED--, VMName: --REDACTED--] : Azure API Response-Headers: map[x-ms-correlation-request-id:--REDACTED-- x-ms-request-id:--REDACTED--] Err: PATCH https://management.azure.com/subscriptions/--REDACTED--/resourceGroups/--REDACTED--/providers/Microsoft.Compute/virtualMachines/--REDACTED--
--------------------------------------------------------------------------------
RESPONSE 409: 409 Conflict
ERROR CODE: AttachDiskWhileBeingDetached
--------------------------------------------------------------------------------
{
  "error": {
    "code": "AttachDiskWhileBeingDetached",
    "message": "Cannot attach data disk '--REDACTED---data-disk' to VM '--REDACTED--' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again. Instructions can be found at https://aka.ms/AzureDiskDetached",
    "target": "dataDisks"
  }
}
--------------------------------------------------------------------------------

We can check whether toBeDetached is already set to true for any data disk. If it is, we can skip the update of the VM, since that update will always fail.
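A minimal sketch of that check, just to make the proposal concrete. It is not the provider's actual code: the `dataDisk` and `storageProfile` types are hypothetical stand-ins that mirror only the fields from the storageProfile snippet above, and the function names are made up for illustration. The input is assumed to be the inner storageProfile object.

```go
package main

import "encoding/json"

// dataDisk and storageProfile are hypothetical stand-ins that mirror only the
// fields of the storageProfile snippet needed here; they are not SDK types.
type dataDisk struct {
	Name         string `json:"name"`
	ToBeDetached *bool  `json:"toBeDetached"`
}

type storageProfile struct {
	DataDisks []dataDisk `json:"dataDisks"`
}

// anyDataDiskBeingDetached reports whether at least one data disk has
// toBeDetached set to true. A missing, null or false value does not count.
func anyDataDiskBeingDetached(sp storageProfile) bool {
	for _, d := range sp.DataDisks {
		if d.ToBeDetached != nil && *d.ToBeDetached {
			return true
		}
	}
	return false
}

// shouldUpdateDeleteOptions decodes a storageProfile JSON object and reports
// whether the VM update that sets the delete options is worth attempting.
// It returns false when any data disk is already marked toBeDetached.
func shouldUpdateDeleteOptions(raw []byte) (bool, error) {
	var sp storageProfile
	if err := json.Unmarshal(raw, &sp); err != nil {
		return false, err
	}
	return !anyDataDiskBeingDetached(sp), nil
}
```

In the provider, the same predicate would presumably run against the VM object that has already been fetched before the delete flow starts, so the check itself costs no additional API call.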

Why is this needed:

It saves one unnecessary Azure provider API call (update VM), since that call will always fail as long as toBeDetached is set to true for any data disk. It also helps avoid hitting the API rate limit, which was observed when the number of such machines was >= 10.

@himanshu-kun and I also decided that we will enhance ListMachine to extract VM names from disks as well. This will ensure that DeleteMachine is also called if there is a leftover disk (OS or data disk).
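A rough sketch of how VM names could be recovered from leftover disks. It rests on an assumption that is not confirmed here: that a disk name is the VM name followed by `-os-disk` or `-data-disk`, as the redacted names above suggest. Real data disk names may carry extra segments, in which case the parsing would need to change. The helper names are hypothetical.

```go
package main

import (
	"sort"
	"strings"
)

// Suffixes assumed from the redacted disk names in this issue
// ("...-os-disk", "...-data-disk"); the real naming scheme may differ.
var diskSuffixes = []string{"-os-disk", "-data-disk"}

// vmNameFromDisk returns the assumed VM name prefix of a disk name.
// The second result is false when the name matches no known suffix
// or when nothing is left after removing the suffix.
func vmNameFromDisk(diskName string) (string, bool) {
	for _, suffix := range diskSuffixes {
		if !strings.HasSuffix(diskName, suffix) {
			continue
		}
		if name := strings.TrimSuffix(diskName, suffix); name != "" {
			return name, true
		}
	}
	return "", false
}

// machineNames merges the names of existing VMs with the names recovered
// from disks, so that a VM whose disks were left behind is still listed.
// The result is de-duplicated and sorted.
func machineNames(vmNames, diskNames []string) []string {
	seen := make(map[string]struct{})
	for _, n := range vmNames {
		seen[n] = struct{}{}
	}
	for _, d := range diskNames {
		if n, ok := vmNameFromDisk(d); ok {
			seen[n] = struct{}{}
		}
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}
```

With this shape, a machine whose VM is already gone but whose OS or data disk remains still shows up in the listing, which is what allows DeleteMachine to be called for it.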