gardener / machine-controller-manager-provider-azure

This repository is the out-of-tree implementation of the machine driver for the Azure cloud provider.

Add support for force deleting VM when its provisioning state is FAILED

unmarshall opened this issue

What would you like to be added:

In Azure, if a Virtual Machine has ProvisioningState set to Failed, it can neither be updated nor deleted, so the VM remains stuck in this state. Any attempt to update the associated resources (NIC, OSDisk and DataDisk) to set cascade delete also fails, because VM updates are not allowed in this state. Azure returns the following:

E1121 11:07:51.116477   26301 machine_util.go:1242] Error while deleting machine --REDACTED--: machine codes error: code = [Internal] message = [Failed to update cascade delete of associated resources for VM: [ResourceGroup: --REDACTED--, Name: --REDACTED--], Err: PATCH https://management.azure.com/subscriptions/--REDACTED--/resourceGroups/--REDACTED--/providers/Microsoft.Compute/virtualMachines/--REDACTED--
--------------------------------------------------------------------------------
RESPONSE 409: 409 Conflict
ERROR CODE: OperationNotAllowed
--------------------------------------------------------------------------------
{
  "error": {
    "code": "OperationNotAllowed",
    "message": "Operation 'Update VM' is not allowed on VM '--REDACTED--' since the VM is marked for deletion. You can only retry the Delete operation (or wait for an ongoing one to complete)."
  }
}
--------------------------------------------------------------------------------
]

In these situations, the VM should be deleted, followed by explicit deletion of all associated resources (NIC, OSDisk and DataDisk(s)).
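The proposed flow can be sketched roughly as follows. This is only an illustration under stated assumptions, not the driver's actual code: `VMInfo`, `Cloud`, `fakeCloud` and `DeleteMachine` are hypothetical names, the `"Failed"` string is assumed to be the value Azure reports for the provisioning state, and the real Azure SDK calls are hidden behind the hypothetical `Cloud` interface.

```go
package main

import (
	"errors"
	"fmt"
)

// VMInfo is a hypothetical, simplified view of a VM and its associated resources.
type VMInfo struct {
	Name              string
	ProvisioningState string   // assumed values: "Succeeded", "Failed", ...
	NICNames          []string // network interfaces attached to the VM
	DiskNames         []string // OS disk and data disk(s)
}

// Cloud is a hypothetical abstraction over the Azure operations needed here.
type Cloud interface {
	SetCascadeDelete(vmName string) error // the VM update (PATCH) that fails with 409 in the Failed state
	DeleteVM(vmName string) error
	DeleteNIC(name string) error
	DeleteDisk(name string) error
}

// DeleteMachine deletes a VM and its associated resources.
// For a healthy VM it enables cascade delete and lets the VM deletion clean up.
// For a VM in the Failed state it skips the update, deletes the VM, and then
// deletes each associated resource explicitly.
func DeleteMachine(c Cloud, vm VMInfo) error {
	if vm.ProvisioningState != "Failed" {
		if err := c.SetCascadeDelete(vm.Name); err != nil {
			return fmt.Errorf("set cascade delete for VM %s: %w", vm.Name, err)
		}
		return c.DeleteVM(vm.Name)
	}
	if err := c.DeleteVM(vm.Name); err != nil {
		return fmt.Errorf("delete failed VM %s: %w", vm.Name, err)
	}
	// Keep going on individual errors so one stuck resource does not block the rest.
	var errs []error
	for _, nic := range vm.NICNames {
		if err := c.DeleteNIC(nic); err != nil {
			errs = append(errs, err)
		}
	}
	for _, disk := range vm.DiskNames {
		if err := c.DeleteDisk(disk); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when every deletion succeeded
}

// fakeCloud is an in-memory stand-in that only records the calls it receives.
type fakeCloud struct{ Calls []string }

func (f *fakeCloud) record(op, name string) error {
	f.Calls = append(f.Calls, fmt.Sprintf("%s(%s)", op, name))
	return nil
}
func (f *fakeCloud) SetCascadeDelete(n string) error { return f.record("SetCascadeDelete", n) }
func (f *fakeCloud) DeleteVM(n string) error         { return f.record("DeleteVM", n) }
func (f *fakeCloud) DeleteNIC(n string) error        { return f.record("DeleteNIC", n) }
func (f *fakeCloud) DeleteDisk(n string) error       { return f.record("DeleteDisk", n) }
```

Calling `DeleteMachine` with a `fakeCloud` and a VM whose state is `"Failed"` records the VM deletion first, followed by the NIC and disk deletions, and never attempts the cascade-delete update that Azure rejects.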

Why is this needed:
This ensures that the VM and its associated resources are cleaned up properly.
We have seen multiple issues in Canary [Issue #4358, #4389, #4390, #4377] where VMs were stuck with ProvisioningState = Failed for days and nothing could be done to clean them up. Operators would have to manually issue a delete for each VM. With this issue we attempt to clean up all resources automatically.

/close as fixed
A patch PR has also been raised: #120