gardener / machine-controller-manager-provider-azure

This repository is the out-of-tree implementation of the machine driver for the Azure cloud provider

machine-controller-manager cannot delete Machine when the resource group is already deleted

ialidzhikov opened this issue

What happened:

$ k -n shoot--foo--bar get machines shoot--foo--bar-bar-z3-5bf676f44d-kt27p -o yaml
status:
  currentStatus:
    lastUpdateTime: "2020-11-20T07:52:09Z"
    phase: Terminating
  lastOperation:
    description: 'Cloud provider message - compute.VirtualMachinesClient#List: Failure
      responding to request: StatusCode=404 -- Original Error: autorest/azure: Service
      returned an error. Status=404 Code="ResourceGroupNotFound" Message="Resource
      group ''shoot--foo--bar'' could not be found."'
    lastUpdateTime: "2020-11-20T07:52:09Z"
    state: Failed
    type: Delete

What you expected to happen:
machine-controller-manager should be able to delete a Machine when the resource group is already deleted. For example, the Azure cloud-controller-manager handles this case and can delete a Service of type LoadBalancer when the underlying resource group is already deleted.

How to reproduce it (as minimally and precisely as possible):

  1. Create an Azure Machine
  2. Delete the underlying resource group
  3. Ensure that machine-controller-manager cannot delete the Machine CR

Anything else we need to know:

Environment:

  • machine-controller-manager version: v0.34.3

/kind bug
/platform azure
/cc @dkistner

There is one more case that needs to be resolved. It occurs when deleting a Machine that does not yet have a providerID.

In pkg/controller/machine.go, the machineDelete func has special handling for the case where the providerID does not exist - see https://github.com/gardener/machine-controller-manager/blob/42c0dc4dbe0077867b109334d428e2fe64082e4d/pkg/controller/machine.go#L746-L768. It tries to list all VMs and find a machine with a matching name. The driver.GetVMs("") call fails with ResourceGroupNotFound, hence the Machine cannot be deleted.

Potential steps to reproduce would be:

  1. Trigger machine creation
  2. Delete the underlying resource group while the new machine is being created
  3. Verify that the Machine from step 1 cannot be created
  4. Delete the Machine
  5. Make sure that the Machine deletion fails as described above.

To fix the issue described above properly, I think we should introduce error codes like InstanceNotFound, as is done in the upstream cloud-controller-manager. See for example https://github.com/kubernetes/kubernetes/blob/b0abe89ae259d5e891887414cb0e5f81c969c697/staging/src/k8s.io/legacy-cloud-providers/azure/azure_vmss.go#L693-L707.

This will allow the caller of driver.GetVMs to handle not-found errors properly. Currently, the caller cannot tell whether the returned error is a "not found" error from the cloud provider. I think it should be up to the caller of driver.GetVMs to decide how to handle "not found" errors.

So a band-aid fix such as

diff --git a/pkg/driver/driver_azure.go b/pkg/driver/driver_azure.go
index 503fb6ba..d5c11f62 100644
--- a/pkg/driver/driver_azure.go
+++ b/pkg/driver/driver_azure.go
@@ -363,6 +363,14 @@ func (d *AzureDriver) GetVMs(machineID string) (result VMs, err error) {
                tags              = d.AzureMachineClass.Spec.Tags
        )

+       if _, err := clients.group.Get(ctx, resourceGroupName); err != nil {
+               if notFound(err) {
+                       return nil, nil
+               }
+               return nil, err
+       }
+
        listOfVMs, err := clients.getRelevantVMs(ctx, machineID, resourceGroupName, location, tags)
        if err != nil {
                return

is not the most proper one, I believe.

On the other hand, the GetVMs(machineID string) func itself does not seem well designed, as it returns a single VM when a non-empty machineID is passed and otherwise lists all VMs.

I am confused to be honest, so won't file any PRs about the issue. 😕

Hi @ialidzhikov ,

Thanks for reopening this issue and reopening a hidden bug. If we are fixing this, I think it would make more sense to make the fix part of the out-of-tree (OOT) provider, as Azure will also be moved OOT soon. Also, do you have an idea of how important this fix is and how much it impacts us? I ask because there are several priority fixes pending on MCM/Autoscaler, so I want to understand the urgency in order to prioritize this.

cc: @AxiomSamarth

It is rather a corner case but still needs a fix.

/priority 3

Okay sure. Thanks, will keep it in mind.

Should we create an issue here https://github.com/gardener/machine-controller-manager-provider-azure/issues to track that?

Good point. I have transferred it here.
The fix may need to be generic, but I am keeping the issue here.

Happened again last week.