machine-controller-manager cannot delete Machine when the resource group is already deleted

ialidzhikov opened this issue 4 years ago · comments

What happened:

$ k -n shoot--foo--bar get machines shoot--foo--bar-bar-z3-5bf676f44d-kt27p -o yaml

status:
  currentStatus:
    lastUpdateTime: "2020-11-20T07:52:09Z"
    phase: Terminating
  lastOperation:
    description: 'Cloud provider message - compute.VirtualMachinesClient#List: Failure
      responding to request: StatusCode=404 -- Original Error: autorest/azure: Service
      returned an error. Status=404 Code="ResourceGroupNotFound" Message="Resource
      group ''shoot--foo--bar'' could not be found."'
    lastUpdateTime: "2020-11-20T07:52:09Z"
    state: Failed
    type: Delete

What you expected to happen:
machine-controller-manager should be able to delete a Machine when the resource group is already deleted. For example, the Azure cloud-controller-manager handles this case and can delete a Service of type LoadBalancer when the underlying resource group is already deleted.

How to reproduce it (as minimally and precisely as possible):

1. Create an Azure Machine
2. Delete the underlying resource group
3. Delete the Machine and observe that machine-controller-manager cannot delete the Machine CR

Anything else we need to know:

Environment:

machine-controller-manager version: v0.34.3

Ismail Alidzhikov commented 3 years ago

/reopen

Ismail Alidzhikov · Answer 1 · Fri Nov 20 2020 16:11:29 GMT+0800 (China Standard Time)

/kind bug
/platform azure
/cc @dkistner

Ismail Alidzhikov · Answer 2 · Tue Apr 06 2021 16:54:41 GMT+0800 (China Standard Time)

There is one more case that needs to be resolved. It occurs on deletion of a Machine that does not yet have a providerID.

In pkg/controller/machine.go, the machineDelete func has special handling for the case where the providerID does not exist - see https://github.com/gardener/machine-controller-manager/blob/42c0dc4dbe0077867b109334d428e2fe64082e4d/pkg/controller/machine.go#L746-L768. It tries to list all VMs and find a machine with a corresponding name. The driver.GetVMs("") call fails with ResourceGroupNotFound, hence the Machine cannot be deleted.

Potential steps to reproduce would be:

1. Trigger machine creation
2. Delete the underlying resource group while the new machine is being created
3. Verify that the Machine from step 1 fails to be created
4. Delete the Machine
5. Verify that the Machine deletion fails as described above.

Ismail Alidzhikov · Answer 3 · Tue Apr 06 2021 19:20:27 GMT+0800 (China Standard Time)

To have a proper fix for the issue described above, I think we should introduce codes like InstanceNotFound as it is done in the upstream cloud-controller-manager. See for example https://github.com/kubernetes/kubernetes/blob/b0abe89ae259d5e891887414cb0e5f81c969c697/staging/src/k8s.io/legacy-cloud-providers/azure/azure_vmss.go#L693-L707.

This will allow the caller of driver.GetVMs to properly handle not found errors. Currently the caller of driver.GetVMs does not know whether the returned error is a "not found" error returned by the cloud provider. I think that it should be up to the caller of driver.GetVMs how to handle "not found" errors.

So a band-aid fix such as

diff --git a/pkg/driver/driver_azure.go b/pkg/driver/driver_azure.go
index 503fb6ba..d5c11f62 100644
--- a/pkg/driver/driver_azure.go
+++ b/pkg/driver/driver_azure.go
@@ -363,6 +363,14 @@ func (d *AzureDriver) GetVMs(machineID string) (result VMs, err error) {
                tags              = d.AzureMachineClass.Spec.Tags
        )

+       if _, err := clients.group.Get(ctx, resourceGroupName); err != nil {
+               if notFound(err) {
+                       return nil, nil
+               }
+               return nil, err
+       }
+
        listOfVMs, err := clients.getRelevantVMs(ctx, machineID, resourceGroupName, location, tags)
        if err != nil {
                return

is not the most proper one, I believe.

On the other hand, the GetVMs(machineID string) func itself does not seem right either: it returns a single VM when a non-empty machineID is passed, and otherwise lists all VMs.

I am confused to be honest, so won't file any PRs about the issue. 😕

Ismail Alidzhikov · Answer 4 · Tue Apr 06 2021 19:20:34 GMT+0800 (China Standard Time)

/assign @prashanth26

Prashanth · Answer 5 · Tue Apr 06 2021 23:47:23 GMT+0800 (China Standard Time)

Hi @ialidzhikov ,

Thanks for reopening this issue and uncovering a hidden bug. If we are fixing this, I guess it would make more sense to do the fix in the out-of-tree (OOT) provider, as Azure will also be moved OOT soon. Also, do you have an idea of how important this fix is and how much it impacts us? I ask because there are several priority fixes pending on MCM/Autoscaler, so I wanted to understand the urgency in order to prioritize this.

cc : @AxiomSamarth

Ismail Alidzhikov · Answer 6 · Wed Apr 07 2021 03:22:49 GMT+0800 (China Standard Time)

It is rather a corner case but still needs a fix.

/priority 3

Prashanth · Answer 7 · Wed Apr 07 2021 11:02:12 GMT+0800 (China Standard Time)

Okay sure. Thanks, will keep it in mind.

Dominic Kistner · Answer 8 · Wed Apr 07 2021 15:45:37 GMT+0800 (China Standard Time)

Should we create an issue here https://github.com/gardener/machine-controller-manager-provider-azure/issues to track that?

Prashanth · Answer 9 · Wed Apr 07 2021 16:09:06 GMT+0800 (China Standard Time)

> Should we create an issue here https://github.com/gardener/machine-controller-manager-provider-azure/issues to track that?

Good point. I have transferred it here.
Maybe the fix needs to be generic. However, keeping the issue here.

Dieter Guendisch · Answer 10 · Tue Nov 02 2021 20:15:41 GMT+0800 (China Standard Time)

Happened again last week.