machine-controller-manager cannot delete Machine when the resource group is already deleted
ialidzhikov opened this issue · comments
What happened:
$ k -n shoot--foo--bar get machines shoot--foo--bar-bar-z3-5bf676f44d-kt27p -o yaml
status:
  currentStatus:
    lastUpdateTime: "2020-11-20T07:52:09Z"
    phase: Terminating
  lastOperation:
    description: 'Cloud provider message - compute.VirtualMachinesClient#List: Failure
      responding to request: StatusCode=404 -- Original Error: autorest/azure: Service
      returned an error. Status=404 Code="ResourceGroupNotFound" Message="Resource
      group ''shoot--foo--bar'' could not be found."'
    lastUpdateTime: "2020-11-20T07:52:09Z"
    state: Failed
    type: Delete
What you expected to happen:
machine-controller-manager should be able to delete a Machine when the resource group is already deleted. For example, the Azure cloud-controller-manager handles this case and can delete a Service of type LoadBalancer when the underlying resource group is already deleted.
How to reproduce it (as minimally and precisely as possible):
- Create an Azure Machine
- Delete the underlying resource group
- Observe that machine-controller-manager cannot delete the Machine CR
Anything else we need to know:
Environment:
- machine-controller-manager version: v0.34.3
/kind bug
/platform azure
/cc @dkistner
/reopen
There is one more case that needs to be resolved. It occurs on deletion of a Machine that does not yet have a providerID.
In pkg/controller/machine.go, the machineDelete func has special handling for the case where the providerID does not exist (see https://github.com/gardener/machine-controller-manager/blob/42c0dc4dbe0077867b109334d428e2fe64082e4d/pkg/controller/machine.go#L746-L768). It tries to list all VMs and find a machine with a corresponding name. The driver.GetVMs("") call fails with ResourceGroupNotFound, hence the Machine cannot be deleted.
Potential steps to reproduce would be:
- Trigger machine creation
- Delete the underlying resource group while the new machine is being created
- Verify that the Machine from step 1 fails to be created
- Delete the Machine
- Make sure that the Machine deletion fails as described above.
To have a proper fix for the issue described above, I think we should introduce error codes like InstanceNotFound, as is done in the upstream cloud-controller-manager. See for example https://github.com/kubernetes/kubernetes/blob/b0abe89ae259d5e891887414cb0e5f81c969c697/staging/src/k8s.io/legacy-cloud-providers/azure/azure_vmss.go#L693-L707.
This will allow the caller of driver.GetVMs to properly handle not found errors. Currently the caller of driver.GetVMs does not know whether the returned error is a "not found" error returned by the cloud provider. I think it should be up to the caller of driver.GetVMs to decide how to handle "not found" errors.
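To make the idea concrete, here is a minimal sketch of what a coded driver error could look like. It is not the MCM or cloud-controller-manager implementation: the names Code, NotFound, DriverError, IsNotFound, getVMs and deleteMachine are all hypothetical, and getVMs only simulates the provider call.

```go
package main

import (
	"errors"
	"fmt"
)

// Code classifies a driver failure. Hypothetical type, not part of MCM.
type Code int

const (
	Unknown Code = iota
	NotFound
)

// DriverError wraps a cloud provider error together with a Code.
type DriverError struct {
	Code Code
	Err  error
}

func (e *DriverError) Error() string {
	return fmt.Sprintf("driver error (code %d): %v", e.Code, e.Err)
}

// Unwrap lets errors.Is and errors.As see the underlying provider error.
func (e *DriverError) Unwrap() error { return e.Err }

// IsNotFound reports whether err, or an error it wraps, is a DriverError
// with code NotFound.
func IsNotFound(err error) bool {
	var de *DriverError
	return errors.As(err, &de) && de.Code == NotFound
}

// getVMs stands in for driver.GetVMs. The boolean simulates whether the
// resource group still exists on the provider side.
func getVMs(resourceGroupExists bool) ([]string, error) {
	if !resourceGroupExists {
		return nil, &DriverError{Code: NotFound, Err: errors.New("ResourceGroupNotFound")}
	}
	return []string{"vm-1"}, nil
}

// deleteMachine is the caller: it decides what "not found" means for deletion.
func deleteMachine(resourceGroupExists bool) error {
	vms, err := getVMs(resourceGroupExists)
	if IsNotFound(err) {
		// Nothing is left on the provider side, so deletion can proceed.
		return nil
	}
	if err != nil {
		return err
	}
	_ = vms // real code would look up and delete the matching VM here
	return nil
}

func main() {
	fmt.Println(deleteMachine(false))
	fmt.Println(deleteMachine(true))
}
```

With this shape the driver only reports what happened, and the deletion path is free to treat "not found" as success while other callers can still treat it as a failure.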
So a band-aid fix such as
diff --git a/pkg/driver/driver_azure.go b/pkg/driver/driver_azure.go
index 503fb6ba..d5c11f62 100644
--- a/pkg/driver/driver_azure.go
+++ b/pkg/driver/driver_azure.go
@@ -363,6 +363,14 @@ func (d *AzureDriver) GetVMs(machineID string) (result VMs, err error) {
         tags = d.AzureMachineClass.Spec.Tags
     )
+    if _, err := clients.group.Get(ctx, resourceGroupName); err != nil {
+        if notFound(err) {
+            return nil, nil
+        }
+        return nil, err
+    }
+
     listOfVMs, err := clients.getRelevantVMs(ctx, machineID, resourceGroupName, location, tags)
     if err != nil {
         return
is not the most proper one, I believe.
On the other hand, the GetVMs(machineID string) func itself does not seem right either, as it returns a single VM when a non-empty machineID is passed and otherwise lists all VMs.
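One possible way to untangle the two behaviours is to split them into separate functions. The sketch below is only an illustration under assumptions: fakeDriver, GetVM, ListVMs and findIDByName are hypothetical names, the IDs are made up, and VMs is assumed to be a map from machine ID to machine name.

```go
package main

import "fmt"

// VMs maps a provider machine ID to a machine name (assumed shape).
type VMs map[string]string

// fakeDriver is a hypothetical in-memory stand-in for a provider driver.
type fakeDriver struct{ vms VMs }

// GetVM looks up exactly one VM by machine ID; ok is false if it is absent.
func (d *fakeDriver) GetVM(machineID string) (name string, ok bool) {
	name, ok = d.vms[machineID]
	return name, ok
}

// ListVMs returns a copy of every VM the driver knows about.
func (d *fakeDriver) ListVMs() VMs {
	out := make(VMs, len(d.vms))
	for id, name := range d.vms {
		out[id] = name
	}
	return out
}

// findIDByName covers the deletion path for a Machine without a providerID:
// list everything, then match on the machine name.
func findIDByName(d *fakeDriver, machineName string) (string, bool) {
	for id, name := range d.ListVMs() {
		if name == machineName {
			return id, true
		}
	}
	return "", false
}

func main() {
	d := &fakeDriver{vms: VMs{"id-1": "machine-a"}}
	fmt.Println(d.GetVM("id-1"))
	fmt.Println(findIDByName(d, "machine-a"))
	fmt.Println(findIDByName(d, "missing"))
}
```

Each function then has a single meaning, so a "not found" result from a lookup by ID can be told apart from an empty or failed listing.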
I am confused to be honest, so won't file any PRs about the issue. 😕
/assign @prashanth26
Hi @ialidzhikov ,
Thanks for reopening this issue and uncovering a hidden bug. If we are fixing this, I think it would make more sense to make the fix part of the OOT (out-of-tree) provider, as Azure will also be moved OOT soon. Also, do you have an idea of how important this fix is and how much it impacts us? I ask because there are several priority fixes pending on MCM/Autoscaler, so I want to understand the urgency in order to prioritize this.
cc : @AxiomSamarth
It is rather a corner case but still needs a fix.
/priority 3
Okay sure. Thanks, will keep it in mind.
Should we create an issue here https://github.com/gardener/machine-controller-manager-provider-azure/issues to track that?
Good point. I have transferred it here.
Maybe the fix needs to be generic. However, keeping the issue here.
Happened again last week.