gardener / machine-controller-manager-provider-azure

This repository contains the out-of-tree implementation of the machine driver for the Azure cloud provider

Orphaned Azure NICs block subnet deletion afterwards

ialidzhikov opened this issue

What happened:
I quite often see failed Azure Shoot deletions with an error similar to:

-> Pod 'bar.infra.tf-destroy-jv2w9' reported:
* Error deleting Subnet "shoot--foo--bar-nodes" (Virtual Network "shoot--foo--bar" / Resource Group "shoot--foo--bar"): network.SubnetsClient#Delete: Failure sending request: StatusCode=400 -- Original Error: Code="InUseSubnetCannotBeDeleted" Message="Subnet shoot--foo--bar-nodes is in use by /subscriptions/<omitted>/resourceGroups/shoot--foo--bar/providers/Microsoft.Network/networkInterfaces/shoot--foo--bar-cpu-worker-0-z1-7d4cd95c59-p5qgg-nic/ipConfigurations/shoot--foo--bar-cpu-worker-0-z1-7d4cd95c59-p5qgg-nic and cannot be deleted. In order to delete the subnet, delete all the resources within the subnet. See aka.ms/deletesubnet." Details=[]

Releasing state lock. This may take a few moments...

Basically, the Worker (and machine-controller-manager) resources are deleted, but for some reason the NICs are not, and they block the subnet deletion afterwards (during the Infrastructure deletion).

I can see that the NIC still exists:

$ az network nic show -g shoot--foo--bar -n shoot--foo--bar-cpu-worker-0-z1-7d4cd95c59-p5qgg-nic

What you expected to happen:
No NICs to be orphaned.

How to reproduce it (as minimally and precisely as possible):
Not clear for now.

Anything else we need to know:
canary # 3637
live # 730
live # 2263
live # 2273

Environment:

  • machine-controller-manager version: v0.34.3 for buggy Delete() API call
  • machine-controller-manager version: v0.48.0 for buggy Get() API call

Hi @ialidzhikov,

How often have you observed this? And do you remember when the first occurrence of something like this was? Because when I look at the change history of the Azure driver, I don't see any significant change in the last few months (at least for v0.34.3, as there is a minor change in master) - https://github.com/gardener/machine-controller-manager/commits/rel-v0.34.0/pkg/driver/driver_azure.go.

Also, I just checked the VM deletion logic: the machine deletion deletes the VM along with its Disk and NIC. Orphan NICs and disks are also checked for even after deletion here - https://github.com/gardener/machine-controller-manager/blob/master/pkg/driver/driver_azure.go#L308-L323. I am not sure how this situation occurs :o
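For reference, below is a minimal sketch of such an orphan-NIC check, assuming the older Azure SDK for Go (the services/network/mgmt packages; the exact API version in the import path, the helper name and the client wiring are illustrative, not the actual driver code):

package driver

import (
	"context"
	"fmt"
	"net/http"

	"github.com/Azure/azure-sdk-for-go/services/network/mgmt/2019-06-01/network"
	"github.com/Azure/go-autorest/autorest"
)

// cleanupOrphanNIC deletes the NIC conventionally named "<machineName>-nic"
// if it is still present after the VM has been deleted.
func cleanupOrphanNIC(ctx context.Context, nicClient network.InterfacesClient, resourceGroup, machineName string) error {
	nicName := machineName + "-nic" // naming convention taken from the logs in this issue

	_, err := nicClient.Get(ctx, resourceGroup, nicName, "")
	if err != nil {
		if derr, ok := err.(autorest.DetailedError); ok && derr.StatusCode == http.StatusNotFound {
			return nil // no leftovers, nothing to do
		}
		return fmt.Errorf("checking for orphan NIC %q: %v", nicName, err)
	}

	// The NIC still exists, so treat it as an orphan and delete it.
	future, err := nicClient.Delete(ctx, resourceGroup, nicName)
	if err != nil {
		return fmt.Errorf("deleting orphan NIC %q: %v", nicName, err)
	}
	// Block until the long-running delete operation is reported as finished.
	return future.WaitForCompletionRef(ctx, nicClient.Client)
}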

How often have you observed this? And do you remember when the first occurrence of something like this was?

Well, I think it happens relatively rarely for now. I cannot be sure when the first occurrence was. @dkistner, I expect you have also observed this issue, right?
I have now forwarded you 4 occurrences of this issue over Slack (I also resolved 1 occurrence last night).

/platform azure

Okay, thanks for the hints and info. Will try to look into it and see if I can find something.

Seen it here and there as well. @prashanth26, are you interested in particular instances? If yes, I'll ping you the next time I see such a thing.

Okay, yes, @ialidzhikov has shared an example of the same with me. However, I couldn't conclude anything useful from it until now.

/priority critical
/assign @dkistner @kon-angelo @prashanth26

We have seen occurrences of NICs still being present even after successful NIC deletion call responses from the Azure APIs.

After discussion with @MSSedusch, it was decided that the possibility of parallel create/delete calls, due to a bug in the orphan VM handling logic, might be increasing the occurrence of this issue. This has been fixed by gardener/machine-controller-manager#589.

However, we suspect that this is probably only a side effect and that something more on the Azure infrastructure side is affecting this. After this fix we will keep a watch on this issue and see if it reoccurs.

This PR should fix the issue. Please reopen if it is seen again; it will have to be investigated from the Azure side in that case.

/close

We are seeing this issue again, hence reopening the issue.

These are the latest logs:

{"log":"Controller test-md created machine test-md-8rwbt","pid":"1","severity":"INFO","source":"controller_utils.go:599"}
2021-02-11 23:46:21	
{"log":"Event(v1.ObjectReference{Kind:\"MachineSet\", Namespace:\"test-ns\", Name:\"test-md\", UID:\"36278dd2-86aa-4215-8d87-ac3273e8cc5f\", APIVersion:\"machine.sapcloud.io/v1alpha1\", ResourceVersion:\"2828794802\", FieldPath:\"\"}): type: 'Normal' reason: 'SuccessfulCreate' Created Machine: test-md-8rwbt","pid":"1","severity":"INFO","source":"event.go:255"}
2021-02-11 23:46:21	
{"log":"Creating machine \"test-md-8rwbt\", please wait!","pid":"1","severity":"INFO","source":"machine.go:421"}
2021-02-11 23:46:27	
{"log":"NIC delete started for \"test-md-8rwbt-nic\"","pid":"1","severity":"INFO","source":"driver_azure.go:1038"}
2021-02-11 23:46:27	
{"log":"NIC deleted for \"test-md-8rwbt-nic\"","pid":"1","severity":"INFO","source":"driver_azure.go:1049"}
2021-02-11 23:46:27	
{"log":"VM.CreateOrUpdate failed for test-md-8rwbt: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=\u003cnil\u003e Code=\"OperationNotAllowed\" Message=\"Operation could not be completed as it results in exceeding approved standardDSv3Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: westeurope, Current Limit: 700, Current Usage: 656, Additional Required: 64, (Minimum) New Limit Required: 720. Submit a request for Quota increase at https://url by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests.\"","pid":"1","severity":"ERR","source":"driver_azure.go:1102"}
2021-02-11 23:46:27	
{"log":"Error while creating machine test-md-8rwbt: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=\u003cnil\u003e Code=\"OperationNotAllowed\" Message=\"Operation could not be completed as it results in exceeding approved standardDSv3Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: westeurope, Current Limit: 700, Current Usage: 656, Additional Required: 64, (Minimum) New Limit Required: 720. Submit a request for Quota increase at https://url by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests.\"","pid":"1","severity":"ERR","source":"machine.go:478"}
2021-02-11 23:46:27	
{"log":"Creating machine \"test-md-8rwbt\", please wait!","pid":"1","severity":"INFO","source":"machine.go:421"}
2021-02-11 23:46:30	
{"log":"NIC delete started for \"test-md-8rwbt-nic\"","pid":"1","severity":"INFO","source":"driver_azure.go:1038"}
2021-02-11 23:46:30	
{"log":"NIC deleted for \"test-md-8rwbt-nic\"","pid":"1","severity":"INFO","source":"driver_azure.go:1049"}
2021-02-11 23:46:30	
{"log":"VM.CreateOrUpdate failed for test-md-8rwbt: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=\u003cnil\u003e Code=\"OperationNotAllowed\" Message=\"Operation could not be completed as it results in exceeding approved standardDSv3Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: westeurope, Current Limit: 700, Current Usage: 656, Additional Required: 64, (Minimum) New Limit Required: 720. Submit a request for Quota increase at https://url by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests.\"","pid":"1","severity":"ERR","source":"driver_azure.go:1102"}
2021-02-11 23:46:30	
{"log":"Error while creating machine test-md-8rwbt: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=\u003cnil\u003e Code=\"OperationNotAllowed\" Message=\"Operation could not be completed as it results in exceeding approved standardDSv3Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: westeurope, Current Limit: 700, Current Usage: 656, Additional Required: 64, (Minimum) New Limit Required: 720. Submit a request for Quota increase at https://url by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests.\"","pid":"1","severity":"ERR","source":"machine.go:478"}
2021-02-11 23:46:30	
{"log":"Not updating the status of the machine object \"test-md-8rwbt\" , as it is already same","pid":"1","severity":"INFO","source":"machine.go:843"}
2021-02-11 23:48:27	
{"log":"Creating machine \"test-md-8rwbt\", please wait!","pid":"1","severity":"INFO","source":"machine.go:421"}
2021-02-11 23:48:30
..
..
..
{"log":"Found nic with name \"test-md-nic\", hence appending machine \"test-md\"","pid":"1","severity":"INFO","source":"driver_azure.go:812"}
2021-02-12 03:32:59	
{"log":"Check for disk leftovers belonging to deleted machine \"test-md\"","pid":"1","severity":"INFO","source":"driver_azure.go:1020"}
2021-02-12 03:32:59	
{"log":"Check for NIC leftovers belonging to deleted machine \"test-md\"","pid":"1","severity":"INFO","source":"driver_azure.go:1002"}
2021-02-12 03:32:59	
{"log":"Found orphan NIC \"test-md-nic\" belonging to deleted machine \"test-md\"","pid":"1","severity":"INFO","source":"driver_azure.go:1011"}
2021-02-12 03:32:59	
{"log":"NIC delete started for \"test-md-nic\"","pid":"1","severity":"INFO","source":"driver_azure.go:1038"}
2021-02-12 03:32:59	
{"log":"NIC deleted for \"test-md-nic\"","pid":"1","severity":"INFO","source":"driver_azure.go:1049"}
2021-02-12 03:32:59	
{"log":"SafetyController: Orphan VM found and terminated VM: test-md, azure:///westeurope/test-md","pid":"1","severity":"INFO","source":"machine_safety.go:645"}
2021-02-12 03:33:00	
{"log":"Found nic with name \"test-md-nic\", hence appending machine \"test-md\"","pid":"1","severity":"INFO","source":"driver_azure.go:812"}
2021-02-12 03:33:01	
{"log":"Check for disk leftovers belonging to deleted machine \"test-md\"","pid":"1","severity":"INFO","source":"driver_azure.go:1020"}
2021-02-12 03:33:01	
{"log":"Check for NIC leftovers belonging to deleted machine \"test-md\"","pid":"1","severity":"INFO","source":"driver_azure.go:1002"}
2021-02-12 03:33:01	
{"log":"Found orphan NIC \"test-md-nic\" belonging to deleted machine \"test-md\"","pid":"1","severity":"INFO","source":"driver_azure.go:1011"}
2021-02-12 03:33:01	
{"log":"NIC delete started for \"test-md-nic\"","pid":"1","severity":"INFO","source":"driver_azure.go:1038"}
2021-02-12 03:33:01	
{"log":"NIC deleted for \"test-md-nic\"","pid":"1","severity":"INFO","source":"driver_azure.go:1049"}
2021-02-12 03:33:01	
{"log":"SafetyController: Orphan VM found and terminated VM: test-md, azure:///westeurope/test-md","pid":"1","severity":"INFO","source":"machine_safety.go:645"}
2021-02-12 03:33:02	
{"log":"Found nic with name \"test-md-nic\", hence appending machine \"test-md\"","pid":"1","severity":"INFO","source":"driver_azure.go:812"}
2021-02-12 03:33:02	
{"log":"Check for disk leftovers belonging to deleted machine \"test-md\"","pid":"1","severity":"INFO","source":"driver_azure.go:1020"}
2021-02-12 03:33:02	
{"log":"Check for NIC leftovers belonging to deleted machine \"test-md\"","pid":"1","severity":"INFO","source":"driver_azure.go:1002"}
2021-02-12 03:33:02	
{"log":"Found orphan NIC \"test-md-nic\" belonging to deleted machine \"test-md\"","pid":"1","severity":"INFO","source":"driver_azure.go:1011"}
2021-02-12 03:33:02	
{"log":"NIC delete started for \"test-md-nic\"","pid":"1","severity":"INFO","source":"driver_azure.go:1038"}
2021-02-12 03:33:02	
{"log":"NIC deleted for \"test-md-nic\"","pid":"1","severity":"INFO","source":"driver_azure.go:1049"}
2021-02-12 03:33:02	
{"log":"SafetyController: Orphan VM found and terminated VM: test-md, azure:///westeurope/test-md","pid":"1","severity":"INFO","source":"machine_safety.go:645"}

From the look of the logs, it seems like NICFuture.WaitForCompletionRef() is not being respected; this can be confirmed with the timestamps, as it took less than 6 seconds to create and delete the NIC in the code flow (during VM creation failures). This call is used during both creation and deletion. I guess when creation succeeds it doesn't cause issues.

The parallel create/delete occurs when NIC creation succeeds but VM creation fails due to a cloud provider error (e.g. quota issues), as NICFuture.WaitForCompletionRef() does not seem to be respected. I shall try to reproduce this locally to check.
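To make the suspected sequence concrete, here is a hedged sketch of that create path with the older SDK (import versions, function name and client wiring are illustrative): NIC create, failed VM create, NIC delete, where the final WaitForCompletionRef() is the call that, per the timestamps above, appears to return before Azure has actually finished.

package driver

import (
	"context"

	"github.com/Azure/azure-sdk-for-go/services/compute/mgmt/2019-07-01/compute"
	"github.com/Azure/azure-sdk-for-go/services/network/mgmt/2019-06-01/network"
)

func createMachine(ctx context.Context, nicClient network.InterfacesClient, vmClient compute.VirtualMachinesClient,
	resourceGroup, machineName string, nicSpec network.Interface, vmSpec compute.VirtualMachine) error {

	nicName := machineName + "-nic"

	// 1. Create the NIC and wait for the long-running operation to finish.
	nicFuture, err := nicClient.CreateOrUpdate(ctx, resourceGroup, nicName, nicSpec)
	if err != nil {
		return err
	}
	if err := nicFuture.WaitForCompletionRef(ctx, nicClient.Client); err != nil {
		return err
	}

	// 2. Create the VM that references the NIC.
	vmFuture, err := vmClient.CreateOrUpdate(ctx, resourceGroup, machineName, vmSpec)
	if err == nil {
		err = vmFuture.WaitForCompletionRef(ctx, vmClient.Client)
	}
	if err == nil {
		return nil
	}

	// 3. VM creation failed (e.g. quota exceeded) -> roll back the NIC.
	delFuture, delErr := nicClient.Delete(ctx, resourceGroup, nicName)
	if delErr == nil {
		// This is the wait that, judging by the timestamps above, seems to
		// return long before the deletion has actually completed in Azure.
		delErr = delFuture.WaitForCompletionRef(ctx, nicClient.Client)
	}
	if delErr != nil {
		return delErr
	}
	return err // surface the original VM creation error so the machine is retried
}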

However, as you can see at the end, there are orphan NIC deletion calls sent which, as before, aren't handled properly by Azure anyway.

cc: @MSSedusch @dkistner @AxiomSamarth

We have confirmed that NICFuture.WaitForCompletionRef() is definitely not respected during deletion. As we can see, this call returns within 10s of creation (with a successful HTTP response); however, the Azure activity monitor logs show success only after 2 minutes.

And in the case of creation, this operation doesn't make sense, as the Azure SDK doesn't provide this async handle to check the status. So we have reached out to Azure with a ticket to get some insights on how this can be improved both for creation and deletion, and why these NICs end up in this ghost/orphan state.

However, in the meanwhile, we will try to bandage the issue by issuing a GET() prior to NIC creation, and by confirming NIC deletion with GET() calls until no NIC is found for a given machine object.
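A rough sketch of that bandage, again assuming the older SDK (helper names, import version and polling values are illustrative, not the actual change): a Get() guard before NIC creation, and a Get() loop after the delete until the NIC really returns 404.

package driver

import (
	"context"
	"fmt"
	"net/http"
	"time"

	"github.com/Azure/azure-sdk-for-go/services/network/mgmt/2019-06-01/network"
	"github.com/Azure/go-autorest/autorest"
)

// nicNotFound reports whether an error from the NIC client means "404 Not Found".
func nicNotFound(err error) bool {
	derr, ok := err.(autorest.DetailedError)
	return ok && derr.StatusCode == http.StatusNotFound
}

// nicAlreadyExists is the pre-creation guard: skip creating a new NIC if one
// with the expected name is already present.
func nicAlreadyExists(ctx context.Context, nicClient network.InterfacesClient, resourceGroup, nicName string) (bool, error) {
	_, err := nicClient.Get(ctx, resourceGroup, nicName, "")
	if err == nil {
		return true, nil
	}
	if nicNotFound(err) {
		return false, nil
	}
	return false, err
}

// ensureNICDeleted issues the delete and then confirms with Get() until the NIC is gone.
func ensureNICDeleted(ctx context.Context, nicClient network.InterfacesClient, resourceGroup, nicName string) error {
	future, err := nicClient.Delete(ctx, resourceGroup, nicName)
	if err != nil {
		return err
	}
	if err := future.WaitForCompletionRef(ctx, nicClient.Client); err != nil {
		return err
	}

	// Do not trust the completed future alone; confirm via Get() polling.
	for i := 0; i < 30; i++ {
		_, err := nicClient.Get(ctx, resourceGroup, nicName, "")
		if nicNotFound(err) {
			return nil // the deletion has really taken effect
		}
		if err != nil {
			return err
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("NIC %q still present after delete", nicName)
}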

One of the suggested improvements was to use the GETNICs() API call to confirm NIC deletions. We have implemented this here - gardener/machine-controller-manager#594. We hope that this fix reduces these errors; however, the underlying issue still occurs on Azure. We have raised a ticket with Azure to find out the reason for it.

This is prominently reproducible when hitting VM quotas.

/reopen
As we are seeing this issue again. We are trying to gain some clarity from the Azure support colleagues, as WaitForCompletionRef and GetNICs don't seem to return the real state of the resources.

We have done the best we can in our code by adding double checks. However, the Azure API does not seem to adhere to the expected behaviour, which has been confirmed by the MS Azure team.

This has now been taken care of by the Azure team. Orphan NICs have lately not been observed after the changes Azure made on their end. I shall close this issue. It can be reopened if such issues occur again in the future.

/reopen
as we saw ~5 new occurrences of this issue

Post Grooming discussion + Updates

We reached out to Azure and explained that they had a problem with the Get NIC API call. They responded with the following:

We have completed the analysis of your issue with an incorrect Get response being returned for four hours after a resource deletion and recreation.

By default when a resource is deleted, Azure Resource Manager creates a resource consistency job with a delayed start time and a resource long operation job to ensure that the resource is deprovisioned correctly in the control plane caches. For the Network resource provider, they also send a notification to Azure Resource Manager to immediately execute the first delayed resource consistency job when the Network resource provider finishes their asynchronous operation of deprovisioning a deleted resource.
In this case, we believe that a race condition occurred between the background job (resource consistency job) and the Put request which caused the resource to be incorrectly deleted in the Azure Resource Manager cache but successfully provisioned in the Network resource provider.
In most circumstances, the resource would not have shown up again after some time. However, in these circumstances, the resource long operation job from the initial Delete request created a resource consistency job when it completed to make sure the resource was deprovisioned correctly in Azure Resource Manager.
When this resource consistency job executed, it never got a 404 from the Network RP since the resource had been re-provisioned by the Put resource request. There is a 4-hour timeout on these jobs, and when the timeout occurred, the job "re-provisioned" the resource in the ARM cache which caused it to finally appear for the user.
As the problem was caused by a race condition, it is expected to occur infrequently and sporadically. Even though the system will ultimately ensure that the resource state is consistent, it may take up to four hours, as observed in this case. To address this issue, the Azure Resource Manager team will reduce the timeout, which should shorten the time it takes for the resource to return to its expected state. A more permanent solution is currently under consideration.

Since we don't have a permanent solution, we have decided NOT to depend on the Get() call before issuing a Delete() call, but to issue the Delete() directly.
We will still keep the Get() after the Delete() to ensure the NIC is gone.

In the new Azure SDK:

In case the GetVM call returns a 404 (not found) but there still exists a NIC that has the VirtualMachine SubResource set (non-empty), and we try to issue a DELETE for the NIC, then it will return the following error:

RESPONSE 400: 400 Bad Request
ERROR CODE: NicInUse
--------------------------------------------------------------------------------
{
  "error": {
    "code": "NicInUse",
    "message": "Network Interface /subscriptions/82b44c79-a5d4-4d74-8ff8-8639e79c1c39/resourceGroups/shoot--mb-garden--sdktest/providers/Microsoft.Network/networkInterfaces/shoot--mb-garden--sdktest-worker-bingo-nic-alpha is used by existing resource /subscriptions/82b44c79-a5d4-4d74-8ff8-8639e79c1c39/resourceGroups/shoot--mb-garden--sdktest/providers/Microsoft.Compute/virtualMachines/shoot--mb-garden--sdktest-worker-bingo. In order to delete the network interface, it must be dissociated from the resource. To learn more, see aka.ms/deletenic.",
    "details": []
  }
}

This will be returned to MCM and a retry will happen, so there is no real need to make a GET call.
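A hedged sketch of how that error can be handled with the newer track-2 SDK (armnetwork + azcore; the helper names and the retriable-error wrapper are illustrative, not the actual provider code): issue the Delete directly and, if Azure answers with the NicInUse error shown above, surface it as retriable so MCM simply retries.

package driver

import (
	"context"
	"errors"
	"fmt"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork"
)

// retriableError marks errors on which the caller (MCM) is expected to retry.
type retriableError struct{ err error }

func (e retriableError) Error() string { return e.err.Error() }

func deleteNIC(ctx context.Context, client *armnetwork.InterfacesClient, resourceGroup, nicName string) error {
	poller, err := client.BeginDelete(ctx, resourceGroup, nicName, nil)
	if err == nil {
		_, err = poller.PollUntilDone(ctx, nil)
	}
	if err == nil {
		return nil
	}

	var respErr *azcore.ResponseError
	if errors.As(err, &respErr) && respErr.ErrorCode == "NicInUse" {
		// The NIC is still attached to a VM (the 404-on-GetVM case described
		// above); surface it as retriable so the next reconciliation tries again.
		return retriableError{err: fmt.Errorf("NIC %q still in use, will retry: %w", nicName, err)}
	}
	return err
}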

/close

fixed by #105