gardener / machine-controller-manager-provider-azure

This repository is the out-of-tree implementation of the machine driver for the Azure cloud provider.

Azure: Improve timeout scenario during machine creation

hardikdr opened this issue

What would you like to be added: We recently faced a situation during machine creation where the synchronous Create() call was still waiting for a response from Azure, but the actual VM had already joined the cluster. The Create() call later timed out with the following message:

E0806 07:21:56.635403       1 driver_azure.go:1050] Azure ARM API call with x-ms-request-id=e215fbcb-edb0-43a4-9dc6-4d021e25f84f failed. VM.WaitForCompletionRef failed for shoot-garden-az-eu1-cpu-worker-57cf885b68-zdxsv: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
E0806 07:21:56.635492       1 machine.go:428] Error while creating machine shoot-garden-az-eu1-cpu-worker-57cf885b68-zdxsv: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded

As the Create() call to Azure failed, MCM decides to retry the creation by deleting the running VM.

  • Another side effect is that MCM doesn't update the machine status until the call returns, which could mislead the user.

Possible solution: We should improve MCM to better handle the response from the WaitForCompletionRef call, possibly understand the error, and consider the operation successful if status code 200 is seen.
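A rough sketch of what such handling could look like, assuming the driver still goes through a go-autorest azure.Future (the helper name and the SDK version in the import path are illustrative only, not the driver's actual code):

```go
package driver

import (
	"context"
	"errors"
	"net/http"
	"strings"

	"github.com/Azure/azure-sdk-for-go/services/compute/mgmt/2019-07-01/compute"
)

// waitForVMCreation tolerates a context-deadline error from WaitForCompletionRef
// when the last polled ARM response already reported HTTP 200, instead of
// treating it as a hard creation failure.
func waitForVMCreation(ctx context.Context, future *compute.VirtualMachinesCreateOrUpdateFuture, client compute.VirtualMachinesClient) error {
	err := future.WaitForCompletionRef(ctx, client.Client)
	if err == nil {
		return nil
	}
	// go-autorest wraps the context error, so check both the error chain and its text.
	if errors.Is(err, context.DeadlineExceeded) || strings.Contains(err.Error(), "context deadline exceeded") {
		if resp := future.Response(); resp != nil && resp.StatusCode == http.StatusOK {
			// ARM already returned 200 for the polling request; let the usual
			// node-join check decide success instead of deleting the VM.
			return nil
		}
	}
	return err
}
```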

@MSSedusch We have been seeing the WaitForCompletionRef call hang quite often recently. Sometimes the actual VM is created and joins the cluster even before this call returns. Do you see a need for this call, or could it simply be removed?

/ping @MSSedusch . We see this happening quite often in our landscapes now.

/ping @MSSedusch . We see this happening quite often in our landscapes now.

Could this be related to the issue that @dguendisch reported to Microsoft a few weeks ago, where the VM provisioning is not successful from the Azure side but the OS is able to join the K8s cluster?

@MSSedusch - Yes, I also checked the Azure console and found similar VMs whose provisioning was in a failed state. It was often observed while trying to spin up Standard_D8s_v3 and Standard_D32s_v3 machines on Azure. With these large machines, though, I don't think they were joining the cluster either.

So I guess it can happen that the machine joins the cluster (meaning the userdata is executed and the kubelet has registered with the apiserver) but the machine provisioning on Azure is not completed. Then the context passed to WaitForCompletionRef could get canceled due to the timeout, and as a consequence MCM will delete the machine again, even if the machine is ready in the cluster.

So can we add a taint to the node so that no pods are scheduled, and remove this taint again once the machine is also successfully provisioned on the infrastructure?

What you described above is correct. Adding short-lived taints, though, can bring more complexity and would be more of a band-aid than a solution.
Btw, I am curious: what is the significance of the FutureWaits here? What happens if we simply remove them? MCM will anyway trigger the deletion of the machine if the node doesn't join.

I don't think this would help. If the VM is able to join the cluster, as indicated by Dominic, but is not provisioned successfully from an Azure point of view, K8s will start scheduling workload on the node (potentially with PVCs) while the VM is unusable from the Azure PoV, and e.g. attach operations will fail.

I think making sure the node will not get any K8s workload until it is successfully provisioned in Azure is the right way, and not a workaround/band-aid.

@MSSedusch Hi Sebastian, I should have pitched in a bit earlier here; I am sorry I missed it. Having looked into this issue personally, I want to share some of my observations.

  • The VMs on Azure were not provisioned despite an extended timeout on the Machine Controller Manager.
  • Another tweak we made to handle this was to increase the context deadline, the client polling duration, and even the retry attempts around the WaitForCompletionRef call. However, even after more than an hour of waiting, the machine (if I remember correctly, it was a 64-CPU spec) neither got provisioned nor got a chance to join the cluster as a node.
  • Another VM with a lower spec got provisioned a few times; however, the time span was inconsistent. That is, a VM that sometimes got provisioned and joined as a node well within 30 minutes often failed to do the same even after an hour.

Those were my observations and steps. I am not sure if what I did is the right way, but I am keen on understanding how to handle this issue elegantly in MCM. :)

CC: @hardikdr @prashanth26 @dkistner

I think we should distinguish between the current issue of VMs with Garden Linux not being able to be provisioned, and the more generic issue of VMs that are not yet completely provisioned joining the cluster and potentially getting workload.

For the first one, I am currently trying to reproduce it.

For the second one, I really think nodes should not get workload until MCM is done provisioning the VM. Otherwise, K8s might already try to attach disks and start operations on a VM that is still being updated.

@MSSedusch Thanks! I am eager to know the results of your point 1.

I will discuss with Prashanth & Hardik regarding your point 2.

For the second one, I really think nodes should not get workload until MCM is done provisioning the VM. Otherwise, K8s might already try to attach disks and start operations on a VM that is still being updated.

@MSSedusch - I don't think any workload/disk can be attached to the VM until the VM has been provisioned and the kubelet on the VM has registered itself as a node in Kubernetes. We anyway wait for the node to join via Kubernetes before marking the VM as schedulable and healthy in MCM.

So keeping this in mind, we can probably get rid of this logic, right? Because currently, while trying to provision large machines, this call is stuck forever with or without proper timeouts, leading to context-canceled error messages on machine-creation failure. I have seen this call stuck for long periods (over 60 minutes in worse cases). I would rather wait for machines to join, as that is an asynchronous check, while this is a synchronous (blocking) call.

When do you mark the node as schedulable? When the node joins the K8s cluster, or when the VM is provisioned successfully on Azure and has joined the K8s cluster?

So right now, on other providers (AWS, OpenStack, etc.), we issue a create-VM call (non-blocking), and then when the VM joins as a K8s node (as we configure the kubelet via userData/cloud-config) we mark it as a healthy & running machine from the MCM perspective. So yes, we only wait until the node joins (with the assumption that provisioning is done once the kubelet has started successfully).

Also worth mentioning: even with today's code, although VM provisioning is a blocking call in the Azure driver, pod/PV scheduling is done by the KCM/CCM/kube-scheduler based on when the node object joins the cluster, independent of whether the call has finished in MCM.

OK, this behavior can be problematic IMHO. Workload should only be scheduled if the VM is also ready from the cloud provider's (e.g. Azure's) perspective. Joining the cluster might happen earlier than the VM being ready.

But again, this has nothing to do with the issue we currently have with Garden Linux. That is probably a different issue.

From yesterday's discussion, one way to handle this issue is to take action based upon the error returned by the WaitForCompletionRef method. That is, if the method has returned 200 OK, then do not trigger VM deletion.

cc: @hardikdr @prashanth26

To mention, with gardener/machine-controller-manager#525 we introduced a new phase: CrashLoopBackOff. The MachineSet now doesn't delete the machine object immediately on a creation-call failure; it rather retries the same operation on the same object after a short delay. I believe this should complement the solution for the issue we are seeing here.

Essentially, I'd expect the following:

  • A creation failure doesn't delete the machine immediately. As the create call on the same machine is idempotent, we wait again for the same machine to come up.
  • The creationTimeout is ineffective in the CrashLoopBackOff phase, which would mean that even after the default creation timeout of 20 minutes MCM won't delete the machine. @prashanth26 can you please confirm this? I have checked the code once already, though.

On top of the above, I'd suggest we look deeper into the WaitForCompletionRef call and identify which error codes it can return.

  • We can then handle the error more gracefully and take the right decision. cc @AxiomSamarth

@hardikdr I am not sure I fully understand what you are suggesting.

VMs that error out during creation must be deleted. Just retrying the create with the VM still existing won't work IMHO; the API should return an immediate error.

The original problem reported in this GitHub issue is:
"We recently faced a situation during machine creation where the synchronous Create() call was still waiting for a response from Azure, but the actual VM had already joined the cluster. The Create() call later timed out with the following message:"

IMHO you cannot prevent a VM from joining the cluster if the creation of the VM is going to fail. You can only prevent workload from being scheduled on that node, so that when the Create call fails you can delete the VM without harming the cluster.

Yes, that's correct. I missed the part that we need to block the workload on the joining nodes.
With a bit more thought, I also feel the creationTimeout should be effective even for the new phase; the machine object should be replaced after creationTimeout, but that's a separate discussion.

Essentially, I see the issue as two-fold.

  1. MCM needs to handle the WaitForCompletionRef call better. MCM should not simply replace the machine if the call returns an error; there could be multiple reasons why it returned one.

    • From the documentation, I see we are likely hitting the default context deadline of 15 minutes, borrowed from the DefaultPollDuration.
    • We should probably have a larger context deadline, or retry specifically on a canceled context.
  2. We need to block the workload from getting scheduled on the node even though the kubelet reports Ready.

    • We can probably bootstrap the kubelet with certain taints using the flags --register-schedulable / --register-with-taints []api.Taint, and make MCM responsible for removing these taints once the machine is successfully created, or creation is confirmed by Azure (see the sketch after this list).
    • This is better done with the OOT implementation for Azure; I wouldn't suggest modifying the current in-tree version, as it would likely impact common machine-controller code.
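As a sketch of that second point, MCM could drop such a bootstrap taint with client-go once Azure confirms provisioning. The function and taint key below are purely illustrative (and a patch instead of a full node update would likely be nicer in practice):

```go
package driver

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// removeBootstrapTaint removes the taint the kubelet registered with, once the
// provider reports the VM as successfully provisioned. Illustrative only.
func removeBootstrapTaint(ctx context.Context, client kubernetes.Interface, nodeName, taintKey string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	var kept []corev1.Taint
	for _, t := range node.Spec.Taints {
		if t.Key != taintKey {
			kept = append(kept, t)
		}
	}
	node.Spec.Taints = kept
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```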

Does this make sense @MSSedusch?

re 1: I would prefer increasing the deadline timeout over a retry. But I also don't see a huge problem with giving up after 15 minutes, deleting the VM, and trying again.

re 2: Yes, that makes sense.

re 1: I would prefer increasing the deadline timeout over a retry. But I also don't see a huge problem with giving up after 15 minutes, deleting the VM, and trying again.

We allow users to explicitly set the creationTimeout, the time period after which machine creation is aborted and retried with a fresh machine object, and that could be more than 15 minutes, specifically for large machines. We'd then have to set a higher context deadline, and add better error handling there, to respect that feature.
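For illustration, the polling deadline could be derived from that configurable creationTimeout rather than the SDK default; the helper below is only a sketch under that assumption, not existing MCM code:

```go
package driver

import (
	"context"
	"time"

	"github.com/Azure/go-autorest/autorest"
	"github.com/Azure/go-autorest/autorest/azure"
)

// waitWithCreationTimeout polls the ARM operation with a deadline derived from
// the machine's creationTimeout (plus a small buffer) instead of relying on the
// SDK's default ~15 minute polling duration.
func waitWithCreationTimeout(ctx context.Context, future *azure.Future, client autorest.Client, creationTimeout time.Duration) error {
	pollCtx, cancel := context.WithTimeout(ctx, creationTimeout+2*time.Minute)
	defer cancel()
	return future.WaitForCompletionRef(pollCtx, client)
}
```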

I also have a couple of general questions:

  • Recently we have seen multiple occurrences of VM creation failing/timing out after 15 minutes. This sounds like a long time for provisioning a VM; is this a known issue at Azure? Is there any way we could track the progress there?
  • Once the kubelet reports NodeReady status, it should mean at least the OS has been loaded and the root disk has been configured/mounted, along with networking/NIC. Out of curiosity, do you know what could go wrong that causes VM provisioning to fail while the kubelet is reporting NodeReady, or is it something specific to Azure's provisioning mechanism?

We are currently investigating two issues. The first one is with the SUSE CHOST 15 SP2 image, the other one with Garden Linux. Both of them lead to long deployment times or deployments that never finish. Under normal circumstances, VMs should be deployed in a few minutes.

In the case of the Garden Linux image, it seems like the OS is not sending a heartbeat to Azure although it is fully booted. That could explain why the VM is functional and can join the cluster but never finishes provisioning from an Azure point of view.

In the case of the Garden Linux image, it seems like the OS is not sending a heartbeat to Azure although it is fully booted. That could explain why the VM is functional and can join the cluster but never finishes provisioning from an Azure point of view.

/cc @vpnachev

it seems like the OS is not sending a heartbeat to Azure although it is fully booted

What sort of heartbeat does Azure expect from the OS on bootup? Is there any documentation we can refer to on this?

@jolusch You have mentioned internal references in public. Please check.

Any updates on this topic, @MSSedusch @hardikdr? We ran into this issue again while deleting shoots (about 60 of our pipeline shoots are currently affected).

Hi @jolusch ,

Any updates on this topic, @MSSedusch @hardikdr? We ran into this issue again while deleting shoots (about 60 of our pipeline shoots are currently affected).

How is this affecting your clusters on shoot deletions? :o Do you mean creations? If yes, then below is the answer.

In the past, we have usually observed that the default creation timeout for machines of about 20 minutes is enough in most cases; however, in the case of very large machines we have seen it cause issues. You could try to increase the creation timeout for your workers by setting the creationTimeout on your shoot and increasing it to 30 minutes, maybe? However, I suspect that this may not completely eliminate your problem. We will try to dig deeper to find a better mechanism for interacting with the Azure SDKs to increase the timeouts that cause these cases.

Also, my gut says that this is not an issue on our end, as currently we don't set any timeout (we plan to change this) while creating machines on Azure; we use the timeout only after the creation of the VM, while waiting for machines to join. I suspect this issue occurs due to slower machine provisioning on Azure. However, yes, we need to take another look at this issue.

/priority 2
cc: @AxiomSamarth

I don't think this is an issue in MCM. On Azure we do not set any timeout on VM creation; however, we do have a timeout on the maximum time given for the VM to register itself as a node after VM creation. Looking at the code, I suspect there is not much we can do from MCM here, as the code merely calls the Azure SDKs and currently the context values have no timeouts set. So whatever timeout within which Azure fails to provision looks like an infrastructure provisioning delay.

The MCM Provider Azure code now leverages the new Azure SDK, so it needs to be observed whether we see this issue again. Secondly, gardener/machine-controller-manager#868 will introduce a new taint, node.machine.sapcloud.io/instance-not-initialized, that will be placed on the node via the kubelet's --register-with-taints flag and will prevent any workload from being scheduled onto the node until the VM Create + VM Initialize operations have completed successfully.
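For reference, the mechanism would roughly look like this in Go; only the taint key comes from the PR above, while the NoSchedule effect, the flag value, and the variable name are assumptions for illustration:

```go
package driver

import corev1 "k8s.io/api/core/v1"

// Taint the kubelet registers with, e.g. via
//   --register-with-taints=node.machine.sapcloud.io/instance-not-initialized=:NoSchedule
// MCM would remove it once VM Create + VM Initialize have completed successfully.
// The NoSchedule effect is an assumption; only the key is taken from the PR above.
var instanceNotInitializedTaint = corev1.Taint{
	Key:    "node.machine.sapcloud.io/instance-not-initialized",
	Effect: corev1.TaintEffectNoSchedule,
}
```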