Intermittent Azure API fault results in zombie NatGateway and persistent shoot creation failure

Question

andrerun opened this issue a year ago

How to categorize this issue?
/area robustness
/kind bug
/platform azure

This ticket tracks an issue for which a short-term technical solution is not possible. It has, however, caused substantial pain and a perception of poor Gardener robustness for one or more customers. A customer experienced multiple persistent shoot creation failures and was forced to manually clean up infrastructure objects created by Gardener. The goal of this ticket is to communicate customer impact and potentially drive or inform a longer-term change.

What happened:
In the context of a shoot creation workflow, Azure reported a NatGateway creation failure due to throttling, but still created a NatGateway object in a failed state. Terraform did not adopt the newly created gateway. The gateway object was abandoned as a zombie: Gardener would not delete it, and its clashing name disrupted further attempts by Gardener to create the NatGateway required for shoot creation. The outcome is a shoot whose creation fails persistently, plus an infrastructure object that requires manual cleanup.

The presumed Azure throttling restriction is subscription-specific, so a single occurrence affects only one Gardener customer. In an automated scenario, however, it is likely to result in multiple failed shoots for that customer.

The problem cannot be immediately resolved in Gardener, because the underlying cause, as currently understood, is a conflict between Azure's failure mode in that specific scenario, and Terraform. TBD: A more precise description of these underlying mechanics is to be added to this ticket shortly.

Anything else we need to know?:

Environment:

Gardener version (if relevant):
Extension version: TBD
Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
Others:

Andrey Anastasov · Answer 1 · Fri Apr 28 2023 21:55:25 GMT+0800 (China Standard Time)

@kon-angelo can you please add a short description of the specific mechanics which cause Terraform not to adopt the created NatGateway, and which make a fix in Gardener code impractical?

Konstantinos Angelopoulos · Answer 2 · Thu May 04 2023 22:49:02 GMT+0800 (China Standard Time)

@andrerun Terraform templates have two directives for declaring resources: resource and data. The first instructs TF to create and manage a resource, the latter to look up an existing one without creating it. You can see that for our use case we use the resource directive.

In the scenario that you describe, Azure tries to provision an NGW but the provisioning fails. I am not 100% sure how the original Terraform run ended (error or timeout), but after the first run there is an NGW resource listed on Azure which is not imported into the TF state because the creation call was unsuccessful. All subsequent Terraform runs therefore fail, because TF complains about any existing resource that is not in its state, and you are essentially deadlocked.
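The deadlock described above can be reproduced in a small in-memory model. This is only a sketch under stated assumptions: `fakeCloud`, `apply`, and the error values are hypothetical names invented for illustration, and they do not reflect how Terraform or the Azure API are implemented. The model assumes that a throttled create call still leaves an object behind in a `Failed` state, and that the tool records an object in its state only after a successful create.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrThrottled mimics a create call that the cloud rejects as throttled.
var ErrThrottled = errors.New("create call throttled")

// ErrAlreadyExists mimics the error reported when an object exists in the
// cloud but is missing from the tool's own state.
var ErrAlreadyExists = errors.New("resource already exists and is not in state")

// fakeCloud is a hypothetical in-memory stand-in for the provider API.
type fakeCloud struct {
	objects  map[string]string // object name -> provisioning state
	throttle bool
}

// create registers the object even when the call fails, which is the
// behaviour assumed in this issue.
func (c *fakeCloud) create(name string) error {
	if _, ok := c.objects[name]; ok {
		return ErrAlreadyExists
	}
	if c.throttle {
		c.objects[name] = "Failed"
		return ErrThrottled
	}
	c.objects[name] = "Succeeded"
	return nil
}

// apply models one run of a tool that creates whatever is absent from its
// state and records an object only after a successful create.
func apply(c *fakeCloud, state map[string]bool, name string) error {
	if state[name] {
		return nil
	}
	if err := c.create(name); err != nil {
		return err
	}
	state[name] = true
	return nil
}

func main() {
	cloud := &fakeCloud{objects: map[string]string{}, throttle: true}
	state := map[string]bool{}

	fmt.Println("run 1:", apply(cloud, state, "shoot-natgw"))
	cloud.throttle = false // throttling ends, but the zombie object remains
	fmt.Println("run 2:", apply(cloud, state, "shoot-natgw"))
	fmt.Println("run 3:", apply(cloud, state, "shoot-natgw"))
}
```

In this model the first run fails with the throttling error, and every later run fails with the "already exists" error even though throttling has stopped, because nothing ever removes the failed object or adds it to the state.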

"which make a fix in Gardener code impractical?"

The communication between TF and Azure is opaque from Gardener's perspective. We simply declare the target state and let TF perform the operations. As you can see, there are edge cases that prevent these operations from completing.

In any case, while the throttling is active no operation such as update or delete can go through, hence there is no way to proceed.
What would be ideal is for Gardener to have a way to break the deadlock post-incident. So what can we do?

The first suggestion would be to try and adopt the created resources.

AFAIK, there is no officially documented way to "create and/or adopt" resources (rather, it's an XOR per the directives described above).
You can perform this operation manually by using "terraform import" commands. But you have to remember that TF is a tool primarily designed to be run from the command line, and at the moment we don't have the automation in place to intelligently parse the command output and directly import that NGW into the state.
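As a rough illustration of what the manual import involves, the sketch below prepares (but does not run) a `terraform import ADDRESS ID` invocation for the orphaned gateway. The resource address `azurerm_nat_gateway.nat`, the working directory, and the subscription, resource group, and gateway names are placeholders assumed for the example; the real address depends on the resource block in the extension's Terraform template. The helper names `natGatewayID` and `importCommand` are likewise hypothetical.

```go
package main

import (
	"fmt"
	"os/exec"
)

// natGatewayID builds the Azure resource ID of a NAT gateway from its
// subscription, resource group, and name.
func natGatewayID(subscription, resourceGroup, name string) string {
	return fmt.Sprintf(
		"/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Network/natGateways/%s",
		subscription, resourceGroup, name)
}

// importCommand prepares `terraform import ADDRESS ID` without running it.
// The address must match a resource block in the configuration found in
// workDir; the caller decides whether and when to execute the command.
func importCommand(workDir, address, id string) *exec.Cmd {
	cmd := exec.Command("terraform", "import", address, id)
	cmd.Dir = workDir
	return cmd
}

func main() {
	// All values below are placeholders, not real identifiers.
	id := natGatewayID(
		"00000000-0000-0000-0000-000000000000",
		"shoot--example--rg",
		"shoot--example--natgw",
	)
	cmd := importCommand("/tmp/tf-workdir", "azurerm_nat_gateway.nat", id)
	fmt.Println(cmd.Args)
}
```

Building the command is the easy part. The effort mentioned above lies in deciding when an import is appropriate, interpreting Terraform's output, and handling the case where the imported object is in a failed state.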

The other thing that you could do is delete the resource directly in Azure, but again this requires quite a bit of effort to orchestrate, similar to my previous point.
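A minimal sketch of the delete option, assuming a cleanup step that runs before the next Terraform run and removes the gateway only if it is found in a `Failed` provisioning state. `NatGatewayAPI`, `cleanupFailed`, and `memAPI` are hypothetical names; the interface is not the Azure SDK, and a real implementation would also need to confirm that the object belongs to the shoot before deleting anything.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is returned by Get when the object does not exist.
var ErrNotFound = errors.New("not found")

// NatGatewayAPI is a hypothetical, minimal view of a cloud client.
type NatGatewayAPI interface {
	Get(name string) (provisioningState string, err error)
	Delete(name string) error
}

// cleanupFailed deletes the named gateway only if it exists in a failed
// state. It reports whether a deletion happened.
func cleanupFailed(api NatGatewayAPI, name string) (bool, error) {
	state, err := api.Get(name)
	if errors.Is(err, ErrNotFound) {
		return false, nil // nothing to clean up
	}
	if err != nil {
		return false, err
	}
	if state != "Failed" {
		return false, nil // healthy or still provisioning: leave it alone
	}
	if err := api.Delete(name); err != nil {
		return false, err
	}
	return true, nil
}

// memAPI is an in-memory fake used only to exercise cleanupFailed.
type memAPI map[string]string

func (m memAPI) Get(name string) (string, error) {
	state, ok := m[name]
	if !ok {
		return "", ErrNotFound
	}
	return state, nil
}

func (m memAPI) Delete(name string) error {
	delete(m, name)
	return nil
}

func main() {
	api := memAPI{"zombie-natgw": "Failed", "healthy-natgw": "Succeeded"}
	for _, name := range []string{"zombie-natgw", "healthy-natgw", "missing-natgw"} {
		deleted, err := cleanupFailed(api, name)
		fmt.Println(name, deleted, err)
	}
}
```

The sketch leaves out the parts that make this hard in practice: the delete call is itself subject to throttling, and the cleanup has to be coordinated with Terraform runs so that the two do not race.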

TL;DR: for the time being we rely heavily on Terraform. TF is useful for declaratively managing infra resources, but we have little ability to intervene and change its behavior. Introducing workarounds like the ones I mentioned above as a wrapper around the current Terraform setup is likely to cause more issues than it solves. Because these workarounds require a lot of effort to integrate into the extension, the likely solution is instead to proceed with our Terraform removal story, which would give us more control over such incidents.