gardener / gardener-extension-provider-azure

Gardener extension controller for the Azure cloud provider (https://azure.microsoft.com).

Home Page: https://gardener.cloud

[CPM] Restoration of cluster fails if its `Infrastructure` resource on the source `Seed` was annotated with `migration.azure.provider.extensions.gardener.cloud/zone`

plkokanov opened this issue

How to categorize this issue?

/area control-plane-migration
/kind bug
/platform azure

What happened:
During control plane migration of an HA shoot cluster (using zones z1, z2, and z3) whose infrastructure resource is annotated with migration.azure.provider.extensions.gardener.cloud/zone, restoration of the infrastructure resource fails with the following error:

* creating Subnet: (Name "<vnet-name>-nodes-z3" / Virtual Network Name "<vnet-name>" / Resource Group "<resource-group-name>"): network.SubnetsClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="NetcfgSubnetRangesOverlap" Message="Subnet '<vnet-name>-nodes-z3' is not valid because its IP address range overlaps with that of an existing subnet in virtual network '<vnet-name>'." Details=[]
  with azurerm_subnet.workers-z3,
  on main.tf line 167, in resource "azurerm_subnet" "workers-z3":
 167: resource "azurerm_subnet" "workers-z3" {
* deleting Subnet: (Name "<vnet-name>-nodes" / Virtual Network Name "<vnet-name>" / Resource Group "<resource-group-name>"): network.SubnetsClient#Delete: Failure sending request: StatusCode=400 -- Original Error: Code="InUseSubnetCannotBeDeleted" Message="Subnet<vnet-name>-nodes is in use by /subscriptions/<omitted>/resourceGroups/<resource-group-name>/providers/Microsoft.Network/networkInterfaces/<nic-id>-NIC/ipConfigurations/<nic-id>-NIC and cannot be deleted. In order to delete the subnet, delete all the resources within the subnet. See aka.ms/deletesubnet." Details=[]]

Basically, during the restore phase of control plane migration for the infrastructure resource, the provider-azure extension tried to delete the <vnet-name>-nodes subnet and create <vnet-name>-nodes-z3. This seems to have happened because the infrastructure resource in the destination seed did not have a migration.azure.provider.extensions.gardener.cloud/zone: "3" annotation.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

The migration.azure.provider.extensions.gardener.cloud/zone annotation is put on the infrastructure resource via a mutating webhook here:

for _, z := range newProviderCfg.Networks.Zones {
	if z.CIDR == *oldProviderCfg.Networks.Workers {
		extensionswebhook.LogMutation(logger, newInfra.Kind, newInfra.Namespace, newInfra.Name)
		if newInfra.Annotations == nil {
			newInfra.Annotations = make(map[string]string)
		}
		newInfra.Annotations[azuretypes.NetworkLayoutZoneMigrationAnnotation] = helper.InfrastructureZoneToString(z.Name)
		return nil
	}
}
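
To make the matching rule in the excerpt above easier to follow, here is a minimal, self-contained sketch of the same idea. It is not the extension's code: the `zone` type, the `annotateMigratedZone` helper, and the CIDR values are hypothetical simplifications, and `strconv.Itoa` is only assumed to produce the same string as `helper.InfrastructureZoneToString`. Only the annotation key is taken from this issue.

```go
package main

import (
	"fmt"
	"strconv"
)

// zoneMigrationAnnotation is the annotation key quoted in this issue.
const zoneMigrationAnnotation = "migration.azure.provider.extensions.gardener.cloud/zone"

// zone is a hypothetical stand-in for one entry of the zoned network config.
type zone struct {
	Name int32
	CIDR string
}

// annotateMigratedZone looks for the zone whose CIDR equals the CIDR of the
// old single workers subnet. If one is found, it records that zone's name
// under the migration annotation (allocating the map when it is nil) and
// reports true. Otherwise the annotations are returned unchanged.
func annotateMigratedZone(annotations map[string]string, oldWorkersCIDR string, zones []zone) (map[string]string, bool) {
	for _, z := range zones {
		if z.CIDR == oldWorkersCIDR {
			if annotations == nil {
				annotations = make(map[string]string)
			}
			annotations[zoneMigrationAnnotation] = strconv.Itoa(int(z.Name))
			return annotations, true
		}
	}
	return annotations, false
}

func main() {
	// Illustrative CIDRs only: zone 3 reuses the range of the old workers subnet.
	zones := []zone{
		{Name: 1, CIDR: "10.250.1.0/24"},
		{Name: 2, CIDR: "10.250.2.0/24"},
		{Name: 3, CIDR: "10.250.0.0/24"},
	}
	annotations, ok := annotateMigratedZone(nil, "10.250.0.0/24", zones)
	fmt.Println(ok, annotations[zoneMigrationAnnotation]) // prints "true 3"
}
```

In this sketch the annotation marks the zone that inherits the existing subnet. Without it, nothing tells the extension that the z3 range is already in use by the <vnet-name>-nodes subnet, which is consistent with the NetcfgSubnetRangesOverlap error above.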

In this case, this mutating code did not get executed because of the following:

  1. As part of normal reconciliation of the infrastructure resource, its .status.providerStatus field is saved in .status.state.savedProviderStatus.
  2. During the migrate phase of CPM, gardenlet takes this .status.state.savedProviderStatus and saves it in the ShootState.
  3. During the restore phase of CPM, gardenlet creates an infrastructure resource in the destination seed, then copies the .status.state.savedProviderStatus from the ShootState into the infrastructure's .status.state.savedProviderStatus field.
  4. Afterwards, gardenlet annotates the infrastructure resource with gardener.cloud/operation: restore to trigger restoration.
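
The four steps above can be modelled roughly as follows. This is only a sketch under stated assumptions: every type and function here (`infrastructure`, `shootState`, `reconcile`, `migrate`, `restore`) is a hypothetical simplification and not a Gardener API, and the status is reduced to a plain string. The point it illustrates is that only the saved state travels through the ShootState, so an annotation set on the source object does not reappear on the restored one.

```go
package main

import "fmt"

// Annotation keys quoted in this issue.
const (
	zoneAnnotation      = "migration.azure.provider.extensions.gardener.cloud/zone"
	operationAnnotation = "gardener.cloud/operation"
)

// infraStatus is a hypothetical reduction of the infrastructure status.
type infraStatus struct {
	ProviderStatus      string // stands in for .status.providerStatus
	SavedProviderStatus string // stands in for .status.state.savedProviderStatus
}

// infrastructure is a hypothetical reduction of the infrastructure resource.
type infrastructure struct {
	Annotations map[string]string
	Status      infraStatus
}

// shootState is a hypothetical reduction of the ShootState.
type shootState struct {
	InfrastructureState string
}

// reconcile models step 1: the provider status is saved into the state.
func reconcile(infra *infrastructure) {
	infra.Status.SavedProviderStatus = infra.Status.ProviderStatus
}

// migrate models step 2: the saved state is stored in the ShootState.
func migrate(src *infrastructure, ss *shootState) {
	ss.InfrastructureState = src.Status.SavedProviderStatus
}

// restore models steps 3 and 4: a fresh object is created in the destination
// seed, the saved state is copied back, and the restore operation is requested.
// Nothing here copies annotations from the source object.
func restore(ss *shootState) *infrastructure {
	dst := &infrastructure{Annotations: map[string]string{}}
	dst.Status.SavedProviderStatus = ss.InfrastructureState
	dst.Annotations[operationAnnotation] = "restore"
	return dst
}

func main() {
	src := &infrastructure{
		Annotations: map[string]string{zoneAnnotation: "3"},
		Status:      infraStatus{ProviderStatus: `{"networks":{"layout":"MultipleSubnet"}}`},
	}
	reconcile(src)
	ss := &shootState{}
	migrate(src, ss)
	dst := restore(ss)

	_, hasZone := dst.Annotations[zoneAnnotation]
	fmt.Println("state restored:", dst.Status.SavedProviderStatus == src.Status.ProviderStatus)
	fmt.Println("zone annotation present on destination:", hasZone)
}
```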

During the updates to the infrastructure resource in steps 3 and 4, the mutating webhook does not make any changes because it exits early due to these checks:

if oldInfra.Status.ProviderStatus != nil {
	oldProviderStatus, err = helper.InfrastructureStatusFromRaw(oldInfra.Status.ProviderStatus)
	if err != nil {
		return fmt.Errorf("could not mutate object: %v", err)
	}
}
// take care of clusters that have not been reconciliated for a long time (hibernated etc). In this case they may
// not have the Layout field populated.
if oldProviderStatus != nil &&
	oldProviderStatus.Networks.Layout != "" &&
	oldProviderStatus.Networks.Layout != azure.NetworkLayoutSingleSubnet {
	return nil
}

Even if .status.providerStatus were patched with the value from .status.state.savedProviderStatus, the mutating webhook would still not make any changes, because .status.providerStatus would contain the following:

  "providerStatus": {
    "apiVersion": "azure.provider.extensions.gardener.cloud/v1alpha1",
    "availabilitySets": [],
    "kind": "InfrastructureStatus",
    "networks": {
      "layout": "MultipleSubnet",

Hence nil is returned here:

	oldProviderStatus.Networks.Layout != azure.NetworkLayoutSingleSubnet {
	return nil
}
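
The effect of this guard can be checked in isolation with a small sketch. The types and the `skipsMutation` helper below are hypothetical simplifications, not the webhook's code. The value "MultipleSubnet" is taken from the status excerpt above, while "SingleSubnet" is only an assumption about the value of azure.NetworkLayoutSingleSubnet.

```go
package main

import "fmt"

const (
	// layoutSingleSubnet is an assumed value for azure.NetworkLayoutSingleSubnet.
	layoutSingleSubnet = "SingleSubnet"
	// layoutMultipleSubnet is the layout shown in the status excerpt of this issue.
	layoutMultipleSubnet = "MultipleSubnet"
)

// networkStatus and infrastructureStatus are hypothetical reductions of the
// decoded provider status, keeping only the field the guard inspects.
type networkStatus struct {
	Layout string
}

type infrastructureStatus struct {
	Networks networkStatus
}

// skipsMutation mirrors the guard quoted above: the webhook returns early
// (and adds no annotation) when a previous status exists and its layout is
// set to anything other than the single-subnet layout.
func skipsMutation(old *infrastructureStatus) bool {
	return old != nil &&
		old.Networks.Layout != "" &&
		old.Networks.Layout != layoutSingleSubnet
}

func main() {
	cases := []struct {
		name   string
		status *infrastructureStatus
	}{
		{"no previous status", nil},
		{"layout not populated", &infrastructureStatus{}},
		{"single subnet", &infrastructureStatus{Networks: networkStatus{Layout: layoutSingleSubnet}}},
		{"multiple subnets (restored status)", &infrastructureStatus{Networks: networkStatus{Layout: layoutMultipleSubnet}}},
	}
	for _, c := range cases {
		fmt.Printf("%s: skip=%t\n", c.name, skipsMutation(c.status))
	}
}
```

Under these assumptions only the last case skips the mutation, which matches the restored status described in this issue: the layout is already MultipleSubnet, so the zone annotation is never added in the destination seed.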

What you expected to happen:
Cluster to be restored successfully.

Environment:

  • Gardener version (if relevant):
  • Extension version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:

/assign