[CPM] Restoration of cluster fails if its `Infrastructure` resource on the source `Seed` was annotated with `migration.azure.provider.extensions.gardener.cloud/zone`
plkokanov opened this issue · comments
How to categorize this issue?
/area control-plane-migration
/kind bug
/platform azure
What happened:
During control plane migration of an HA shoot cluster (using zones `z1`, `z2`, and `z3`), for which the infrastructure resource is annotated with `migration.azure.provider.extensions.gardener.cloud/zone`, the infrastructure resource is not successfully restored with the following error:
* creating Subnet: (Name "<vnet-name>-nodes-z3" / Virtual Network Name "<vnet-name>" / Resource Group "<resource-group-name>"): network.SubnetsClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="NetcfgSubnetRangesOverlap" Message="Subnet '<vnet-name>-nodes-z3' is not valid because its IP address range overlaps with that of an existing subnet in virtual network '<vnet-name>'." Details=[]
with azurerm_subnet.workers-z3,
on main.tf line 167, in resource "azurerm_subnet" "workers-z3":
167: resource "azurerm_subnet" "workers-z3" {
* deleting Subnet: (Name "<vnet-name>-nodes" / Virtual Network Name "<vnet-name>" / Resource Group "<resource-group-name>"): network.SubnetsClient#Delete: Failure sending request: StatusCode=400 -- Original Error: Code="InUseSubnetCannotBeDeleted" Message="Subnet<vnet-name>-nodes is in use by /subscriptions/<omitted>/resourceGroups/<resource-group-name>/providers/Microsoft.Network/networkInterfaces/<nic-id>-NIC/ipConfigurations/<nic-id>-NIC and cannot be deleted. In order to delete the subnet, delete all the resources within the subnet. See aka.ms/deletesubnet." Details=[]]
Basically, during the `restore` phase of control plane migration for the infrastructure resource, the `provider-azure` extension tried to delete the `<vnet-name>-nodes` subnet and create `<vnet-name>-nodes-z3`. This seems to have happened because the infrastructure resource in the destination seed did not have a `migration.azure.provider.extensions.gardener.cloud/zone: "3"` annotation.
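To illustrate why the missing annotation leads to a delete-and-create of subnets, here is a minimal sketch in Go. It is a simplified model, not the extension's actual code: the function `subnetName` and its naming rule are assumptions derived only from the subnet names in the error above (`<vnet-name>-nodes` versus `<vnet-name>-nodes-z3`).

```go
package main

import "fmt"

// Annotation key taken from this issue.
const zoneMigrationAnnotation = "migration.azure.provider.extensions.gardener.cloud/zone"

// subnetName is a hypothetical helper (not part of provider-azure).
// Assumption: the zone named in the annotation keeps the pre-existing
// "<vnet>-nodes" subnet, while every other zone gets "<vnet>-nodes-z<zone>".
func subnetName(vnet, zone string, annotations map[string]string) string {
	if migrated, ok := annotations[zoneMigrationAnnotation]; ok && migrated == zone {
		return vnet + "-nodes"
	}
	return vnet + "-nodes-z" + zone
}

func main() {
	annotated := map[string]string{zoneMigrationAnnotation: "3"}

	// Source seed: the annotation is present, so the existing subnet is reused.
	fmt.Println(subnetName("myvnet", "3", annotated)) // myvnet-nodes

	// Destination seed: the annotation is missing, so a new subnet name is
	// computed and the old subnet becomes a candidate for deletion.
	fmt.Println(subnetName("myvnet", "3", nil)) // myvnet-nodes-z3
}
```

Under this model, the desired subnet name for zone 3 changes between the source and destination seed, which matches the overlapping-range and in-use errors reported by Azure.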
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
The `migration.azure.provider.extensions.gardener.cloud/zone` annotation is put on the infrastructure resource via a mutating webhook here:
gardener-extension-provider-azure/pkg/webhook/infrastructure/layout.go
Lines 132 to 141 in b859d7b
In this case, this mutating code did not get executed because of the following:
1. As part of normal reconciliation of the infrastructure resource, its `.status.providerStatus` field is saved in the `.status.state.providerStatus`.
2. During the `migrate` phase of CPM, `gardenlet` takes this `.status.state.savedProviderStatus` and saves it in the `ShootState`.
3. During the `restore` phase of CPM, `gardenlet` creates an infrastructure resource in the destination seed, then it copies the `.status.state.savedProviderStatus` from the `ShootState` and adds it to the infrastructure's `.status.state.savedProviderStatus` field.
4. Afterwards, `gardenlet` annotates the infrastructure resource with `gardener.cloud/operation: restore` to trigger restoration.
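The four steps above can be modeled with a small Go sketch. All types and functions here (`Infrastructure`, `ShootState`, `reconcile`, `migrate`, `restore`) are hypothetical, heavily simplified stand-ins and not the real Gardener API; the point is only to show that the saved state travels to the destination seed while the zone annotation does not.

```go
package main

import "fmt"

const zoneAnnotation = "migration.azure.provider.extensions.gardener.cloud/zone"

// Infrastructure is a hypothetical, simplified stand-in for the extension resource.
type Infrastructure struct {
	Annotations    map[string]string
	ProviderStatus string // models .status.providerStatus
	SavedState     string // models the provider status saved under .status.state
}

// ShootState is a hypothetical stand-in for the object that carries state between seeds.
type ShootState struct {
	InfrastructureState string
}

// reconcile models step 1: the provider status is saved into the state.
func reconcile(infra *Infrastructure) {
	infra.SavedState = infra.ProviderStatus
}

// migrate models step 2: the saved state is copied into the ShootState.
func migrate(src Infrastructure) ShootState {
	return ShootState{InfrastructureState: src.SavedState}
}

// restore models steps 3 and 4: a fresh resource is created in the destination
// seed, the state is copied back, and the restore operation annotation is set.
// Annotations from the source resource are not carried over in this model.
func restore(state ShootState) Infrastructure {
	dst := Infrastructure{Annotations: map[string]string{}}
	dst.SavedState = state.InfrastructureState
	dst.Annotations["gardener.cloud/operation"] = "restore"
	return dst
}

func main() {
	src := Infrastructure{
		Annotations:    map[string]string{zoneAnnotation: "3"},
		ProviderStatus: `{"networks":{"layout":"MultipleSubnet"}}`,
	}
	reconcile(&src)
	dst := restore(migrate(src))

	_, hasZone := dst.Annotations[zoneAnnotation]
	fmt.Println("state restored:", dst.SavedState != "") // state restored: true
	fmt.Println("zone annotation present:", hasZone)     // zone annotation present: false
}
```

In this model the only way for the destination resource to regain the zone annotation is for the mutating webhook to add it again.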
During the updates to the infrastructure resource in steps 3 and 4, the mutating webhook does not make any changes as it exits early due to these checks:
gardener-extension-provider-azure/pkg/webhook/infrastructure/layout.go
Lines 117 to 130 in b859d7b
Even if the `status.providerState` is patched with the one from the `status.state.providerState`, the mutating webhook would still not perform any changes because the `status.providerState` would contain the following:
"providerStatus": {
"apiVersion": "azure.provider.extensions.gardener.cloud/v1alpha1",
"availabilitySets": [],
"kind": "InfrastructureStatus",
"networks": {
"layout": "MultipleSubnet",
Hence nil is returned here:
gardener-extension-provider-azure/pkg/webhook/infrastructure/layout.go
Lines 128 to 130 in b859d7b
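The early-exit behavior described above can be sketched as follows. This is not the code from `layout.go` (which is not reproduced in this issue); `InfrastructureStatus`, `needsZoneAnnotation`, the nil check, and the `SingleSubnet` layout value are assumptions made for illustration. Only the `MultipleSubnet` case is taken directly from the status snippet shown above.

```go
package main

import "fmt"

// InfrastructureStatus is a hypothetical, reduced view of the provider status.
type InfrastructureStatus struct {
	Layout string // models .networks.layout
}

const layoutMultipleSubnet = "MultipleSubnet"

// needsZoneAnnotation is a hypothetical helper that reports whether the
// webhook would go on to add the zone migration annotation.
func needsZoneAnnotation(status *InfrastructureStatus) bool {
	// Assumption: without any provider status the webhook has nothing to
	// compare against and exits early.
	if status == nil {
		return false
	}
	// From the issue: a status that already reports the MultipleSubnet
	// layout makes the webhook return without changes.
	if status.Layout == layoutMultipleSubnet {
		return false
	}
	return true
}

func main() {
	// Freshly created resource in the destination seed: no provider status yet.
	fmt.Println(needsZoneAnnotation(nil)) // false

	// Status patched from the saved state: layout is already MultipleSubnet.
	fmt.Println(needsZoneAnnotation(&InfrastructureStatus{Layout: layoutMultipleSubnet})) // false

	// Assumed pre-migration layout name, where the annotation would be added.
	fmt.Println(needsZoneAnnotation(&InfrastructureStatus{Layout: "SingleSubnet"})) // true
}
```

Both situations that occur during restore fall into an early-exit branch in this sketch, so the annotation is never re-added on the destination seed.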
What you expected to happen:
Cluster to be restored successfully.
Environment:
- Gardener version (if relevant):
- Extension version:
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration:
- Others:
/assign