Cluster with virtual kubelet blocking NEG sync
marwanad opened this issue · comments
We have a cluster with some virtual kubelet (VK) nodes, which have no provider IDs. After a GKE upgrade moved the ingress pods to new hosts, we got the error below on the ingress Service,
with the NEGs failing to add any endpoints.
Warning SyncNetworkEndpointGroupFailed 35m (x10 over 27h) neg-controller Failed to sync NEG "k8s1-endpoint-bla" (will not retry): Failed to lookup NEG in zone "", candidate zones map[us-central1-a:{} us-central1-b:{} us-central1-c:{} us-central1-f:{}], err - Key Key{"k8s1-endpoint-bla"} not valid for zonal resource NetworkEndpointGroup k8s1-endpoint-bla
We tracked this to be the below codepath:
ingress-gce/pkg/neg/syncers/utils.go
Line 703 in 51ddd0b
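For readers without the permalink at hand, here is a minimal sketch of the kind of check that could produce the "not valid for zonal resource" error above. It is not the ingress-gce code at that line. The `zonalKey` type and its `validate` method are hypothetical names invented for illustration, assuming only that a zonal resource lookup needs both a name and a non-empty zone.

```go
package main

import "fmt"

// zonalKey is a hypothetical stand-in for the key used to address a zonal
// GCE resource such as a NetworkEndpointGroup. It is not the real type.
type zonalKey struct {
	Name string
	Zone string
}

// validate rejects keys that cannot address a zonal resource.
func (k zonalKey) validate() error {
	if k.Name == "" {
		return fmt.Errorf("key has no name")
	}
	if k.Zone == "" {
		return fmt.Errorf("key %q has no zone, not valid for a zonal resource", k.Name)
	}
	return nil
}

func main() {
	keys := []zonalKey{
		{Name: "k8s1-endpoint-bla", Zone: "us-central1-a"},
		{Name: "k8s1-endpoint-bla", Zone: ""}, // the zone "" case from the event
	}
	for _, k := range keys {
		if err := k.validate(); err != nil {
			fmt.Println("lookup rejected:", err)
			continue
		}
		fmt.Printf("lookup ok: %s in %s\n", k.Name, k.Zone)
	}
}
```

Under this assumption, a single empty zone reaching the lookup is enough to fail the sync for that NEG, which matches the symptom in the event.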
After removing the virtual nodes, the NEGs sync again. It is still unclear, though, how an empty zone string makes it there:
ingress-gce/pkg/utils/zonegetter/zone_getter.go
Lines 163 to 175 in 51ddd0b
ingress-gce/pkg/utils/zonegetter/zone_getter.go
Lines 115 to 118 in 51ddd0b
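As a rough illustration of the question, the sketch below shows one way a node without a provider ID could be kept out of the candidate zones instead of contributing an empty zone. This is an assumption-laden sketch, not the zone getter's implementation. The `node` type and the `zoneFromProviderID` and `candidateZones` functions are hypothetical, and the only external fact relied on is the GCE provider ID layout `gce://<project>/<zone>/<instance>`.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// node is a hypothetical, trimmed-down view of a Kubernetes Node.
type node struct {
	Name       string
	ProviderID string
}

var errNoProviderID = errors.New("node has no provider ID")

// zoneFromProviderID extracts the zone from a GCE provider ID of the form
// "gce://<project>/<zone>/<instance>". It never returns an empty zone
// together with a nil error.
func zoneFromProviderID(providerID string) (string, error) {
	if providerID == "" {
		return "", errNoProviderID
	}
	const prefix = "gce://"
	if !strings.HasPrefix(providerID, prefix) {
		return "", fmt.Errorf("provider ID %q is not a GCE provider ID", providerID)
	}
	parts := strings.Split(strings.TrimPrefix(providerID, prefix), "/")
	if len(parts) != 3 || parts[1] == "" {
		return "", fmt.Errorf("provider ID %q has no zone segment", providerID)
	}
	return parts[1], nil
}

// candidateZones returns the sorted, de-duplicated zones of all nodes whose
// zone can be determined, plus the names of the nodes that were skipped.
func candidateZones(nodes []node) (zones, skipped []string) {
	seen := map[string]bool{}
	for _, n := range nodes {
		zone, err := zoneFromProviderID(n.ProviderID)
		if err != nil {
			skipped = append(skipped, n.Name)
			continue
		}
		if !seen[zone] {
			seen[zone] = true
			zones = append(zones, zone)
		}
	}
	sort.Strings(zones)
	return zones, skipped
}

func main() {
	nodes := []node{
		{Name: "gke-node-1", ProviderID: "gce://my-project/us-central1-a/gke-node-1"},
		{Name: "gke-node-2", ProviderID: "gce://my-project/us-central1-b/gke-node-2"},
		{Name: "vk-node", ProviderID: ""}, // virtual kubelet node, no provider ID
	}
	zones, skipped := candidateZones(nodes)
	fmt.Println("candidate zones:", zones)
	fmt.Println("skipped nodes:", skipped)
}
```

The design choice here is to treat "zone unknown" as an error at the point of extraction, so callers cannot silently carry `""` into a zonal lookup.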
GKE version: v1.27.11-gke.1118000
/kind bug
> After removing the virtual nodes, the NEGs sync again but the issue is a bit confusing because how does an empty zone string make it there:
I think GKE 1.27 might not have these changes yet, so the latest code on master here may not be representative of all GKE versions.
/cc @songrx1997
/cc @swetharepakula
We've seen another failure mode where the controller fails to sync IPs and the LB backends end up with stale endpoints.
We hit this with 1.28.8-gke.1095000 (although the nodes were on 1.27):
Warning SyncNetworkEndpointGroupFailed 33s (x7 over 2m23s) neg-controller Failed to sync NEG "k8s1-blaxxxx" (will retry): failed to get current NEG endpoints: Failed to lookup NEG in zone "", candidate zones map[us-central1-a:{} us-central1-b:{} us-central1-c:{} us-central1-f:{}], err - Key Key{"k8s1-blaxxxx"} not valid for zonal resource NetworkEndpointGroup k8s1-blaxxxx
The fix is available starting with 1.29.1-gke.1119000. We have just backported it to Ingress 1.26, which will be released to GKE 1.28 in the next few weeks. We will include a release note when we do.
@swetharepakula seems like upgrading to 1.29 did the trick. I am slightly confused by this comment: "Ingress 1.26 which will be released to GKE 1.28 in the next few weeks". What's the current version mapping between the release-xx branches and what's running on GKE? I was expecting release-1.28 to be what's on GKE 1.28, but that doesn't seem to be the case.
The README.md used to be kept up to date but hasn't been updated in a long time. Knowing this information would be great for debugging and mitigating things on our end before we escalate to support.