kubernetes / ingress-gce

Ingress controller for Google Cloud

Cluster with virtual kubelet blocking NEG sync

marwanad opened this issue · comments

We have a cluster with some virtual kubelet (VK) nodes; those VK nodes have no provider IDs. After a GKE upgrade (which moved the ingress pods to new hosts), we got the error below on the ingress Service, with the NEGs failing to add any endpoints.

Warning  SyncNetworkEndpointGroupFailed  35m (x10 over 27h)  neg-controller         Failed to sync NEG "k8s1-endpoint-bla" (will not retry): Failed to lookup NEG in zone "", candidate zones map[us-central1-a:{} us-central1-b:{} us-central1-c:{} us-central1-f:{}], err - Key Key{"k8s1-endpoint-bla"} not valid for zonal resource NetworkEndpointGroup k8s1-endpoint-bla 
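For anyone debugging this, one quick way to confirm which nodes are missing a providerID (the VK nodes in our case) is a small client-go listing along the lines of the sketch below; the kubeconfig path and client setup are illustrative assumptions, not part of the controller.

// Hypothetical debugging helper, not controller code: prints nodes whose
// Spec.ProviderID is empty, i.e. the virtual kubelet nodes that can trip
// the NEG zone lookup. Assumes a kubeconfig at the default location.
package main

import (
    "context"
    "fmt"
    "path/filepath"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/client-go/util/homedir"
)

func main() {
    kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
    cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, n := range nodes.Items {
        if n.Spec.ProviderID == "" {
            fmt.Printf("node %s has no providerID\n", n.Name)
        }
    }
}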

We tracked this down to the codepath below:

return nil, nil, fmt.Errorf("Failed to lookup NEG in zone %q, candidate zones %v, err - %w", zone, candidateZonesMap, err)
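To make the mechanism concrete, here is a minimal standalone sketch (invented types and names, not the controller's actual code) of how a node whose providerID yields no zone contributes an empty string to the set of zones the syncer then looks up, which matches the zone "" in the event above.

// Minimal sketch with invented names: a node without a usable providerID
// contributes "" to the zone set, so one of the per-zone NEG lookups runs
// with an empty zone.
package main

import "fmt"

type node struct {
    name string
    zone string // derived from providerID; "" for a virtual kubelet node
}

func zonesFromNodes(nodes []node) map[string]struct{} {
    zones := map[string]struct{}{}
    for _, n := range nodes {
        // Older controller versions did not treat an empty zone as an error,
        // so "" lands in the set alongside the real zones.
        zones[n.zone] = struct{}{}
    }
    return zones
}

func main() {
    nodes := []node{
        {name: "gke-node-1", zone: "us-central1-a"},
        {name: "virtual-kubelet", zone: ""},
    }
    for zone := range zonesFromNodes(nodes) {
        fmt.Printf("looking up NEG in zone %q\n", zone) // one lookup uses zone ""
    }
}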

After removing the virtual nodes, the NEGs sync again, but the issue is a bit confusing: how does an empty zone string make it past the check below?

func getZone(node *api_v1.Node) (string, error) {
    if node.Spec.ProviderID == "" {
        return "", fmt.Errorf("%w: node %s does not have providerID", ErrProviderIDNotFound, node.Name)
    }
    matches := providerIDRE.FindStringSubmatch(node.Spec.ProviderID)
    if len(matches) != 4 {
        return "", fmt.Errorf("%w: providerID %q of node %s is not valid", ErrSplitProviderID, node.Spec.ProviderID, node.Name)
    }
    if matches[2] == "" {
        return "", fmt.Errorf("%w: node %s has an empty zone", ErrSplitProviderID, node.Name)
    }
    return matches[2], nil
}

if err != nil {
    logger.Error(err, "Failed to get zone from providerID", "nodeName", n.Name)
    continue
}
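For reference, the GCE providerID has the shape gce://<project>/<zone>/<instance>, so with the code above a VK node with an empty providerID returns ErrProviderIDNotFound and is skipped by the caller. A self-contained sketch of that parsing (the regex mirrors the shape of the real one and is an assumption, not copied from a specific release branch):

// Standalone sketch of the providerID parsing for "gce://<project>/<zone>/<instance>".
package main

import (
    "fmt"
    "regexp"
)

var providerIDRE = regexp.MustCompile(`^gce://([^/]+)/([^/]+)/([^/]+)$`)

// zoneFromProviderID returns the zone segment, or "" when the providerID does
// not match -- which is how an empty zone could slip through for a virtual
// kubelet node if the caller does not treat it as an error.
func zoneFromProviderID(providerID string) string {
    matches := providerIDRE.FindStringSubmatch(providerID)
    if len(matches) != 4 {
        return ""
    }
    return matches[2]
}

func main() {
    fmt.Println(zoneFromProviderID("gce://my-project/us-central1-a/gke-node-1")) // us-central1-a
    fmt.Println(zoneFromProviderID(""))                                          // "" for a VK node
}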

GKE version: v1.27.11-gke.1118000

/kind bug

After removing the virtual nodes, the NEGs sync again, but the issue is a bit confusing: how does an empty zone string make it there?

I think GKE 1.27 might not have these changes yet (so the latest code from master here may not be entirely representative of all GKE versions)

/cc @songrx1997
/cc @swetharepakula

We've seen another failure mode where the controller fails to sync IPs and the LB backends end up with stale endpoints.

We've hit the above with 1.28.8-gke.1095000 (although the nodes were on 1.27):

  Warning  SyncNetworkEndpointGroupFailed  33s (x7 over 2m23s)  neg-controller         Failed to sync NEG "k8s1-blaxxxx" (will retry): failed to get current NEG endpoints: Failed to lookup NEG in zone "", candidate zones map[us-central1-a:{} us-central1-b:{} us-central1-c:{} us-central1-f:{}], err - Key Key{"k8s1-blaxxxx"} not valid for zonal resource NetworkEndpointGroup k8s1-blaxxxx

The fix is available starting in 1.29.1-gke.1119000+. We have just backported it to Ingress 1.26, which will be released to GKE 1.28 in the next few weeks. We will include a release note when we do.

@swetharepakula seems like upgrading to 1.29 did the trick. I am slightly confused by the comment "Ingress 1.26 which will be released to GKE 1.28 in the next few weeks" - what's the current mapping between the release-xx branches and what's running on GKE? I was expecting release-1.28 to be what's on GKE 1.28, but that doesn't seem to be the case.

The README.md used to carry this information but hasn't been updated in a long time. Knowing the mapping would be great for debugging and mitigating issues on our end before we escalate to support.