doitintl / kubeip

Assign static public IPs to Kubernetes nodes (GKE, EKS)

Home Page: https://kubeip.com

KubeIP sporadically fails to assign the address on GCP because of a missing access config check during the delete step

ijozic opened this issue · comments

On GCP, if a static IP address is, for example, manually detached from a node and the node then gets assigned a new ephemeral IP address, KubeIP fails to attach the labeled static address: when it tries to delete the node's new ephemeral address, it thinks the node has no network access config. The fetch logic might be bugged; I believe I listed the node's details and saw that it did in fact have a network access config. If I rescale the cluster and thus create a new node, the process works as expected.

{"error":"failed to delete current public IP address: failed to get instance network interface access config: instance network interface has no access configs","file":"/app/cmd/main.go:97","func":"main.assignAddress","level":"error","msg":"failed to assign static public IP address to node some node","time":"2023-12-14T09:24:23Z","version":"sha-ce43fbb"}

I can try to reproduce it again and list the node details to show that it does have the access config.
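For illustration, here is a minimal sketch of what skipping the delete when there is no access config could look like. This is not kubeip's actual code: it uses the legacy google.golang.org/api/compute/v1 client, and the helper name releaseEphemeralIP is made up.

```go
package main

import (
	"context"
	"fmt"

	compute "google.golang.org/api/compute/v1"
)

// releaseEphemeralIP is a hypothetical helper (not kubeip's actual code) that
// deletes the node's current external access config, treating "no access
// config" as a no-op rather than an error.
func releaseEphemeralIP(ctx context.Context, svc *compute.Service, project, zone, instance string) error {
	inst, err := svc.Instances.Get(project, zone, instance).Context(ctx).Do()
	if err != nil {
		return fmt.Errorf("failed to get instance %s: %w", instance, err)
	}
	if len(inst.NetworkInterfaces) == 0 {
		return fmt.Errorf("instance %s has no network interfaces", instance)
	}
	nic := inst.NetworkInterfaces[0]
	if len(nic.AccessConfigs) == 0 {
		// Nothing attached: skip the delete instead of failing the assignment.
		return nil
	}
	// A real implementation would also wait for the returned zonal operation to complete.
	_, err = svc.Instances.DeleteAccessConfig(project, zone, instance, nic.AccessConfigs[0].Name, nic.Name).Context(ctx).Do()
	if err != nil {
		return fmt.Errorf("failed to delete access config: %w", err)
	}
	return nil
}
```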

I'm getting the same issue

@ijozic How are you able to reproduce it? I'm getting the issue only sometimes, not on all nodes.

@ijozic @olivierboucher can you reproduce this? If so, could you provide clear steps on how to do so?

@ijozic @olivierboucher @eyalzek I have the same issue. In my case it happens when the Kubernetes nodes are upgraded. I have blue-green upgrades enabled and exactly as many static IP addresses as there are Kubernetes nodes: with 5 nodes in my cluster, I labeled 5 IPs to be used in that cluster. All my node pools are scaled to 1, my batch node count is 1, and the soak duration is 3600.

@Pestilenciaa do you manage to consistently reproduce the error reported in the original post?

If so, could you share the YAML description of your node pools [1]?

[1] https://cloud.google.com/sdk/gcloud/reference/container/node-pools/describe

@eyalzek

It's been a while and I've moved back to the previous version. If I remember correctly, the main issue was that the code didn't check whether the network interface existed before deleting it. I could have opened a PR, but I lacked the time at that moment.

@olivierboucher AFAICT, in the 2 places where the code interacts with the NIC (adding/removing an address), it either fetches the network interface or errors out [1][2].

In any case, if you have the time to help with reproducing this, it'd be highly appreciated :)

[1] https://github.com/doitintl/kubeip/blob/master/internal/address/gcp.go#L150-L153
[2] https://github.com/doitintl/kubeip/blob/master/internal/address/gcp.go#L181-L184

@eyalzek

What I'm saying is that there should be a distinction between not being able to fetch the NIC and the instance having no NIC... therefore, the deletion should be skipped if there is no NIC to delete
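Something along these lines is what I have in mind. This is just a sketch: ErrNoAccessConfig, getAccessConfig and deleteAccessConfig are hypothetical names, not the actual kubeip code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ErrNoAccessConfig is a hypothetical sentinel error marking the
// "nothing attached, nothing to delete" case.
var ErrNoAccessConfig = errors.New("instance network interface has no access configs")

type assigner struct{}

// getAccessConfig and deleteAccessConfig are hypothetical stand-ins for the real GCP calls.
func (a *assigner) getAccessConfig(ctx context.Context, node string) (string, error) {
	return "", ErrNoAccessConfig
}

func (a *assigner) deleteAccessConfig(ctx context.Context, node, accessConfig string) error {
	return nil
}

// unassignAddress skips deletion when there is no access config,
// but still surfaces genuine fetch failures.
func (a *assigner) unassignAddress(ctx context.Context, node string) error {
	ac, err := a.getAccessConfig(ctx, node)
	if errors.Is(err, ErrNoAccessConfig) {
		return nil // nothing to delete, not an error
	}
	if err != nil {
		return fmt.Errorf("failed to get access config: %w", err)
	}
	return a.deleteAccessConfig(ctx, node, ac)
}
```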

@olivierboucher the function getNetworkInterface [1] should return an error if there is no NIC (or if it can't get it). And since this is the first check in both the functions I mentioned above, the deletion/adding of an address should be skipped if it fails.
Or did I misunderstand what you meant?

[1] https://github.com/doitintl/kubeip/blob/master/internal/address/gcp.go#L369-L371

Yes, but the problem is that the whole function errors out [1] when it should just complete and return nil if there is no NIC.

[1] https://github.com/doitintl/kubeip/blob/master/internal/address/gcp.go#L152

Well if it doesn't fail then the following logic in Assign/Unassign will, since it doesn't really make sense to proceed without the NIC. That's where the retry mechanism should do the trick.

Unless you're describing a scenario where KubeIP goes into a failing loop? If so, could you provide steps to reproduce?

This can happen when a new node is attached to a GKE cluster for whatever reason: node upgrade, scale-up, health restart, etc.
In this case, the node info is updated with some delay and KubeIP can fail to assign a static IP to the node.
To overcome this, KubeIP retries assigning a static IP for RETRY_INTERVAL x RETRY_ATTEMPTS: 1m x 60 == 1h by default; both parameters can be configured.

So, if the node's network interface details are not yet updated when KubeIP runs the first time, KubeIP will retry again and again until it succeeds or the maximum number of retry attempts is reached.
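Roughly, the retry behaves like the following sketch. The names and signature are illustrative, not kubeip's actual code; only the RETRY_INTERVAL x RETRY_ATTEMPTS idea carries over.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// assignWithRetry keeps calling assign until it succeeds, the context is
// cancelled, or the attempt budget is exhausted.
func assignWithRetry(ctx context.Context, assign func(context.Context) error, interval time.Duration, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = assign(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval): // e.g. RETRY_INTERVAL=1m with RETRY_ATTEMPTS=60 == 1h total
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}
```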

@olivierboucher @ijozic @Pestilenciaa @vini-intenseye can you try to reproduce the issue, wait a few minutes and see if all nodes got a static IP assigned? If not, please attach the KubeIP logs from problematic nodes.

#138 fixes this issue too

Tested with:

  • manual node deletion - the static address is released and the newly added node picks up the released address
  • cluster resize (scale up and down) - everything works as expected

Closing this issue for now. If you can reproduce it with the latest version, please reopen it and describe the process for reproducing it.