doitintl / kubeip

Assign static public IPs to Kubernetes nodes (GKE, EKS)

Home Page: https://kubeip.com

KubeIP sporadically fails to assign the address on GCP because of a missing access config check during the delete step

ijozic opened this issue · comments

On GCP, if a static IP address is, for example, manually detached from a node and the node then gets assigned a new ephemeral IP address, KubeIP fails to attach the labeled static address: when it tries to delete the node's new ephemeral address, it thinks the node has no network access config. The fetch logic might be bugged; I believe I listed the node's details and saw that it did in fact have a network access config. If I rescale the cluster and thus create a new node, the process works as expected.

{"error":"failed to delete current public IP address: failed to get instance network interface access config: instance network interface has no access configs","file":"/app/cmd/main.go:97","func":"main.assignAddress","level":"error","msg":"failed to assign static public IP address to node some node","time":"2023-12-14T09:24:23Z","version":"sha-ce43fbb"}

I can try to reproduce it again and list the node details to show that it does have the access config.
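For illustration, here is a minimal sketch of what skipping the delete when there is no access config could look like. This is not kubeip's actual code: it uses the legacy google.golang.org/api/compute/v1 client, and the helper name releaseEphemeralIP is made up.

```go
package main

import (
	"context"
	"fmt"

	compute "google.golang.org/api/compute/v1"
)

// releaseEphemeralIP is a hypothetical helper (not kubeip's actual code) that
// deletes the node's current external access config, treating "no access
// config" as a no-op rather than an error.
func releaseEphemeralIP(ctx context.Context, svc *compute.Service, project, zone, instance string) error {
	inst, err := svc.Instances.Get(project, zone, instance).Context(ctx).Do()
	if err != nil {
		return fmt.Errorf("failed to get instance %s: %w", instance, err)
	}
	if len(inst.NetworkInterfaces) == 0 {
		return fmt.Errorf("instance %s has no network interfaces", instance)
	}
	nic := inst.NetworkInterfaces[0]
	if len(nic.AccessConfigs) == 0 {
		// Nothing attached: skip the delete instead of failing the assignment.
		return nil
	}
	// A real implementation would also wait for the returned zonal operation to complete.
	_, err = svc.Instances.DeleteAccessConfig(project, zone, instance, nic.AccessConfigs[0].Name, nic.Name).Context(ctx).Do()
	if err != nil {
		return fmt.Errorf("failed to delete access config: %w", err)
	}
	return nil
}
```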

I'm getting the same issue

@ijozic How are you able to reproduce it? I'm getting the issue only sometimes, not on all nodes.

@ijozic @olivierboucher can you reproduce this? If so, could you provide clear steps on how to do so?

@ijozic @olivierboucher @eyalzek I have the same issue. In my case it happens when the Kubernetes nodes are upgraded. I have blue-green upgrades enabled and exactly as many static IP addresses as there are Kubernetes nodes: with 5 nodes in my cluster, I labeled 5 IPs to be used in that cluster. All my node pools are scaled to 1, my batch node count is 1, and the soak duration is 3600.

@Pestilenciaa do you manage to consistently reproduce the error reported in the original post?

If so, could you share the YAML description of your node pools [1]?

[1] https://cloud.google.com/sdk/gcloud/reference/container/node-pools/describe

@eyalzek

It's been a while and I've moved back to the previous version. If I remember correctly, the main issue was that the code didn't check whether the network interface existed before deleting it. I could have opened a PR, but I lacked the time at that moment.

@olivierboucher AFAICT, in the 2 places where the code interacts with the NIC (adding/removing an address), it either fetches the network interface or errors out [1][2].

In any case, if you have the time to help with reproducing this, it'd be highly appreciated :)

[1] https://github.com/doitintl/kubeip/blob/master/internal/address/gcp.go#L150-L153
[2] https://github.com/doitintl/kubeip/blob/master/internal/address/gcp.go#L181-L184

@eyalzek

What I'm saying is that there should be a distinction between not being able to fetch the NIC and the instance having no NIC... therefore, the deletion should be skipped if there is no NIC to delete
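Something along these lines is what I have in mind. This is just a sketch: ErrNoAccessConfig, getAccessConfig and deleteAccessConfig are hypothetical names, not the actual kubeip code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ErrNoAccessConfig is a hypothetical sentinel error marking the
// "nothing attached, nothing to delete" case.
var ErrNoAccessConfig = errors.New("instance network interface has no access configs")

type assigner struct{}

// getAccessConfig and deleteAccessConfig are hypothetical stand-ins for the real GCP calls.
func (a *assigner) getAccessConfig(ctx context.Context, node string) (string, error) {
	return "", ErrNoAccessConfig
}

func (a *assigner) deleteAccessConfig(ctx context.Context, node, accessConfig string) error {
	return nil
}

// unassignAddress skips deletion when there is no access config,
// but still surfaces genuine fetch failures.
func (a *assigner) unassignAddress(ctx context.Context, node string) error {
	ac, err := a.getAccessConfig(ctx, node)
	if errors.Is(err, ErrNoAccessConfig) {
		return nil // nothing to delete, not an error
	}
	if err != nil {
		return fmt.Errorf("failed to get access config: %w", err)
	}
	return a.deleteAccessConfig(ctx, node, ac)
}
```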

@olivierboucher the function getNetworkInterface [1] should return an error if there is no NIC (or if it can't get it). And since this is the first check in both the functions I mentioned above, the deletion/adding of an address should be skipped if it fails.
Or did I misunderstand what you meant?

[1] https://github.com/doitintl/kubeip/blob/master/internal/address/gcp.go#L369-L371

Yes, but the problem is that the whole function errors out [1] when it should just complete and return nil if there is no NIC.

[1] https://github.com/doitintl/kubeip/blob/master/internal/address/gcp.go#L152

Well if it doesn't fail then the following logic in Assign/Unassign will, since it doesn't really make sense to proceed without the NIC. That's where the retry mechanism should do the trick.

Unless you're describing a scenario where KubeIP goes into a failing loop? If so, could you provide steps to reproduce?

This can happen when a new node is attached to a GKE cluster for whatever reason: node upgrade, scale-up, health restart, etc.
In this case, the node info is updated with some delay and KubeIP can fail to assign a static IP to the node.
To overcome this, KubeIP retries assigning a static IP for RETRY_INTERVAL x RETRY_ATTEMPTS: 1m x 60 == 1h by default; both parameters can be configured.

So, if the node's network interface details are not yet updated when KubeIP runs the first time, KubeIP will retry again and again until it succeeds or the maximum number of retry attempts is reached.
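Roughly, the retry behaves like the following sketch. The names and signature are illustrative, not kubeip's actual code; only the RETRY_INTERVAL x RETRY_ATTEMPTS idea carries over.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// assignWithRetry keeps calling assign until it succeeds, the context is
// cancelled, or the attempt budget is exhausted.
func assignWithRetry(ctx context.Context, assign func(context.Context) error, interval time.Duration, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = assign(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval): // e.g. RETRY_INTERVAL=1m with RETRY_ATTEMPTS=60 == 1h total
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}
```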

@olivierboucher @ijozic @Pestilenciaa @vini-intenseye can you try to reproduce the issue, wait a few minutes and see if all nodes got a static IP assigned? If not, please attach the KubeIP logs from problematic nodes.

#138 fixes this issue too

Tested with:

  • manual node deletion - the static address is released and the newly added node picks up the released address
  • cluster resize (scale up and down) - everything works as expected

Closing this issue for now. If you can reproduce it with the latest version, please reopen it and describe the process for reproducing it.