aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS

EKS 1.28 latest CNI - failed to assign an IP address to container - large private subnets

mrocheleau opened this issue

What happened:
Since updating to EKS 1.28 (from 1.26, via 1.27) and moving the vpc-cni to the managed (and most current) add-on, we are seeing sporadic failures to assign IP addresses to new pods. This largely happens when we do a deploy that spins up many new pods across many namespaces, say 4-6 pods in each of 10-15 namespaces, for a growth of roughly ~100 new pods within a few minutes. Nothing huge though.

The pods often log an event like:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "47242e4c489f3d21d94103cf92d78cca90cb6467fbe8eb7670c77e86c80e09c6": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

Unfortunately our deploy system watches for error events and rolls back when it sees this, so I lose the pod info and the node it was on, and with them the chance to pull direct logs. But checking a random node's ipamd.log, I see blocks like this during the windows when we hit the errors above:

{"level":"debug","ts":"2024-02-21T20:39:45.367Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"lighthouse-vishnu-api-68dfc649f4-qnbgj\"  K8S_POD_NAMESPACE:\"lighthouse\"  K8S_POD_INFRA_CONTAINER_ID:\"26c6f655654a4e085a118aaf4a966d67cbf47469f807ae7dc6291b19697d4658\"  ContainerID:\"26c6f655654a4e085a118aaf4a966d67cbf47469f807ae7dc6291b19697d4658\"  IfName:\"eth0\"  NetworkName:\"aws-cni\"  Netns:\"/var/run/netns/cni-ea424685-d396-48ba-dc83-e5c1f10079b8\""}
{"level":"debug","ts":"2024-02-21T20:39:45.367Z","caller":"datastore/data_store.go:607","msg":"AssignPodIPv4Address: IP address pool stats: total 18, assigned 8"}
{"level":"debug","ts":"2024-02-21T20:39:45.367Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 172.28.15.192/ffffffff"}
{"level":"debug","ts":"2024-02-21T20:39:45.367Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 172.28.15.192/ffffffff"}
{"level":"debug","ts":"2024-02-21T20:39:45.367Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 172.28.14.126/ffffffff"}
{"level":"debug","ts":"2024-02-21T20:39:45.367Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 172.28.14.126/ffffffff"}
{"level":"debug","ts":"2024-02-21T20:39:45.367Z","caller":"datastore/data_store.go:687","msg":"Get free IP from prefix failed no free IP available in the prefix - 172.28.14.119/ffffffff"}
{"level":"debug","ts":"2024-02-21T20:39:45.367Z","caller":"datastore/data_store.go:607","msg":"Unable to get IP address from CIDR: no free IP available in the prefix - 172.28.14.119/ffffffff"}
{"level":"debug","ts":"2024-02-21T20:39:45.367Z","caller":"datastore/data_store.go:1291","msg":"Found a free IP not in DB - 172.28.12.25"}
{"level":"debug","ts":"2024-02-21T20:39:45.367Z","caller":"datastore/data_store.go:687","msg":"Returning Free IP 172.28.12.25"}
{"level":"debug","ts":"2024-02-21T20:39:45.367Z","caller":"datastore/data_store.go:607","msg":"New IP from CIDR pool- 172.28.12.25"}
{"level":"info","ts":"2024-02-21T20:39:45.367Z","caller":"datastore/data_store.go:714","msg":"assignPodIPAddressUnsafe: Assign IP 172.28.12.25 to sandbox aws-cni/26c6f655654a4e085a118aaf4a966d67cbf47469f807ae7dc6291b19697d4658/eth0"}

The subnet in question here is 172.28.12.0/22, so 1024 addresses, and during the event it showed 800+ free IPs. I see it did find one in the end here, but how did it fail to find one in the first ranges it checked?

If I can isolate which node is failing to assign IPs to a pod that's showing the event, I'll upload the full logs. But I don't understand how, in a subnet with 80%+ free IPs, any of the smaller blocks inside the subnet CIDR could possibly have "no free IP".
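
One way to capture the affected pod and node before the rollback wipes them is to watch for the sandbox events directly (a minimal sketch; the reason string matches the event above, and the placeholders are illustrative):

# Surface recent FailedCreatePodSandBox events cluster-wide, with the pods they involve
kubectl get events -A --field-selector reason=FailedCreatePodSandBox --sort-by=.lastTimestamp

# Record which node a still-pending pod landed on before the rollback removes it
kubectl get pod <pod-name> -n <namespace> -o wide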

Environment:

  • Kubernetes version (use kubectl version): v1.28.5-eks-5e0fdde
  • CNI Version v1.16.2-eksbuild.1
  • OS (e.g: cat /etc/os-release): MacOS locally, Linux containers using AL2
  • Kernel (e.g. uname -a): Darwin Kernel Version 23.2.0

@mrocheleau New ENIs are attached to the node as the number of pods scheduled on the node increases. So when there is a rapid scale-up, pods are sometimes stuck waiting for new ENIs to be attached (and for their IPs to become usable). The number of ENIs that are initially attached and the rate at which they are attached are controlled by the warm/minimum IP targets.

You can increase the WARM_IP_TARGET and MINIMUM_IP_TARGET to allocate more aggressively, but there is always the possibility of a failure.
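
As a rough sketch (the values below are illustrative, not recommendations), both are plain environment variables on the aws-node DaemonSet; with a managed add-on it is safer to pass them through the add-on configuration so an upgrade does not overwrite them:

# Illustrative values only - size these to your actual pod churn
kubectl set env daemonset aws-node -n kube-system WARM_IP_TARGET=5 MINIMUM_IP_TARGET=20

# Managed add-on route (the schema can be inspected with 'aws eks describe-addon-configuration')
aws eks update-addon --cluster-name <cluster> --addon-name vpc-cni \
  --configuration-values '{"env":{"WARM_IP_TARGET":"5","MINIMUM_IP_TARGET":"20"}}'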

EC2 API calls could fail at any time (and be retried), so rolling back on any error events is a bad idea.

@jdn5126 Alright that gives me something to work with, thank you.

We already have WARM_ENI_TARGET set to 1 (the default) and do not have WARM_IP_TARGET or MINIMUM_IP_TARGET set at all. For a cluster with, say, ~500 pods across ~18 nodes, is the suggestion still to define those two settings with some value? I see guidance at https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/eni-and-ip-target.md to set WARM_IP_TARGET only for "small clusters" without significant pod churn. This likely qualifies as small in the grand scheme of things, but I would appreciate your take.

We run m5.large, so 3 ENIs at 10 IPs apiece (I believe?), and limit max pods to 29 per node (the default). Since WARM_ENI_TARGET is already defined, doesn't that mean we already have 10 IPs warm and ready per node? Or is a warm ENI not the same as warm IPs?
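
For what it's worth, the per-instance-type limits can be pulled straight from EC2 rather than guessed (field names below are from the EC2 DescribeInstanceTypes API; the max-pods arithmetic is the usual ENIs x (IPs per ENI - 1) + 2):

# ENI and IPv4-per-ENI limits for m5.large
aws ec2 describe-instance-types --instance-types m5.large \
  --query 'InstanceTypes[0].NetworkInfo.[MaximumNetworkInterfaces,Ipv4AddressesPerInterface]'
# Returns 3 and 10 for m5.large: one IP on each ENI is the ENI's own primary address,
# so usable pod IPs work out to 3 * (10 - 1) + 2 = 29, matching the default max-pods of 29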

I'm unsure what these should be set to in order to give us some extra breathing room. If I were to keep 30 IPs per node warmed and ready (WARM_IP_TARGET = 30), that seems high: if we happened to spin up 3x the number of nodes we normally run, we could run out of IPs... but is that the only concern? Is there a reason to set both WARM_IP_TARGET and MINIMUM_IP_TARGET, or can I set one or the other?

Thanks so much!

Would it be outlandish to just set WARM_ENI_TARGET = 2? That should pre-reserve all 30 IPs, if two secondary ENIs are kept warm alongside the primary, each with 10 IPs. And does that differ in API calls/performance versus setting WARM_IP_TARGET and friends to the same effective number of IPs?

@mrocheleau yeah, the tradeoff for over-provisioning IPs is mainly just resource consumption (subnet space), as you do not get charged for private IPs. WARM_ENI_TARGET results in fewer EC2 API calls than MINIMUM_IP_TARGET+WARM_IP_TARGET, but IPs are allocated in larger chunks. Setting WARM_ENI_TARGET=2 sounds very reasonable for your case.
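
A minimal sketch of applying that (direct DaemonSet edit shown here; the managed add-on configuration route works the same way):

# Keep roughly two ENIs' worth of unassigned IPs attached and ready on each node
kubectl set env daemonset aws-node -n kube-system WARM_ENI_TARGET=2

# Confirm what the running DaemonSet actually has set
kubectl describe daemonset aws-node -n kube-system | grep -E 'WARM_|MINIMUM_'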

As a side note, on the subnet exhaustion side, #2714 should render this a non-issue going forward, as you can always provision and tag more private subnets and IPAMD will automatically use them.
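
Once that lands, adding capacity would presumably just be a matter of tagging another private subnet; the command below is a sketch, and the tag key shown is illustrative only, with the authoritative key being whatever the linked issue and docs settle on:

# Sketch only - the exact tag key/value is defined by the subnet discovery feature
aws ec2 create-tags --resources subnet-0123456789abcdef0 \
  --tags Key=kubernetes.io/role/cni,Value=1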

On the warm/minimum targets side, removing the need to ever touch or worry about these values is our main focus for 2024. Customers should never have to worry about these things; they should "just work". That being said, APIs can still fail at any time, especially when things are designed to be eventually consistent, so rolling back on any error events still seems like a bad idea.

@jdn5126 Great, thanks so much - I sure do like things that "just work", so that's awesome to hear. Appreciate the feedback as well; I did try WARM_ENI_TARGET=2 late yesterday, and it still resulted in a few IP address allocation errors just as before.

I'll try out a few more setting changes today, but it sounds like avoiding the rollback on all errors is the way to go, yeah. We can watch for a stalled deployment rollout to verify the deploy worked instead of checking for pod error events; in a case where we somehow cannot provision networking for a pod at all, the rollout would just stall and not progress.
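
One way to express that as a time-bounded check instead of reacting to individual pod events (the timeout value here is arbitrary; pick whatever fits your deploy window):

# Fail the deploy only if the rollout has not completed within the timeout,
# rather than rolling back on the first transient pod event
kubectl rollout status deployment/<name> -n <namespace> --timeout=15m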

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.

Yeah, I think time has to be part of the condition. If a pod scheduled on a node is not assigned an IP for 15 mins, that strongly implies that the upgrade caused an issue. It's a tough science to gauge when things are working correctly but slowly vs not working.