aws / aws-network-policy-agent

CrashLoopBackOff after upgrade to 1.17.1

James-Quigley opened this issue

What happened:
Certain pods are crash-looping after upgrading the aws-vpc-cni version.

Attach logs

2024-04-24 11:12:32.290147858 +0000 UTC Logger.check error: failed to get caller
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xaaaaea2a319c]
goroutine 99 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:116 +0x1a4
panic({0xaaaaeb14afa0?, 0xaaaaec6eade0?})
	/root/sdk/go1.21.7/src/runtime/panic.go:914 +0x218
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).configureeBPFProbes(0x40004de000, {0xaaaaeb51dff8, 0x4000794b10}, {0x40006027d0, 0x44}, {0x4000644d40?, 0x1, 0x0?}, {0x40006bd100, 0x2, ...}, ...)
	/workspace/controllers/policyendpoints_controller.go:292 +0x34c
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).reconcilePolicyEndpoint(0x40004de000, {0xaaaaeb51dff8, 0x4000794b10}, 0x4000655520)
	/workspace/controllers/policyendpoints_controller.go:266 +0x58c
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).reconcile(0x40004de000, {0xaaaaeb51dff8, 0x4000794b10}, {{{0x4000489f80, 0x1c}, {0x400005bdc0, 0x32}}})
	/workspace/controllers/policyendpoints_controller.go:145 +0x1a4
github.com/aws/aws-network-policy-agent/controllers.(*PolicyEndpointsReconciler).Reconcile(0x40004de000, {0xaaaaeb51dff8, 0x4000794b10}, {{{0x4000489f80, 0x1c}, {0x400005bdc0, 0x32}}})
	/workspace/controllers/policyendpoints_controller.go:126 +0xe4
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xaaaaeb520850?, {0xaaaaeb51dff8?, 0x4000794b10?}, {{{0x4000489f80?, 0xb?}, {0x400005bdc0?, 0x0?}}})
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:119 +0x8c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0x40000e68c0, {0xaaaaeb51e030, 0x400001f450}, {0xaaaaeb255960?, 0x40004e6a80?})
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:316 +0x2e8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0x40000e68c0, {0xaaaaeb51e030, 0x400001f450})
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:266 +0x16c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:227 +0x74
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 78
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.0/pkg/internal/controller/controller.go:223 +0x43c

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):
I don't have a specific reproduction. It only seems to happen occasionally, and only on certain pods/nodes. I haven't yet been able to determine the exact cause.

Anything else we need to know?:
Might be related to aws/amazon-vpc-cni-k8s#2562

We have network policies currently disabled, both in the configmap and the command line flags.
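
For reference, the relevant settings can be checked with something like the following; the ConfigMap, DaemonSet, and container details are assumptions that may differ by add-on version:

  # Check the VPC CNI ConfigMap for the network policy setting
  # (ConfigMap and key names are assumptions; adjust for your install)
  kubectl -n kube-system get configmap amazon-vpc-cni -o yaml | grep -i network-policy

  # Check the flags passed to the node agent container in the aws-node DaemonSet
  kubectl -n kube-system get daemonset aws-node -o yaml | grep -i enable-network-policy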

Environment:

  • Kubernetes version (use kubectl version): 1.26
  • CNI Version: 1.17.1
  • OS (e.g: cat /etc/os-release): Bottlerocket
  • Kernel (e.g. uname -a): Linux <name> 5.15.148 #1 SMP Fri Feb 23 02:47:29 UTC 2024 x86_64 GNU/Linux

@James-Quigley It looks like you have stale NetworkPolicy and/or PolicyEndpoint resources in your cluster even though NP is disabled (kubectl get networkpolicies -A / kubectl get policyendpoints -A).
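
A minimal check using the commands above:

  # List NetworkPolicy resources across all namespaces
  kubectl get networkpolicies -A

  # List the PolicyEndpoint custom resources created by the VPC CNI NP controller
  kubectl get policyendpoints -A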

@achevuru What makes them "stale"? And how can that be resolved? Or are you saying that we can't have any NetworkPolicies in the cluster if we're going to have netpols disabled on the CNI daemonset?

When you enable Network Policy in VPC CNI, the Network Policy controller resolves the selectors in the NP spec and creates an intermediate custom resource called PolicyEndpoints. This resource is specific to the VPC CNI Network Policy implementation. If you then disable NP in VPC CNI, we need to make sure these resources are cleaned up; otherwise stale firewall rules can remain enforced on pod interfaces, resulting in unexpected behavior. Deleting the NP resources will clear out the corresponding PolicyEndpoint resources. These resources are stale because they reflect the state of the endpoints at the time the feature was enabled in the cluster.
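
For illustration, one of these intermediate resources can be inspected like this (the namespace and resource name are placeholders):

  # View a PolicyEndpoint resource; it reflects the endpoints resolved
  # at the time the corresponding NetworkPolicy was last reconciled
  kubectl -n <namespace> get policyendpoints <name> -o yaml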

You can have Network Policy resources in your cluster with NP disabled in VPC CNI. But if you enabled it in VPC CNI and are now trying to disable it, these resources need to be cleared out first. So:

  • Delete the NPs
  • Disable the NP feature in the ConfigMap and NP agent

You can then reconfigure your NPs (if you want another NP solution to act on them).
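
A rough sketch of that sequence; the ConfigMap name and the way the feature flag is set are assumptions that depend on how the add-on is managed in your cluster:

  # 1. Delete the NetworkPolicy resources so the controller can clean up
  #    the corresponding PolicyEndpoints and eBPF probes
  kubectl delete networkpolicies --all --all-namespaces

  # 2. Confirm the PolicyEndpoints are gone before proceeding
  kubectl get policyendpoints -A

  # 3. Only then disable the NP feature, e.g. by editing the VPC CNI ConfigMap
  #    and the agent flag (names are assumptions)
  kubectl -n kube-system edit configmap amazon-vpc-cni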

Would deleting the PolicyEndpoint resources fix the problem? Also, why does this only seem to happen sometimes, and only for specific pods?

+1, deleting the PolicyEndpoints fixed it. Are there any plans to resolve the nil pointer dereference, or to delete PolicyEndpoints automatically when NP is disabled?
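
For reference, the workaround reported in this thread amounts to something like the following; note that the maintainers' recommended path is to delete the NetworkPolicies first, as described above:

  # Remove the stale PolicyEndpoint resources directly so the agent
  # stops panicking on them (workaround reported by commenters here)
  kubectl delete policyendpoints --all --all-namespaces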

+1, stumbled on this while trying to disable NetworkPolicies during a failed rollout. Secret behaviors like this are not fun, and a hard panic seems like a poor way for the CNI agent to handle it.

As I called out above, we took this approach to prevent stale resources in the cluster. The Network Policy controller and agent should be allowed to clear out the eBPF probes configured against individual pods to enforce the NPs. The hard failure alerts users to stale firewall rules that are still active against running pods. So the recommended way to disable the NP feature is to follow the sequence above (delete the NPs, then disable the NP feature in the ConfigMap and NP agent).

OK, so it's a choice. But a log message stating that, rather than just a panic, might help the user, since this behavior appears to be documented only in this thread?

There's no mention of this in your README, nor in the AWS docs at https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html

It's not obvious that you need to go delete a bunch of resources in order to change an enforcement flag from true to false, so a little help here would be appreciated.

Fair enough. We will call this out clearly in the documentation.

We also hit this issue when attempting to disable enforcement via VPC CNI, leading to a failed apply of the configuration update via EKS and crash-looping aws-node pods. Even AWS Support was unaware of this limitation, and it took a significant amount of time to track down the fault.
Both an explicit mention of the limitation in the documentation and more useful log messages would have saved a lot of wasted time.