aws / aws-network-policy-agent

Getting "watch of *v1alpha1.PolicyEndpoint ended with: an error on the server" after upgrading VPC CNI to v1.17.1+ with aws-network-policy-agent v1.1.0

ArtemProskochylo opened this issue

What happened:
After upgrading the vpc-cni plugin to v1.17.1 and v1.18.0, I see a lot of errors from the aws-network-policy-agent container running v1.1.0. The issue occurs even on fresh EKS installations where we are not using Network Policies.

Attach logs
W0424 08:27:34.397257 1 reflector.go:462] pkg/mod/k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229: watch of *v1alpha1.PolicyEndpoint ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding

What you expected to happen:
No error messages.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy a v1.29 EKS cluster.
  2. Deploy the VPC CNI add-on, version v1.17.1-eksbuild.1 or v1.18.0-eksbuild.1.
  3. Run kubectl -n kube-system logs aws-node-* (see the sketch below).
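
The network policy agent runs as a separate container inside the aws-node pods, so step 3 is easiest against that container directly. A minimal sketch, assuming the default k8s-app=aws-node label and the aws-eks-nodeagent container name used by the managed add-on:

  # Tail the network policy agent container across all aws-node pods
  kubectl -n kube-system logs -l k8s-app=aws-node -c aws-eks-nodeagent --tail=50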

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: v1.29.1
    Server Version: v1.29.1-eks-b9c9ed7
  • CNI Version: v1.17.1 and v1.18.0
  • Network Policy Agent Version: v1.1.0
  • OS (e.g: cat /etc/os-release): Bottlerocket OS 1.19.2 (aws-k8s-1.29)
  • Kernel (e.g. uname -a): 6.1.77

@ArtemProskochylo How did you upgrade the VPC CNI version? It appears that you're missing the required permissions for the aws-node pod. Did you apply the corresponding version-specific manifest?
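
For reference, applying the version-specific manifest for a self-managed install looks roughly like the sketch below; the release tag and manifest path are assumptions based on the amazon-vpc-cni-k8s repository layout and should be matched to your target version:

  # Apply the CNI manifest matching the target release (tag/path assumed)
  kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.18.0/config/master/aws-k8s-cni.yaml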

Facing the same issue after upgrading to EKS 1.29 with CNI 1.18.0.
@achevuru I upgraded the addon directly from AWS using Terraform. I checked the ClusterRole configuration and it has the permissions you referred to:

  - apiGroups:
    - networking.k8s.aws
    resources:
    - policyendpoints
    verbs:
    - get
    - list
    - watch
Seems like a bug.

@danielap-ma If you're seeing the same error as above, then either the permissions are missing (please check that the CNI pods have the correct SA in place) or there are connectivity issues with your API server. I quickly tried it and I don't see any such issue on my end.
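
For anyone checking both possibilities, the two suggestions above can be verified roughly as follows; the DaemonSet and service account names assume a default managed add-on install:

  # Which service account do the CNI pods actually use?
  kubectl -n kube-system get ds aws-node -o jsonpath='{.spec.template.spec.serviceAccountName}'
  # Can that service account watch PolicyEndpoints?
  kubectl auth can-i watch policyendpoints.networking.k8s.aws \
      --as=system:serviceaccount:kube-system:aws-node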

Hi @achevuru,
Sorry for the late response. It was also updated through Terraform, but in my case only the add-on version was set through Terraform; the configmaps, daemonset, and other resources are managed by AWS. I have checked the RBAC for vpc-cni v1.17.1 and the required permissions are present there:
  - apiGroups:
    - networking.k8s.aws
    resources:
    - policyendpoints
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - networking.k8s.aws
    resources:
    - policyendpoints/status
    verbs:
    - get
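
A related sanity check is whether the PolicyEndpoint CRD itself is registered and served; a sketch, assuming the CRD name used by the upstream controller:

  # Verify the PolicyEndpoint CRD is registered
  kubectl get crd policyendpoints.networking.k8s.aws
  # List PolicyEndpoint objects (may be empty if no Network Policies are in use)
  kubectl get policyendpoints -A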

But I still see the following error in the logs for v1.17.1:
W0509 03:34:41.481449 1 reflector.go:462] pkg/mod/k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229: watch of *v1alpha1.PolicyEndpoint ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

In another cluster running the updated version v1.18.1, I do not see those errors. I suppose it is a version-specific issue.
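
For comparison across clusters, this is roughly how the managed add-on version can be checked and bumped with the AWS CLI; the cluster name and exact build tag below are placeholders:

  # Check the currently installed vpc-cni add-on version
  aws eks describe-addon --cluster-name my-cluster --addon-name vpc-cni \
      --query 'addon.addonVersion' --output text
  # Move to a newer build (version tag is an example)
  aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni \
      --addon-version v1.18.1-eksbuild.1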

I hope the provided info is useful for you.

Thanks