aws / aws-network-policy-agent

Getting "watch of *v1alpha1.PolicyEndpoint ended with: an error on the server" after upgrading VPC CNI to v1.17.1+ with aws-network-policy-agent v1.1.0

ArtemProskochylo opened this issue

What happened:
After upgrading the vpc-cni plugin to v1.17.1 and v1.18.0, I see a lot of errors from the aws-network-policy-agent container running v1.1.0. The issue occurs even on fresh EKS installations where we are not using Network Policies.

Attach logs
W0424 08:27:34.397257 1 reflector.go:462] pkg/mod/k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229: watch of *v1alpha1.PolicyEndpoint ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding

What you expected to happen:
No error messages.

How to reproduce it (as minimally and precisely as possible):

  1. Deploy a v1.29 EKS cluster.
  2. Deploy the VPC CNI add-on, version v1.17.1-eksbuild.1 or v1.18.0-eksbuild.1.
  3. Run kubectl -n kube-system logs aws-node-* (see the sketch below).
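
The network policy agent runs as a separate container inside the aws-node pods, so step 3 is easiest against that container directly. A minimal sketch, assuming the default k8s-app=aws-node label and the aws-eks-nodeagent container name used by the managed add-on:

  # Tail the network policy agent container across all aws-node pods
  kubectl -n kube-system logs -l k8s-app=aws-node -c aws-eks-nodeagent --tail=50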

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: v1.29.1
    Server Version: v1.29.1-eks-b9c9ed7
  • CNI Version: v1.17.1 and v1.18.0
  • Network Policy Agent Version: v1.1.0
  • OS (e.g: cat /etc/os-release): Bottlerocket OS 1.19.2 (aws-k8s-1.29)
  • Kernel (e.g. uname -a): 6.1.77

@ArtemProskochylo How did you upgrade the VPC CNI version? It appears that you're missing the required permissions for the aws-node pod. Did you apply the corresponding version-specific manifest?
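
For reference, applying the version-specific manifest for a self-managed install looks roughly like the sketch below; the release tag and manifest path are assumptions based on the amazon-vpc-cni-k8s repository layout and should be matched to your target version:

  # Apply the CNI manifest matching the target release (tag/path assumed)
  kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.18.0/config/master/aws-k8s-cni.yaml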

Facing the same issue after upgrading to EKS 1.29 with CNI 1.18.0.
@achevuru I upgraded the addon directly from AWS using Terraform. I checked the ClusterRole configuration and it has the permissions you referred to:

  - apiGroups:
    - networking.k8s.aws
    resources:
    - policyendpoints
    verbs:
    - get
    - list
    - watch
Seems like a bug.

@danielap-ma If you're seeing the same error as above, then either the permissions are missing (please check that the CNI pods have the correct SA in place) or there are connectivity issues with your API server. I quickly tried it and I don't see any such issue on my end.
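
For anyone checking both possibilities, the two suggestions above can be verified roughly as follows; the DaemonSet and service account names assume a default managed add-on install:

  # Which service account do the CNI pods actually use?
  kubectl -n kube-system get ds aws-node -o jsonpath='{.spec.template.spec.serviceAccountName}'
  # Can that service account watch PolicyEndpoints?
  kubectl auth can-i watch policyendpoints.networking.k8s.aws \
      --as=system:serviceaccount:kube-system:aws-node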

Hi @achevuru,
Sorry for the late response. It was also updated through Terraform, but in my case only the add-on version was set through Terraform; the configmaps, daemonset, and other resources are managed by AWS. I have checked the RBAC for vpc-cni v1.17.1 and the required permissions are present there:
  - apiGroups:
    - networking.k8s.aws
    resources:
    - policyendpoints
    verbs:
    - get
    - list
    - watch
  - apiGroups:
    - networking.k8s.aws
    resources:
    - policyendpoints/status
    verbs:
    - get
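
A related sanity check is whether the PolicyEndpoint CRD itself is registered and served; a sketch, assuming the CRD name used by the upstream controller:

  # Verify the PolicyEndpoint CRD is registered
  kubectl get crd policyendpoints.networking.k8s.aws
  # List PolicyEndpoint objects (may be empty if no Network Policies are in use)
  kubectl get policyendpoints -A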

But I still see the following error in the logs for v1.17.1:
W0509 03:34:41.481449 1 reflector.go:462] pkg/mod/k8s.io/client-go@v0.29.1/tools/cache/reflector.go:229: watch of *v1alpha1.PolicyEndpoint ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

In another cluster running the updated version v1.18.1, I do not see those errors. I suppose it is a version-specific issue.
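
For comparison across clusters, this is roughly how the managed add-on version can be checked and bumped with the AWS CLI; the cluster name and exact build tag below are placeholders:

  # Check the currently installed vpc-cni add-on version
  aws eks describe-addon --cluster-name my-cluster --addon-name vpc-cni \
      --query 'addon.addonVersion' --output text
  # Move to a newer build (version tag is an example)
  aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni \
      --addon-version v1.18.1-eksbuild.1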

I hope the provided info is useful for you.

Thanks