aws / aws-network-policy-agent


High network usage to Kubernetes API

gnuletik opened this issue

What happened:

We have the following kind of NetworkPolicy in our cluster:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: some-name
  namespace: app-namespace
spec:
  podSelector: {}
  ingress:
    - {}
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16

    - to:
        - ipBlock:
            cidr: some-cidr

    # allow kube-dns
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: TCP
        - port: 53
          protocol: UDP

    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: other-namespace
          podSelector:
            matchLabels:
              app: some-app
      ports:
        - protocol: TCP
          port: 8000

When there are ~500 nodes in the cluster, the aws-eks-nodeagent containers receive a lot of traffic from the Kubernetes API, around 300 MB/s across all nodes.

This leads to a lot of packets being dropped by EC2, reported through the pps_allowance_exceeded metric.

(Screenshot attached: 2024-02-14 16:29:47)


What you expected to happen:

Reduced network usage to avoid pps_allowance_exceeded errors.

How to reproduce it (as minimally and precisely as possible):

Set up a similar NetworkPolicy and add nodes to the cluster.

Environment:

  • Kubernetes version (use kubectl version): v1.28.5-eks-5e0fdde
  • CNI Version: v1.16.2
  • Network Policy Agent Version: v1.0.7
  • OS (e.g: cat /etc/os-release): EKS AMI v20240202
  • Kernel (e.g. uname -a): 5.10.205-195.807.amzn2

@gnuletik - What instance types are you using?

@gnuletik aws-eks-nodeagent watches policyEndpoint resources. The Network Policy controller resolves the pod and namespace selectors in the provided policy specs and propagates that information to the individual agents (aws-eks-nodeagent) running on the nodes via these resource objects, so that each agent can set up the appropriate firewall rules for the local pods on its node. Data transfer should only occur when there is a change in the cluster (i.e., pod/node churn), so the amount of data transferred to the nodes depends entirely on your configured network policies, pod count, and pod scale-up/down events. We also split these resources into multiple sub-resources when the number of endpoints (ingress or egress) exceeds 1000 (similar to EndpointSlices), so that we limit the data that needs to be transferred to individual nodes during any related events. With all that being said, we've recently implemented an enhancement in the Network Policy controller (which runs in the EKS control plane) that aggregates IP/port info to further reduce the size of individual policyEndpoint resources; that change should automatically reflect on your cluster(s) in a few weeks.
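For illustration, a policyEndpoint carries the resolved data in roughly the shape below. This is only a sketch based on the public CRD (networking.k8s.aws/v1alpha1); the resource name, IPs, and pod names are made up:

apiVersion: networking.k8s.aws/v1alpha1
kind: PolicyEndpoint
metadata:
  name: some-name-xxxxx              # derived from the parent NetworkPolicy
  namespace: app-namespace
spec:
  policyRef:
    name: some-name
    namespace: app-namespace
  podSelector: {}                    # copied from the policy; matches every pod in the namespace
  podSelectorEndpoints:              # the matching pods, resolved to concrete IPs
    - hostIP: 10.1.0.10              # hypothetical node IP
      podIP: 10.1.0.34               # hypothetical pod IP
      name: some-app-7f9c4
      namespace: app-namespace
  egress:                            # selector-based rules resolved to pod IPs and ports
    - cidr: 10.1.2.53                # e.g. a resolved kube-dns pod IP
      ports:
        - port: 53
          protocol: TCP
        - port: 53
          protocol: UDP

Because selectors are resolved down to individual pod IPs, churn among matching pods rewrites these objects, and each update is streamed to the agents watching them.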

@jayanthvn, we tested several instance types, especially the latest generation of network-optimized variants (c6in, m6in, r6in, m6idn, r6idn).

Using these instance types reduced the number of errors, but we still encountered many pps_allowance_exceeded errors with these instances.

@achevuru, thanks for the explanation! Our use case involves running thousands of short-lived pods, with Karpenter provisioning the nodes. Although the creation/deletion of these short-lived pods should not change the effective network rules on the nodes (they don't communicate with each other), it appears to generate traffic; see the fragment at the end of this comment.

However, it's worth giving it another try after the Network Policy Controller fix is deployed!
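For reference, if I'm reading the policyEndpoint sketch above correctly, the podSelector: {} in our policy makes every pod in app-namespace a selected endpoint, so even pods that talk to nothing would still trigger updates. A hypothetical fragment (same assumed field names as above; IPs and names invented):

spec:
  podSelector: {}                    # matches all pods in app-namespace
  podSelectorEndpoints:
    # each short-lived pod is added here on creation and removed on deletion,
    # so heavy pod churn keeps rewriting this list
    - hostIP: 10.1.0.10              # hypothetical node IP
      podIP: 10.1.0.91               # hypothetical short-lived pod IP
      name: batch-job-abc12
      namespace: app-namespace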

@gnuletik The Network Policy controller fix has now been rolled out to all EKS clusters. If your setup is still intact, can you check and let us know if you observe any change in network usage?

Thanks for the feedback @achevuru!
I re-enabled the network policy agent two days ago and have not encountered pps_allowance_exceeded errors yet.
I'll let you know here if the issue comes back.
Thanks!

Sure, let us know how it goes over the next week or so.

Closing the issue since the PPS allowance is no longer being breached. Please feel free to reopen if the issue reoccurs.