aws / aws-network-policy-agent


High network usage to Kubernetes API

gnuletik opened this issue

What happened:

We have the following kind of NetworkPolicy in our cluster:

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: some-name
  namespace: app-namespace
spec:
  podSelector: {}
  ingress:
    - {}
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 10.0.0.0/8
              - 172.16.0.0/12
              - 192.168.0.0/16

    - to:
        - ipBlock:
            cidr: some-cidr

    # allow kube-dns
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: TCP
        - port: 53
          protocol: UDP

    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: other-namespace
          podSelector:
            matchLabels:
              app: some-app
      ports:
        - protocol: TCP
          port: 8000

When there are ~500 nodes in the cluster, the aws-eks-nodeagent containers receive a lot of traffic from the Kubernetes API, around 300 MB/s across all nodes.

This leads to a lot of packets being dropped by EC2, reported through the pps_allowance_exceeded metric.

(Screenshot attached: 2024-02-14 16:29:47)


What you expected to happen:

Reduced network usage to avoid pps_allowance_exceeded errors.

How to reproduce it (as minimally and precisely as possible):

Set up a similar NetworkPolicy and add nodes to the cluster.

Environment:

  • Kubernetes version (use kubectl version): v1.28.5-eks-5e0fdde
  • CNI Version: v1.16.2
  • Network Policy Agent Version: v1.0.7
  • OS (e.g: cat /etc/os-release): EKS AMI v20240202
  • Kernel (e.g. uname -a): 5.10.205-195.807.amzn2

@gnuletik - What instance types are you using?

@gnuletik aws-eks-nodeagent watches policyEndpoint resources. The Network Policy controller resolves the pod and namespace selectors in the provided policy specs and propagates that information to the individual agents (aws-eks-nodeagent) running on the nodes via these resource objects, so that each agent can set up the appropriate firewall rules for the local pods on its node. Data transfer should only occur when there is a change in the cluster (i.e., pod/node churn), so the amount of data transferred to the nodes depends entirely on your configured network policies, pod count, and pod scale-up/down events. We also split these resources into multiple sub-resources when the number of endpoints (ingress or egress) exceeds 1000 (similar to EndpointSlices), so that we limit the data that needs to be transferred to individual nodes during any related events. With all that being said, we've recently implemented an enhancement in the Network Policy controller (which runs in the EKS control plane) that aggregates IP/port info to further reduce the size of individual policyEndpoint resources; that change should automatically reflect on your cluster(s) in a few weeks.
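For illustration, a policyEndpoint carries the resolved data in roughly the shape below. This is only a sketch based on the public CRD (networking.k8s.aws/v1alpha1); the resource name, IPs, and pod names are made up:

apiVersion: networking.k8s.aws/v1alpha1
kind: PolicyEndpoint
metadata:
  name: some-name-xxxxx              # derived from the parent NetworkPolicy
  namespace: app-namespace
spec:
  policyRef:
    name: some-name
    namespace: app-namespace
  podSelector: {}                    # copied from the policy; matches every pod in the namespace
  podSelectorEndpoints:              # the matching pods, resolved to concrete IPs
    - hostIP: 10.1.0.10              # hypothetical node IP
      podIP: 10.1.0.34               # hypothetical pod IP
      name: some-app-7f9c4
      namespace: app-namespace
  egress:                            # selector-based rules resolved to pod IPs and ports
    - cidr: 10.1.2.53                # e.g. a resolved kube-dns pod IP
      ports:
        - port: 53
          protocol: TCP
        - port: 53
          protocol: UDP

Because selectors are resolved down to individual pod IPs, churn among matching pods rewrites these objects, and each update is streamed to the agents watching them.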

@jayanthvn, we tested several instance types, especially the latest generation of network-optimized variants (c6in, m6in, r6in, m6idn, r6idn).

Using these instance types reduced the number of errors, but we still encountered many pps_allowance_exceeded errors with these instances.

@achevuru, thanks for the explanation! Our use case involves running thousands of short-lived pods, with Karpenter provisioning the nodes. Although the creation/deletion of these short-lived pods should not change the effective network rules on the nodes (they don't communicate with each other), it appears to generate traffic; see the fragment at the end of this comment.

However, it's worth giving it another try after the Network Policy Controller fix is deployed!
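For reference, if I'm reading the policyEndpoint sketch above correctly, the podSelector: {} in our policy makes every pod in app-namespace a selected endpoint, so even pods that talk to nothing would still trigger updates. A hypothetical fragment (same assumed field names as above; IPs and names invented):

spec:
  podSelector: {}                    # matches all pods in app-namespace
  podSelectorEndpoints:
    # each short-lived pod is added here on creation and removed on deletion,
    # so heavy pod churn keeps rewriting this list
    - hostIP: 10.1.0.10              # hypothetical node IP
      podIP: 10.1.0.91               # hypothetical short-lived pod IP
      name: batch-job-abc12
      namespace: app-namespace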

@gnuletik The Network Policy controller fix has now been rolled out to all EKS clusters. If your setup is still intact, can you check and let us know if you observe any change in network usage?

Thanks for the feedback @achevuru!
I re-enabled the network policy agent two days ago and have not encountered pps_allowance_exceeded errors yet.
I'll let you know here if the issue comes back.
Thanks!

Sure, let us know how it goes over the next week or so.

Closing the issue since the PPS allowance is no longer being breached. Please feel free to reopen if the issue reoccurs.