aws / aws-network-policy-agent

Apache Zookeeper: Periodic connection loss

xashr opened this issue · comments

What happened:
After migrating from Calico to the Amazon VPC CNI addon, we observed problems with Strimzi Kafka, more precisely with Apache Zookeeper.

Strimzi installs a network policy by default to allow communication between Zookeeper pods (and other Strimzi-related pods).
So in a cluster with no network policies besides this default Zookeeper policy, the following happens in a namespace with 3 Zookeeper replicas:

  • Leader election succeeds and the Zookeeper pods communicate without problems
  • After ~5 minutes the Zookeeper pods lose their connection
  • The connection is established again (leader election)
  • After ~5 minutes the Zookeeper pods lose their connection again
  • ... (repeat)

Attach logs
In the Zookeeper logs the connection loss usually shows up like this:

[myid:2] - ERROR [LearnerHandler-/10.0.112.14:60242:LearnerHandler@714] - Unexpected exception causing shutdown while sock still open
  java.net.SocketTimeoutException: Read timed out  
...
[myid:2] - WARN  [LearnerHandler-/10.0.112.14:60242:LearnerHandler@737] - ******* GOODBYE /10.0.112.14:60242 ******** 

or

[myid:2] - ERROR [LearnerHandler-/10.0.114.144:42736:LearnerHandler@714] - Unexpected exception causing shutdown while sock still open
  java.io.EOFException
...
[myid:2] - WARN  [LearnerHandler-/10.0.114.144:42736:LearnerHandler@737] - ******* GOODBYE /10.0.114.144:42736 ********

What you expected to happen:
Established connections should not be dropped periodically.

How to reproduce it (as minimally and precisely as possible):
The issue is easily reproducible by installing Zookeeper from the Bitnami chart and applying a Strimzi-like network policy:

Install Zookeeper with 3 replicas:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm upgrade --install zookeepertest bitnami/zookeeper --version 10.2.5 --set replicaCount=3 --set logLevel=INFO
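
Before applying the policy, it can help to confirm that all three replicas come up and form a quorum. A minimal check, assuming the pod names and labels that the Bitnami chart generates for the release name zookeepertest:

# Wait until all three Zookeeper pods are Running/Ready
kubectl get pods -l app.kubernetes.io/instance=zookeepertest -w

# Optional: ask each pod for its mode (one leader, two followers);
# zkServer.sh should be on the PATH in the Bitnami image, adjust if not
for i in 0 1 2; do
  kubectl exec zookeepertest-$i -- zkServer.sh status
done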

Install "Strimzi-like" network policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-zookeepertest
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: zookeepertest
    ports:
    - port: 2888
      protocol: TCP
    - port: 3888
      protocol: TCP
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: zookeepertest
    ports:
    - port: 2181
      protocol: TCP
  - ports:
    - port: 9404
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: zookeepertest
  policyTypes:
  - Ingress
status: {}
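
The policy can then be applied and checked in the usual way (the filename is just an example):

# Apply the Strimzi-like policy and verify it selects the Zookeeper pods
kubectl apply -f netpol-zookeepertest.yaml
kubectl describe networkpolicy netpol-zookeepertest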

Observe Zookeeper logs.
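
One way to do that across all three replicas (the grep pattern simply matches the messages quoted in the logs section above):

# Stream logs from all Zookeeper pods and surface the connection-loss messages
kubectl logs -f -l app.kubernetes.io/instance=zookeepertest --prefix \
  | grep -E 'GOODBYE|Read timed out|EOFException'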

Anything else we need to know?:
Something interesting we observed:

  • The issue can be "fixed" by adding an additional network policy that allows ingress on ports in the range ~40000-65000:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-policy
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: zookeepertest
    ports:
    - port: 40000
      endPort: 65000
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: zookeepertest
  policyTypes:
  - Ingress
status: {}

Wild guess as to why this workaround works: the range 40000-65000 covers the source ports of the connections (as seen in the logs above: GOODBYE /10.0.112.14:60242). Maybe there is a bug in the policy agent causing a loss of state after x minutes. After the state loss, traffic to the source port 60242 is no longer known/accepted. With the workaround policy, however, that port range is explicitly allowed.
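
For what it's worth, that guess can be sanity-checked: the workaround range only helps if it actually covers the ephemeral source ports in use (on most Linux hosts the default range is 32768-60999, which contains the 60242 seen above):

# Ephemeral (source) port range as seen from inside a Zookeeper pod
# (pod name assumes the Bitnami chart's default naming)
kubectl exec zookeepertest-0 -- cat /proc/sys/net/ipv4/ip_local_port_range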

Environment:

  • AWS EKS, Kubernetes 1.26
  • CNI Version: amazon-k8s-cni:v1.15.4-eksbuild.1, aws-network-policy-agent:v1.0.6-eksbuild.1
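
For reference, both image versions can be read from the aws-node DaemonSet, which runs the VPC CNI and the network policy agent containers:

# List the images used by the aws-node DaemonSet (VPC CNI + network policy agent)
kubectl describe daemonset aws-node -n kube-system | grep Image: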

@xashr - This is similar to #144. We have a fix for this and I can provide you a release candidate image if you are willing to try it out.

@jayanthvn - Is there an RC newer than v1.0.7-rc1? It sounds like #144 is not fixed yet, according to Rez0k?

Will you be able to try this image -

<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3

Please make sure you replace the account number and region.
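
A minimal way to swap in the RC image, assuming the network policy agent runs as the aws-eks-nodeagent container of the aws-node DaemonSet (the container name may differ depending on how the addon is managed):

# Point the node agent container at the release candidate image
kubectl set image daemonset/aws-node -n kube-system \
  aws-eks-nodeagent=<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3

# Watch the rollout across nodes
kubectl rollout status daemonset/aws-node -n kube-system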

@jayanthvn We are staying with Calico for now, but I ran a short test with the rc3 image in a separate cluster. The issue seems to be solved with that image. Thanks!

Thanks for trying out the image. Please feel free to reach out if you are having issues and we will be happy to help.