aws / aws-network-policy-agent

Apache Zookeeper: Periodic connection loss

xashr opened this issue · comments

What happened:
After migrating from Calico to the Amazon VPC CNI addon, we observed problems with Strimzi Kafka, more precisely with Apache Zookeeper.

Strimzi installs a network policy by default to allow communication between Zookeeper pods (and other Strimzi-related pods).
So in a cluster with no network policies besides this default Zookeeper policy, the following happens in a namespace with 3 Zookeeper replicas:

  • Leader election succeeds and the Zookeeper pods communicate without problems
  • After ~5 minutes the Zookeeper pods lose their connection
  • The connection is established again (leader election)
  • After ~5 minutes the Zookeeper pods lose their connection again
  • ... (repeat)

Attach logs
In the Zookeeper logs the connection loss usually shows up like this:

[myid:2] - ERROR [LearnerHandler-/10.0.112.14:60242:LearnerHandler@714] - Unexpected exception causing shutdown while sock still open
  java.net.SocketTimeoutException: Read timed out  
...
[myid:2] - WARN  [LearnerHandler-/10.0.112.14:60242:LearnerHandler@737] - ******* GOODBYE /10.0.112.14:60242 ******** 

or

[myid:2] - ERROR [LearnerHandler-/10.0.114.144:42736:LearnerHandler@714] - Unexpected exception causing shutdown while sock still open
  java.io.EOFException
...
[myid:2] - WARN  [LearnerHandler-/10.0.114.144:42736:LearnerHandler@737] - ******* GOODBYE /10.0.114.144:42736 ********

What you expected to happen:
Established connections should not be dropped periodically.

How to reproduce it (as minimally and precisely as possible):
The issue is easily reproducible by installing Zookeeper from the Bitnami chart and applying a Strimzi-like network policy:

Install Zookeeper with 3 replicas:

helm repo add bitnami https://charts.bitnami.com/bitnami
helm upgrade --install zookeepertest bitnami/zookeeper --version 10.2.5 --set replicaCount=3 --set logLevel=INFO
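
Before applying the policy, it can help to confirm that all three replicas come up and form a quorum. A minimal check, assuming the pod names and labels that the Bitnami chart generates for the release name zookeepertest:

# Wait until all three Zookeeper pods are Running/Ready
kubectl get pods -l app.kubernetes.io/instance=zookeepertest -w

# Optional: ask each pod for its mode (one leader, two followers);
# zkServer.sh should be on the PATH in the Bitnami image, adjust if not
for i in 0 1 2; do
  kubectl exec zookeepertest-$i -- zkServer.sh status
done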

Install "Strimzi-like" network policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-zookeepertest
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: zookeepertest
    ports:
    - port: 2888
      protocol: TCP
    - port: 3888
      protocol: TCP
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: zookeepertest
    ports:
    - port: 2181
      protocol: TCP
  - ports:
    - port: 9404
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: zookeepertest
  policyTypes:
  - Ingress
status: {}
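
The policy can then be applied and checked in the usual way (the filename is just an example):

# Apply the Strimzi-like policy and verify it selects the Zookeeper pods
kubectl apply -f netpol-zookeepertest.yaml
kubectl describe networkpolicy netpol-zookeepertest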

Observe Zookeeper logs.
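
One way to do that across all three replicas (the grep pattern simply matches the messages quoted in the logs section above):

# Stream logs from all Zookeeper pods and surface the connection-loss messages
kubectl logs -f -l app.kubernetes.io/instance=zookeepertest --prefix \
  | grep -E 'GOODBYE|Read timed out|EOFException'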

Anything else we need to know?:
Something interesting we observed:

  • The issue can be "fixed" by adding an additional network policy that allows ingress on ports in the range ~40000-65000:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-policy
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: zookeepertest
    ports:
    - port: 40000
      endPort: 65000
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: zookeepertest
  policyTypes:
  - Ingress
status: {}

Wild guess as to why this workaround works: the range 40000-65000 covers the source ports of the connections (as seen in the logs above: GOODBYE /10.0.112.14:60242). Maybe there is a bug in the policy agent causing a loss of state after x minutes. After the state loss, traffic to the source port 60242 is no longer known/accepted. With the workaround policy, however, that port range is explicitly allowed.
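
For what it's worth, that guess can be sanity-checked: the workaround range only helps if it actually covers the ephemeral source ports in use (on most Linux hosts the default range is 32768-60999, which contains the 60242 seen above):

# Ephemeral (source) port range as seen from inside a Zookeeper pod
# (pod name assumes the Bitnami chart's default naming)
kubectl exec zookeepertest-0 -- cat /proc/sys/net/ipv4/ip_local_port_range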

Environment:

  • AWS EKS, Kubernetes 1.26
  • CNI Version: amazon-k8s-cni:v1.15.4-eksbuild.1, aws-network-policy-agent:v1.0.6-eksbuild.1
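
For reference, both image versions can be read from the aws-node DaemonSet, which runs the VPC CNI and the network policy agent containers:

# List the images used by the aws-node DaemonSet (VPC CNI + network policy agent)
kubectl describe daemonset aws-node -n kube-system | grep Image: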

@xashr - This is similar to #144. We have a fix for this and I can provide you a release candidate image if you are willing to try it out.

@jayanthvn - Is there an RC newer than v1.0.7-rc1? It sounds like #144 is not fixed yet, according to Rez0k?

Will you be able to try this image -

<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3

Please make sure you replace the account number and region.
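
A minimal way to swap in the RC image, assuming the network policy agent runs as the aws-eks-nodeagent container of the aws-node DaemonSet (the container name may differ depending on how the addon is managed):

# Point the node agent container at the release candidate image
kubectl set image daemonset/aws-node -n kube-system \
  aws-eks-nodeagent=<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3

# Watch the rollout across nodes
kubectl rollout status daemonset/aws-node -n kube-system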

@jayanthvn We are staying with Calico for now, but I ran a short test with the rc3 image in a separate cluster. The issue seems to be solved with that image. Thanks!

Thanks for trying out the image. Please feel free to reach out if you are having issues and we will be happy to help.