Apache Zookeeper: Periodic connection loss
xashr opened this issue
What happened:
After migrating from Calico to Amazon VPC CNI Addon, we observed problems with Strimzi Kafka, more precisely with Apache Zookeeper.
Strimzi installs a network policy by default to allow communication between Zookeeper pods (and other Strimzi related pods).
So in a cluster with no network policies besides this default Zookeeper policy, the following happens in a namespace with 3 Zookeeper replicas:
- Leader election succeeds; Zookeeper pods communicate without problems
- After ~5 minutes the Zookeeper pods lose connection
- The connection is re-established (leader election)
- After ~5 minutes the pods lose connection again
- ... (repeat)
Attach logs
In the Zookeeper logs the connection loss usually shows up like this:
[myid:2] - ERROR [LearnerHandler-/10.0.112.14:60242:LearnerHandler@714] - Unexpected exception causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out
...
[myid:2] - WARN [LearnerHandler-/10.0.112.14:60242:LearnerHandler@737] - ******* GOODBYE /10.0.112.14:60242 ********
or
[myid:2] - ERROR [LearnerHandler-/10.0.114.144:42736:LearnerHandler@714] - Unexpected exception causing shutdown while sock still open
java.io.EOFException
...
[myid:2] - WARN [LearnerHandler-/10.0.114.144:42736:LearnerHandler@737] - ******* GOODBYE /10.0.114.144:42736 ********
What you expected to happen:
Established connections are not periodically dropped.
How to reproduce it (as minimally and precisely as possible):
We were able to reproduce it easily by installing Zookeeper from Bitnami and applying a Strimzi-like network policy:
Install Zookeeper with 3 replicas:
helm repo add bitnami https://charts.bitnami.com/bitnami
helm upgrade --install zookeepertest bitnami/zookeeper --version 10.2.5 --set replicaCount=3 --set logLevel=INFO
Install "Strimzi-like" network policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: netpol-zookeepertest
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: zookeepertest
    ports:
    - port: 2888
      protocol: TCP
    - port: 3888
      protocol: TCP
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: zookeepertest
    ports:
    - port: 2181
      protocol: TCP
  - ports:
    - port: 9404
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: zookeepertest
  policyTypes:
  - Ingress
status: {}
Observe the Zookeeper logs.
Anything else we need to know?:
Something interesting we observed:
- The issue can be "fixed" by adding an additional network policy that allows ingress on ports in the range ~40000-65000:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-policy
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app.kubernetes.io/instance: zookeepertest
    ports:
    - port: 40000
      endPort: 65000
      protocol: TCP
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: zookeepertest
  policyTypes:
  - Ingress
status: {}
Wild guess why this workaround works: the range 40000-65000 covers the source ports of the connections (as you can see in the logs above: GOODBYE /10.0.112.14:60242). Maybe there is a bug in the policy agent that causes a loss of state after some minutes; once that state is lost, traffic to source port 60242 is no longer known/accepted. With the workaround policy, however, that port range is explicitly allowed.
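To illustrate the guess above: the source ports seen in the logs (60242, 42736) fall inside the kernel's ephemeral port range, which is what a client-side TCP connection is allocated from and what the workaround range roughly covers. A minimal sketch (assuming a Linux host; the default range is typically 32768-60999, but yours may differ):

```python
import socket

# Read the kernel's ephemeral (client source) port range.
# On most Linux distributions this defaults to 32768 60999,
# which overlaps the 40000-65000 range from the workaround policy.
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    lo, hi = map(int, f.read().split())

# Open a loopback TCP connection and inspect the client-side
# source port the kernel picked for it.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket()
client.connect(server.getsockname())
src_port = client.getsockname()[1]

print(f"ephemeral range: {lo}-{hi}, source port used: {src_port}")

client.close()
server.close()
```

If the policy agent drops its connection-tracking state, return traffic addressed to such an ephemeral source port would need an explicit ingress allow rule, which would explain why the broad 40000-65000 policy masks the problem.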
Environment:
- AWS EKS, Kubernetes 1.26
- CNI Version: amazon-k8s-cni:v1.15.4-eksbuild.1, aws-network-policy-agent:v1.0.6-eksbuild.1
@jayanthvn - Is there an RC newer than v1.0.7-rc1? It sounds like #144 is not fixed yet, according to Rez0k.
Will you be able to try this image?
<account-number>.dkr.ecr.<region>.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc3
Please make sure you replace the account number and region.
@jayanthvn We are staying with Calico for now, but I ran a short test with the rc3 image in a separate cluster. The issue seems to be solved with that image. Thanks!
Thanks for trying out the image. Please feel free to reach out if you are having issues and we will be happy to help.