Liveness and Readiness Probes are still blocked in v1.0.5
jinzishuai opened this issue
What happened:
We have been trying to create EKS v1.28 clusters with network policies. To get the network policy feature we have to use the AWS VPC CNI plugin, and in the process we reproduced the same bug as #56 and aws/amazon-vpc-cni-k8s#2571. We were happy to read that the problem had been fixed in CNI v1.15.1. However, the same problem appears to be happening again in the latest CNI v1.15.3 (network policy agent v1.0.5).
We turned on policy event logs; this is part of the output of kubectl describe ds -n kube-system aws-node:
aws-eks-nodeagent:
  Image:      602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.5-eksbuild.1
  Port:       <none>
  Host Port:  <none>
  Args:
    --enable-ipv6=false
    --enable-network-policy=true
    --enable-cloudwatch-logs=false
    --enable-policy-event-logs=true
    --metrics-bind-addr=:8162
    --health-probe-bind-addr=:8163
Similar to aws/amazon-vpc-cni-k8s#2571, we use Flux v2, which ships with its own network policies by default:
╰─❯ kubectl -n flux-system get networkpolicies -o wide
NAME             POD-SELECTOR                  AGE
allow-egress     <none>                        17h
allow-scraping   <none>                        14h
allow-webhooks   app=notification-controller   15h
What happened is that we still see DENY events in the event log for probes coming from the kubelet itself.
Attach logs
This is observed in /var/log/aws-routed-eni/network-policy-agent.log on one of the EKS worker nodes:
{"level":"info","ts":"2023-11-17T19:26:12.548Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"10.102.25.189","Src Port":44010,"Dest IP":"10.102.24.71","Dest Port":9440,"Proto":"TCP","Verdict":"DENY"}
Note that:
- 10.102.25.189 is the IP of this EKS worker node
- 10.102.24.71 is the IP of the flux helm-controller pod
What this means is that the kubelet's probe for this pod at http://10.102.24.71:9440/healthz is denied, which can lead to a restart of the pod. Unlike before the v1.15.1 fix, this no longer happens persistently, but intermittently. Still, we see our pods restarted for no good reason from time to time.
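For anyone chasing the same symptom, a quick way to see how often kubelet probes are being denied is to summarize the DENY flow records in the agent's event log. A minimal sketch, assuming the JSON line format shown above (the `deny_summary` helper name is mine, not part of the agent):

```shell
# deny_summary LOGFILE
# Print one line per distinct denied destination, prefixed by how many
# times it was denied, from flow records like the one quoted above.
# Uses only grep/sort/uniq so it runs on a bare worker node.
deny_summary() {
  grep -o '"Dest IP":"[^"]*","Dest Port":[0-9]*,"Proto":"[A-Z]*","Verdict":"DENY"' "$1" \
    | sort | uniq -c | sort -rn
}

# Typical invocation on an EKS worker node (log path from above):
# deny_summary /var/log/aws-routed-eni/network-policy-agent.log
```

Repeated hits on a pod's probe port (9440 for the Flux controllers here) would line up with the intermittent restarts described above.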
What you expected to happen: None of the kubelet probe traffic should have been denied at all.
To be fair, these denies happen a lot less frequently than what we used to see before CNI v1.15.1 (as reported in #56 and aws/amazon-vpc-cni-k8s#2571).
How to reproduce it (as minimally and precisely as possible): Create an EKS 1.28 cluster with Flux 2, which has network policies enabled by default. I don't think Flux 2 itself is the problem here (just as with the old bug reported in aws/amazon-vpc-cni-k8s#2571).
Anything else we need to know?: NA
Environment:
- Kubernetes version (use kubectl version): 1.27
- CNI Version: 1.15.3 (the same error was reproduced in v1.15.1)
- Network Policy Agent Version: 1.0.5
- OS (e.g. cat /etc/os-release):
sh-4.2$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
- Kernel (e.g. uname -a):
sh-4.2$ uname -a
Linux ip-10-102-25-189 5.10.186-179.751.amzn2.x86_64 #1 SMP Tue Aug 1 20:51:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
@jinzishuai - We are also reproducing this locally, but so far no restarts seen after 6h.
Unlike before the v1.15.1 fix, this no longer happens persistently, but intermittently.
When you say intermittent, after how long did you notice a restart? Does aws-node itself restart too?
Same issue here
- CNI Version: v1.15.4-eksbuild.1
- Network Policy Agent Version: v1.0.6-eksbuild.1
- Bottlerocket v1.6
Hi there! Seeing the same issue; the environment is almost identical to @yurii-kryvosheia's:
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
Linux ip-10-40-1-157.ec2.internal 6.1.59 #1 SMP Thu Nov 9 08:11:39 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
@jayanthvn the flux workloads got restarted many times over the weekend:
╰─❯ kubectl get pods -n flux-system
NAME READY STATUS RESTARTS AGE
helm-controller-67f7b876cd-ws57l 1/1 Running 117 (3h48m ago) 2d20h
image-automation-controller-65887476b7-9qkp7 1/1 Running 54 (8h ago) 2d20h
image-reflector-controller-57847dc9cf-6qjgw 1/1 Running 1 (6h19m ago) 2d20h
kustomize-controller-6995fc8679-fv2tt 1/1 Running 333 (70m ago) 2d20h
notification-controller-5dbc9fc9c4-h8rkb 1/1 Running 61 (86m ago) 2d20h
source-controller-79fccc9df4-8gb29 1/1 Running 0 2d20h
There is no restart of aws-node pods at all.
@jinzishuai We couldn't reproduce this on our end; Flux workloads were up and running through the weekend. Was there any other event on the cluster (e.g., pod/node scale up/down), or do you see this on a stable setup?
@jinzishuai - Can you email us the node log bundle? You can collect node logs via /opt/cni/bin/aws-cni-support.sh and mail them to k8s-awscni-triage@amazon.com, along with the describe output of the policyEndpoint resources and the configured network policies. For reference, this is my cluster with the 1.0.5 agent, with no restarts seen:
flux-system helm-controller-57d8957947-7ltkg 1/1 Running 0 44h
flux-system image-automation-controller-c84956fbd-2wb9t 1/1 Running 0 44h
flux-system image-reflector-controller-86d47b689f-42nx5 1/1 Running 0 44h
flux-system kustomize-controller-858996fc8d-xrk2w 1/1 Running 0 44h
flux-system notification-controller-ddf44665d-h78kl 1/1 Running 0 44h
flux-system source-controller-56ccbf8db8-bczps 1/1 Running 0 44h
kube-system aws-node-cqbh2 2/2 Running 0 44h
kube-system aws-node-gxh2q 2/2 Running 0 44h
No issues seen on Bottlerocket either. Once we get the logs and the requested output, we will review and get back. I have a few tests still running and so far so good:
flux-system helm-controller-57d8957947-h7tmh 1/1 Running 0 79m
flux-system image-automation-controller-c84956fbd-xdzvd 1/1 Running 0 79m
flux-system image-reflector-controller-86d47b689f-dcd7q 1/1 Running 0 79m
flux-system kustomize-controller-858996fc8d-kdk2b 1/1 Running 0 79m
flux-system notification-controller-ddf44665d-p2z7p 1/1 Running 0 79m
flux-system source-controller-56ccbf8db8-722vv 1/1 Running 0 79m
Thanks. I've sent that email.
@jayanthvn just to clarify: you are able to reproduce the DENY events but don't see the pod restart, right?
IMHO, the DENY should never have happened, regardless of whether it triggers restarts. Does that make sense?
Oh, I should have clarified: I am not seeing any DENY events, and no restarts either.
Just in case: in order to see the deny events, you have to turn on the event log flag with --enable-policy-event-logs=true. I assume you did that, @jayanthvn?
Yes :) We will review the logs and get back to you.
Just a quick update: I was able to repro. In a few scale up/down scenarios the map entry was getting overwritten, and it looks like the dynamic map size increase, or a synchronization issue, is leading to this undefined behavior and unpredictable results. We have a possible workaround; it is holding on our test cluster, and we will continue running a few more tests. If I can generate an RC image, will you be able to test it on your cluster as well?
@jayanthvn yes, I still have my test environment and will be able to test there.
Thanks, I will keep the tests running and will share the release candidate (RC) image after the holidays.
@jinzishuai - The v1.0.7-rc1 tag is available. You can replace the aws-eks-nodeagent container image in the aws-node DaemonSet with the v1.0.7-rc1 tag. For example:
- name: aws-eks-nodeagent
  image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1
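Instead of editing the manifest by hand, the same image swap can also be sketched as a single kubectl command (the registry/region in the URI reuses the us-west-2 example value from above; substitute the ECR registry for your region):

```shell
# Point the aws-eks-nodeagent container of the aws-node DaemonSet at the RC tag.
kubectl -n kube-system set image daemonset/aws-node \
  aws-eks-nodeagent=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1

# Watch the DaemonSet roll the new image out across the nodes.
kubectl -n kube-system rollout status daemonset/aws-node
```

This requires a live cluster, so treat it as an illustrative fragment rather than something to paste blindly.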
Please let me know how the fix is holding up..
thank you.
I've deployed the new image:
aws-eks-nodeagent:
  Image:  602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1
and I've restarted all the pods in the flux-system namespace so that the restart counts all start fresh at 0.
╰─❯ kubectl get pods -n flux-system
NAME READY STATUS RESTARTS AGE
helm-controller-67f7b876cd-7dff4 1/1 Running 0 2m21s
image-automation-controller-65887476b7-w95jw 1/1 Running 0 108m
image-reflector-controller-57847dc9cf-5hbr8 1/1 Running 0 2m17s
kustomize-controller-6995fc8679-qwl6n 1/1 Running 0 108m
notification-controller-5dbc9fc9c4-bsz26 1/1 Running 0 2m14s
source-controller-79fccc9df4-6jbwp 1/1 Running 0 2m11s
I'll monitor this over the weekend and see if any restart happens.
So far so good:
╰─❯ kubectl get pods -n flux-system
NAME READY STATUS RESTARTS AGE
helm-controller-67f7b876cd-7dff4 1/1 Running 0 11h
image-automation-controller-65887476b7-w95jw 1/1 Running 0 13h
image-reflector-controller-57847dc9cf-5hbr8 1/1 Running 0 11h
kustomize-controller-6995fc8679-qwl6n 1/1 Running 0 13h
notification-controller-5dbc9fc9c4-bsz26 1/1 Running 0 11h
source-controller-79fccc9df4-6jbwp 1/1 Running 0 11h
Still looking good:
╰─❯ kubectl get pods -n flux-system
NAME READY STATUS RESTARTS AGE
helm-controller-67f7b876cd-7dff4 1/1 Running 0 36h
image-automation-controller-65887476b7-w95jw 1/1 Running 0 38h
image-reflector-controller-57847dc9cf-5hbr8 1/1 Running 0 36h
kustomize-controller-6995fc8679-qwl6n 1/1 Running 0 38h
notification-controller-5dbc9fc9c4-bsz26 1/1 Running 0 36h
source-controller-79fccc9df4-6jbwp 1/1 Running 0 36h
All right, it worked well throughout the weekend:
╰─❯ kubectl get pods -n flux-system
NAME READY STATUS RESTARTS AGE
helm-controller-67f7b876cd-7dff4 1/1 Running 0 2d12h
image-automation-controller-65887476b7-w95jw 1/1 Running 0 2d14h
image-reflector-controller-57847dc9cf-5hbr8 1/1 Running 0 2d12h
kustomize-controller-6995fc8679-qwl6n 1/1 Running 0 2d14h
notification-controller-5dbc9fc9c4-bsz26 1/1 Running 0 2d12h
source-controller-79fccc9df4-6jbwp 1/1 Running 0 2d12h
Thanks for confirming, @jinzishuai. Let's monitor for a few more days, and we will run a few regression tests.
We have been running v1.0.7-rc1 for 24h with no issues too 👍🏼
Closing as fixed by v1.0.7
Really strange. Two of our clusters are still experiencing issues.
Version: v1.0.7-eksbuild.1
NAME READY STATUS RESTARTS AGE
helm-controller-6867c97684-kzc66 1/1 Running 22 (18h ago) 11d
image-automation-controller-596bbfdf57-r5rh7 1/1 Running 16 (98s ago) 11d
image-reflector-controller-5c9cb6d8b7-pn4vk 1/1 Running 3 (10h ago) 11d
kustomize-controller-7754fcdf86-tmfgl 1/1 Running 8 (4d2h ago) 11d
notification-controller-77f6d56594-jrdrr 1/1 Running 0 11d
source-controller-794ff95db-szzc7 1/1 Running 6 (9d ago) 11d
@yurii-kryvosheia - Can you email us the node log bundle? You can collect node logs by running /opt/cni/bin/aws-cni-support.sh on one of the nodes, ideally the node with the helm-controller-6867c97684-kzc66 pod, and mail it to k8s-awscni-triage@amazon.com. We are releasing v1.0.8 with certain fixes, so we would like to review the logs. The v1.0.8-rc3 tag is available if you would like to verify.
@jayanthvn we use Bottlerocket, and it seems incompatible with that script; there are a lot of dependencies. I could generate a report, but it throws many errors. Is it worth sending such a report?
Today I sent the CNI logs to k8s-awscni-triage@amazon.com. I can duplicate them in this issue for posterity.
@jayanthvn