aws / aws-network-policy-agent


Liveness and Readiness Probes are still blocked in v1.0.5

jinzishuai opened this issue · comments

What happened:
We have been trying to create EKS v1.28 clusters with network policies. To get the network policy feature, we have to use the AWS VPC CNI plugin, and in the process we reproduced the same bug as #56 and aws/amazon-vpc-cni-k8s#2571. We were happy to read that the problem had been fixed in CNI v1.15.1. However, the same problem seems to be happening again with the latest CNI v1.15.3 (network policy agent v1.0.5).

We turned on policy event logs; this is part of the output of kubectl describe ds -n kube-system aws-node:

   aws-eks-nodeagent:
    Image:      602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.5-eksbuild.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --enable-ipv6=false
      --enable-network-policy=true
      --enable-cloudwatch-logs=false
      --enable-policy-event-logs=true
      --metrics-bind-addr=:8162
      --health-probe-bind-addr=:8163

Similar to aws/amazon-vpc-cni-k8s#2571, we use Flux v2, which ships its own network policies by default:

╰─❯ kubectl -n flux-system get networkpolicies -o wide
NAME             POD-SELECTOR                  AGE
allow-egress     <none>                        17h
allow-scraping   <none>                        14h
allow-webhooks   app=notification-controller   15h

What happens is that we still see some DENY events in the event log for probes coming from the kubelet itself.

Attach logs

This is observed in the /var/log/aws-routed-eni/network-policy-agent.log on one of the EKS worker nodes:

{"level":"info","ts":"2023-11-17T19:26:12.548Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"10.102.25.189","Src Port":44010,"Dest IP":"10.102.24.71","Dest Port":9440,"Proto":"TCP","Verdict":"DENY"}

Note that

  • 10.102.25.189 is the IP of this EKS worker node
  • 10.102.24.71 is the IP of the Flux helm-controller pod

What this means is that the kubelet's probe of this pod at http://10.102.24.71:9440/healthz is denied, and this can lead to the pod being restarted. Unlike the case before the fix of v1.15.1, this no longer happens persistently, but rather intermittently. Still, we see our pods getting restarted for no good reason from time to time.
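For context, the probe the kubelet is executing can be read straight off the deployment; a quick check along these lines shows the /healthz-on-9440 probe referenced above (deployment name taken from the pod list in this thread, first container assumed):

# Print the helm-controller liveness probe that the kubelet runs against the pod
kubectl -n flux-system get deployment helm-controller \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}'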

What you expected to happen: I don't think any of the kubelet probe traffic should have been denied at all.

To be fair, I think these denies happen a lot less frequently than what we used to see before CNI v1.15.1 (as reported in #56 and aws/amazon-vpc-cni-k8s#2571).

How to reproduce it (as minimally and precisely as possible): Create an EKS 1.28 cluster with the network policy feature enabled and install Flux v2, which comes with network policies by default. But I don't think Flux v2 is the problem here (just like the old bug reported in aws/amazon-vpc-cni-k8s#2571).
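For completeness, enabling the network policy feature on the managed VPC CNI add-on looks roughly like this (a sketch; cluster name and region are placeholders):

# Enable network policy enforcement via the managed VPC CNI add-on configuration
aws eks update-addon \
  --cluster-name my-cluster \
  --region eu-west-1 \
  --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy": "true"}'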

Anything else we need to know?: NA

Environment:

  • Kubernetes version (use kubectl version): 1.27
  • CNI Version: 1.15.3 (the same error was reproduced in v1.15.1)
  • Network Policy Agent Version: 1.0.5
  • OS (e.g: cat /etc/os-release):
sh-4.2$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Kernel (e.g. uname -a):
sh-4.2$ uname -a
Linux ip-10-102-25-189 5.10.186-179.751.amzn2.x86_64 #1 SMP Tue Aug 1 20:51:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

@jinzishuai - We are also reproducing this locally, but so far no restarts have been seen in 6h.

Unlike the case before the fix of v1.15.1, this no longer happens persistently, but rather intermittently.

When you say intermittent, after how long did you notice a restart? Does aws-node restart as well?

Same issue here

  • CNI Version: v1.15.4-eksbuild.1
  • Network Policy Agent Version: v1.0.6-eksbuild.1
  • Bottlerocket v1.6

Hi there! Seeing the same issue; the environment is almost identical to @yurii-kryvosheia's:

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
Linux ip-10-40-1-157.ec2.internal 6.1.59 #1 SMP Thu Nov  9 08:11:39 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

@jayanthvn the Flux workloads got restarted many times over the weekend:

╰─❯ kubectl get pods -n flux-system
NAME                                           READY   STATUS    RESTARTS          AGE
helm-controller-67f7b876cd-ws57l               1/1     Running   117 (3h48m ago)   2d20h
image-automation-controller-65887476b7-9qkp7   1/1     Running   54 (8h ago)       2d20h
image-reflector-controller-57847dc9cf-6qjgw    1/1     Running   1 (6h19m ago)     2d20h
kustomize-controller-6995fc8679-fv2tt          1/1     Running   333 (70m ago)     2d20h
notification-controller-5dbc9fc9c4-h8rkb       1/1     Running   61 (86m ago)      2d20h
source-controller-79fccc9df4-8gb29             1/1     Running   0                 2d20h

There is no restart of aws-node pods at all.

@jinzishuai We couldn't reproduce this on our end. Flux workloads were up and running through the weekend. Was there any other event on the cluster (e.g., pod/node scale up/down), or do you see this on a stable setup?

@jinzishuai - This is on my cluster with the 1.0.5 agent and no restarts seen (output below). Can you email us the node log bundle? You can collect node logs via /opt/cni/bin/aws-cni-support.sh and mail them to k8s-awscni-triage@amazon.com, along with the describe output of the policyEndpoint resources and the configured network policies.

flux-system   helm-controller-57d8957947-7ltkg              1/1     Running   0               44h
flux-system   image-automation-controller-c84956fbd-2wb9t   1/1     Running   0               44h
flux-system   image-reflector-controller-86d47b689f-42nx5   1/1     Running   0               44h
flux-system   kustomize-controller-858996fc8d-xrk2w         1/1     Running   0               44h
flux-system   notification-controller-ddf44665d-h78kl       1/1     Running   0               44h
flux-system   source-controller-56ccbf8db8-bczps            1/1     Running   0               44h
kube-system   aws-node-cqbh2                                2/2     Running   0               44h
kube-system   aws-node-gxh2q                                2/2     Running   0               44h

No issues seen on Bottlerocket either. I have a few tests still running and so far so good. Once we get the logs and the requested output we will review and get back...

flux-system   helm-controller-57d8957947-h7tmh              1/1     Running   0          79m
flux-system   image-automation-controller-c84956fbd-xdzvd   1/1     Running   0          79m
flux-system   image-reflector-controller-86d47b689f-dcd7q   1/1     Running   0          79m
flux-system   kustomize-controller-858996fc8d-kdk2b         1/1     Running   0          79m
flux-system   notification-controller-ddf44665d-p2z7p       1/1     Running   0          79m
flux-system   source-controller-56ccbf8db8-722vv            1/1     Running   0          79m

Can you email us the node log bundle? You can collect node logs via /opt/cni/bin/aws-cni-support.sh and mail them to k8s-awscni-triage@amazon.com, along with the describe output of the policyEndpoint resources and the configured network policies

Thanks. I've sent that email.

@jayanthvn just to clarify: you are able to reproduce the DENY events but don't see the pod restart, right?
IMHO, the DENY should never have happened, regardless of whether it triggers restarts. Does that make sense?

Oh, I should have clarified: I am not seeing any DENY events and no restarts either...

Just in case: in order to see the deny events, you have to turn on the event log flag with --enable-policy-event-logs=true. I assume you did that, @jayanthvn?
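(A quick way to double-check that the flag is actually set on a cluster is to dump the node agent's args from the DaemonSet, for example:)

# Print the aws-eks-nodeagent args; --enable-policy-event-logs=true should appear
kubectl -n kube-system get ds aws-node \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="aws-eks-nodeagent")].args}{"\n"}'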

Just in case: in order to see the deny events, you have to turn on the event log flag with --enable-policy-event-logs=true. I assume you did that, @jayanthvn?

Yes :)..we will review the logs and get back to you.

Just a quick update: I was able to repro, and noticed that in a few scale up/down scenarios the map entry was getting overwritten. It looks like the dynamic map size increase or a synchronization issue is leading to this undefined behavior and unpredictable results. We have a possible workaround that is holding on our test cluster, and we will be continuing a few more tests... If I can generate an RC image, will you be able to test it on your cluster as well?

If I can generate an RC image, will you be able to test it on your cluster as well?

@jayanthvn yes, I still have my test environment and will be able to test there.

Thanks, I will keep the tests running and will share the release candidate (RC) image after the holidays.

@jinzishuai - The v1.0.7-rc1 tag is available. You can replace the aws-eks-nodeagent container image on the aws-node DaemonSet with the v1.0.7-rc1 tag.

For example -

- name: aws-eks-nodeagent
  image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1
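One way to swap it in without hand-editing the manifest is something like the following (a sketch; container and DaemonSet names as in your describe output above):

# Point the aws-eks-nodeagent container of the aws-node DaemonSet at the RC tag
kubectl -n kube-system set image daemonset/aws-node \
  aws-eks-nodeagent=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1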

Please let me know how the fix is holding up..

Please let me know how the fix is holding up..

Thank you.
I've deployed the new image:

  aws-eks-nodeagent:
    Image:      602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1

and I've restarted all the pods in the flux-system namespace so that the restart counts all start fresh at 0.
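(Something along these lines does a namespace-wide restart, for anyone following along:)

# Roll every Deployment in flux-system so all pods come back with RESTARTS 0
kubectl -n flux-system rollout restart deployment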

╰─❯ kubectl get pods -n flux-system
NAME                                           READY   STATUS    RESTARTS   AGE
helm-controller-67f7b876cd-7dff4               1/1     Running   0          2m21s
image-automation-controller-65887476b7-w95jw   1/1     Running   0          108m
image-reflector-controller-57847dc9cf-5hbr8    1/1     Running   0          2m17s
kustomize-controller-6995fc8679-qwl6n          1/1     Running   0          108m
notification-controller-5dbc9fc9c4-bsz26       1/1     Running   0          2m14s
source-controller-79fccc9df4-6jbwp             1/1     Running   0          2m11s

I'll monitor this over the weekend and see if any restart happens.

so far so good

╰─❯ kubectl get pods -n flux-system
NAME                                           READY   STATUS    RESTARTS   AGE
helm-controller-67f7b876cd-7dff4               1/1     Running   0          11h
image-automation-controller-65887476b7-w95jw   1/1     Running   0          13h
image-reflector-controller-57847dc9cf-5hbr8    1/1     Running   0          11h
kustomize-controller-6995fc8679-qwl6n          1/1     Running   0          13h
notification-controller-5dbc9fc9c4-bsz26       1/1     Running   0          11h
source-controller-79fccc9df4-6jbwp             1/1     Running   0          11h

still looking good

╰─❯ kubectl get pods -n flux-system
NAME                                           READY   STATUS    RESTARTS   AGE
helm-controller-67f7b876cd-7dff4               1/1     Running   0          36h
image-automation-controller-65887476b7-w95jw   1/1     Running   0          38h
image-reflector-controller-57847dc9cf-5hbr8    1/1     Running   0          36h
kustomize-controller-6995fc8679-qwl6n          1/1     Running   0          38h
notification-controller-5dbc9fc9c4-bsz26       1/1     Running   0          36h
source-controller-79fccc9df4-6jbwp             1/1     Running   0          36h

All right, it worked well throughout the weekend.

╰─❯ kubectl get pods -n flux-system
NAME                                           READY   STATUS    RESTARTS   AGE
helm-controller-67f7b876cd-7dff4               1/1     Running   0          2d12h
image-automation-controller-65887476b7-w95jw   1/1     Running   0          2d14h
image-reflector-controller-57847dc9cf-5hbr8    1/1     Running   0          2d12h
kustomize-controller-6995fc8679-qwl6n          1/1     Running   0          2d14h
notification-controller-5dbc9fc9c4-bsz26       1/1     Running   0          2d12h
source-controller-79fccc9df4-6jbwp             1/1     Running   0          2d12h

Thanks for confirming, @jinzishuai. Let's monitor for a few more days, and we will run a few regression tests.

We have been running v1.0.7-rc1 for 24h with no issues too 👍🏼

Closing as fixed by v1.0.7

Really strange. Two of our clusters are still experiencing issues.
Version: v1.0.7-eksbuild.1

NAME                                           READY   STATUS    RESTARTS       AGE
helm-controller-6867c97684-kzc66               1/1     Running   22 (18h ago)   11d
image-automation-controller-596bbfdf57-r5rh7   1/1     Running   16 (98s ago)   11d
image-reflector-controller-5c9cb6d8b7-pn4vk    1/1     Running   3 (10h ago)    11d
kustomize-controller-7754fcdf86-tmfgl          1/1     Running   8 (4d2h ago)   11d
notification-controller-77f6d56594-jrdrr       1/1     Running   0              11d
source-controller-794ff95db-szzc7              1/1     Running   6 (9d ago)     11d

@yurii-kryvosheia - Can you email us the node log bundle? You can collect node logs by running the script /opt/cni/bin/aws-cni-support.sh on one of the nodes (ideally the node with the helm-controller-6867c97684-kzc66 pod) and mail it to k8s-awscni-triage@amazon.com. We are releasing v1.0.8 with certain fixes, so we would like to review the logs. The v1.0.8-rc3 tag is available if you would like to verify.
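For reference, the output we are after can be gathered roughly like this (a sketch; policyendpoints is the namespaced custom resource behind the network policies):

# On the node: collect the CNI/node log bundle
sudo /opt/cni/bin/aws-cni-support.sh

# From anywhere with cluster access: dump the policy state to include in the email
kubectl describe policyendpoints -A
kubectl get networkpolicies -A -o yaml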

@jayanthvn we use Bottlerocket, and it seems incompatible with that script; it has a lot of dependencies. I can generate a report, but it throws many errors. Is it worth sending such a report?

Today I've sent the CNI logs to k8s-awscni-triage@amazon.com. I can duplicate them in this issue for posterity.
@jayanthvn