aws / aws-network-policy-agent


Liveness and Readiness Probes are still blocked in v1.0.5

jinzishuai opened this issue · comments

What happened:
We have been trying to create EKS v1.28 clusters with network policies. To get the network policy feature, we have to use the AWS VPC CNI plugin, and in the process we reproduced the same bug as #56 and aws/amazon-vpc-cni-k8s#2571. We were happy to read that the problem had been fixed in CNI v1.15.1. However, the same problem seems to be happening again with the latest CNI v1.15.3 (network policy agent v1.0.5).

We turned on policy event logs; this is part of the output of kubectl describe ds -n kube-system aws-node:

   aws-eks-nodeagent:
    Image:      602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.5-eksbuild.1
    Port:       <none>
    Host Port:  <none>
    Args:
      --enable-ipv6=false
      --enable-network-policy=true
      --enable-cloudwatch-logs=false
      --enable-policy-event-logs=true
      --metrics-bind-addr=:8162
      --health-probe-bind-addr=:8163

Similar to aws/amazon-vpc-cni-k8s#2571, we use Flux v2, which ships its own network policies by default:

╰─❯ kubectl -n flux-system get networkpolicies -o wide
NAME             POD-SELECTOR                  AGE
allow-egress     <none>                        17h
allow-scraping   <none>                        14h
allow-webhooks   app=notification-controller   15h

What happens is that we still see some DENY events in the event log for probes coming from the kubelet itself.

Attach logs

This is observed in the /var/log/aws-routed-eni/network-policy-agent.log on one of the EKS worker nodes:

{"level":"info","ts":"2023-11-17T19:26:12.548Z","logger":"ebpf-client","msg":"Flow Info: ","Src IP":"10.102.25.189","Src Port":44010,"Dest IP":"10.102.24.71","Dest Port":9440,"Proto":"TCP","Verdict":"DENY"}

Note that

  • 10.102.25.189 is the IP of this EKS worker node
  • 10.102.24.71 is the IP of the Flux helm-controller pod

What this means is that the kubelet's probe of this pod at http://10.102.24.71:9440/healthz is denied, and this can lead to the pod being restarted. Unlike the case before the fix of v1.15.1, this no longer happens persistently, but rather intermittently. Still, we see our pods getting restarted for no good reason from time to time.
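For context, the probe the kubelet is executing can be read straight off the deployment; a quick check along these lines shows the /healthz-on-9440 probe referenced above (deployment name taken from the pod list in this thread, first container assumed):

# Print the helm-controller liveness probe that the kubelet runs against the pod
kubectl -n flux-system get deployment helm-controller \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}'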

What you expected to happen: I don't think any of the kubelet probe traffic should have been denied at all.

To be fair, I think these denies happen a lot less frequently than what we used to see before CNI v1.15.1 (as reported in #56 and aws/amazon-vpc-cni-k8s#2571).

How to reproduce it (as minimally and precisely as possible): Create an EKS 1.28 cluster with the network policy feature enabled and install Flux v2, which comes with network policies by default. But I don't think Flux v2 is the problem here (just like the old bug reported in aws/amazon-vpc-cni-k8s#2571).
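For completeness, enabling the network policy feature on the managed VPC CNI add-on looks roughly like this (a sketch; cluster name and region are placeholders):

# Enable network policy enforcement via the managed VPC CNI add-on configuration
aws eks update-addon \
  --cluster-name my-cluster \
  --region eu-west-1 \
  --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy": "true"}'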

Anything else we need to know?: NA

Environment:

  • Kubernetes version (use kubectl version): 1.27
  • CNI Version: 1.15.3 (the same error was reproduced in v1.15.1)
  • Network Policy Agent Version: 1.0.5
  • OS (e.g: cat /etc/os-release):
sh-4.2$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Kernel (e.g. uname -a):
sh-4.2$ uname -a
Linux ip-10-102-25-189 5.10.186-179.751.amzn2.x86_64 #1 SMP Tue Aug 1 20:51:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

@jinzishuai - We are also reproducing this locally, but so far no restarts have been seen in 6h.

Unlike the case before the fix of v1.15.1, this no longer happens persistently, but rather intermittently.

When you say intermittent, after how long did you notice a restart? Does aws-node restart as well?

Same issue here

  • CNI Version: v1.15.4-eksbuild.1
  • Network Policy Agent Version: v1.0.6-eksbuild.1
  • Bottlerocket v1.6

Hi there! Seeing the same issue; the environment is almost identical to @yurii-kryvosheia's:

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
Linux ip-10-40-1-157.ec2.internal 6.1.59 #1 SMP Thu Nov  9 08:11:39 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

@jayanthvn the Flux workloads got restarted many times over the weekend:

╰─❯ kubectl get pods -n flux-system
NAME                                           READY   STATUS    RESTARTS          AGE
helm-controller-67f7b876cd-ws57l               1/1     Running   117 (3h48m ago)   2d20h
image-automation-controller-65887476b7-9qkp7   1/1     Running   54 (8h ago)       2d20h
image-reflector-controller-57847dc9cf-6qjgw    1/1     Running   1 (6h19m ago)     2d20h
kustomize-controller-6995fc8679-fv2tt          1/1     Running   333 (70m ago)     2d20h
notification-controller-5dbc9fc9c4-h8rkb       1/1     Running   61 (86m ago)      2d20h
source-controller-79fccc9df4-8gb29             1/1     Running   0                 2d20h

There is no restart of aws-node pods at all.

@jinzishuai We couldn't reproduce this on our end. Flux workloads were up and running through the weekend. Was there any other event on the cluster (e.g., pod/node scale up/down), or do you see this on a stable setup?

@jinzishuai - This is on my cluster with the 1.0.5 agent and no restarts seen (output below). Can you email us the node log bundle? You can collect node logs via /opt/cni/bin/aws-cni-support.sh and mail them to k8s-awscni-triage@amazon.com, along with the describe output of the policyEndpoint resources and the configured network policies.

flux-system   helm-controller-57d8957947-7ltkg              1/1     Running   0               44h
flux-system   image-automation-controller-c84956fbd-2wb9t   1/1     Running   0               44h
flux-system   image-reflector-controller-86d47b689f-42nx5   1/1     Running   0               44h
flux-system   kustomize-controller-858996fc8d-xrk2w         1/1     Running   0               44h
flux-system   notification-controller-ddf44665d-h78kl       1/1     Running   0               44h
flux-system   source-controller-56ccbf8db8-bczps            1/1     Running   0               44h
kube-system   aws-node-cqbh2                                2/2     Running   0               44h
kube-system   aws-node-gxh2q                                2/2     Running   0               44h

No issues seen on Bottlerocket either. I have a few tests still running and so far so good. Once we get the logs and the requested output we will review and get back...

flux-system   helm-controller-57d8957947-h7tmh              1/1     Running   0          79m
flux-system   image-automation-controller-c84956fbd-xdzvd   1/1     Running   0          79m
flux-system   image-reflector-controller-86d47b689f-dcd7q   1/1     Running   0          79m
flux-system   kustomize-controller-858996fc8d-kdk2b         1/1     Running   0          79m
flux-system   notification-controller-ddf44665d-p2z7p       1/1     Running   0          79m
flux-system   source-controller-56ccbf8db8-722vv            1/1     Running   0          79m

Can you email us the node log bundle? You can collect node logs via /opt/cni/bin/aws-cni-support.sh and mail them to k8s-awscni-triage@amazon.com, along with the describe output of the policyEndpoint resources and the configured network policies

Thanks. I've sent that email.

@jayanthvn just to clarify: you are able to reproduce the DENY events but don't see the pod restart, right?
IMHO, the DENY should never have happened, regardless of whether it triggers restarts. Does that make sense?

Oh, I should have clarified: I am not seeing any DENY events and no restarts either...

Just in case: in order to see the deny events, you have to turn on the event log flag with --enable-policy-event-logs=true. I assume you did that, @jayanthvn?
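(A quick way to double-check that the flag is actually set on a cluster is to dump the node agent's args from the DaemonSet, for example:)

# Print the aws-eks-nodeagent args; --enable-policy-event-logs=true should appear
kubectl -n kube-system get ds aws-node \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="aws-eks-nodeagent")].args}{"\n"}'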

Just in case: in order to see the deny events, you have to turn on the event log flag with --enable-policy-event-logs=true. I assume you did that, @jayanthvn?

Yes :)..we will review the logs and get back to you.

Just a quick update: I was able to repro, and noticed that in a few scale up/down scenarios the map entry was getting overwritten. It looks like the dynamic map size increase or a synchronization issue is leading to this undefined behavior and unpredictable results. We have a possible workaround that is holding on our test cluster, and we will be continuing a few more tests... If I can generate an RC image, will you be able to test it on your cluster as well?

If I can generate an RC image, will you be able to test it on your cluster as well?

@jayanthvn yes, I still have my test environment and will be able to test there.

Thanks, I will keep the tests running and will share the release candidate (RC) image after the holidays.

@jinzishuai - The v1.0.7-rc1 tag is available. You can replace the aws-eks-nodeagent container image on the aws-node DaemonSet with the v1.0.7-rc1 tag.

For example -

- name: aws-eks-nodeagent
  image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1
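One way to swap it in without hand-editing the manifest is something like the following (a sketch; container and DaemonSet names as in your describe output above):

# Point the aws-eks-nodeagent container of the aws-node DaemonSet at the RC tag
kubectl -n kube-system set image daemonset/aws-node \
  aws-eks-nodeagent=602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1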

Please let me know how the fix is holding up..

Please let me know how the fix is holding up..

Thank you.
I've deployed the new image:

  aws-eks-nodeagent:
    Image:      602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-rc1

and I've restarted all the pods in the flux-system namespace so that the restart counts all start fresh at 0.
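(Something along these lines does a namespace-wide restart, for anyone following along:)

# Roll every Deployment in flux-system so all pods come back with RESTARTS 0
kubectl -n flux-system rollout restart deployment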

╰─❯ kubectl get pods -n flux-system
NAME                                           READY   STATUS    RESTARTS   AGE
helm-controller-67f7b876cd-7dff4               1/1     Running   0          2m21s
image-automation-controller-65887476b7-w95jw   1/1     Running   0          108m
image-reflector-controller-57847dc9cf-5hbr8    1/1     Running   0          2m17s
kustomize-controller-6995fc8679-qwl6n          1/1     Running   0          108m
notification-controller-5dbc9fc9c4-bsz26       1/1     Running   0          2m14s
source-controller-79fccc9df4-6jbwp             1/1     Running   0          2m11s

I'll monitor this over the weekend and see if any restart happens.

so far so good

╰─❯ kubectl get pods -n flux-system
NAME                                           READY   STATUS    RESTARTS   AGE
helm-controller-67f7b876cd-7dff4               1/1     Running   0          11h
image-automation-controller-65887476b7-w95jw   1/1     Running   0          13h
image-reflector-controller-57847dc9cf-5hbr8    1/1     Running   0          11h
kustomize-controller-6995fc8679-qwl6n          1/1     Running   0          13h
notification-controller-5dbc9fc9c4-bsz26       1/1     Running   0          11h
source-controller-79fccc9df4-6jbwp             1/1     Running   0          11h

still looking good

╰─❯ kubectl get pods -n flux-system
NAME                                           READY   STATUS    RESTARTS   AGE
helm-controller-67f7b876cd-7dff4               1/1     Running   0          36h
image-automation-controller-65887476b7-w95jw   1/1     Running   0          38h
image-reflector-controller-57847dc9cf-5hbr8    1/1     Running   0          36h
kustomize-controller-6995fc8679-qwl6n          1/1     Running   0          38h
notification-controller-5dbc9fc9c4-bsz26       1/1     Running   0          36h
source-controller-79fccc9df4-6jbwp             1/1     Running   0          36h

All right, it worked well throughout the weekend.

╰─❯ kubectl get pods -n flux-system
NAME                                           READY   STATUS    RESTARTS   AGE
helm-controller-67f7b876cd-7dff4               1/1     Running   0          2d12h
image-automation-controller-65887476b7-w95jw   1/1     Running   0          2d14h
image-reflector-controller-57847dc9cf-5hbr8    1/1     Running   0          2d12h
kustomize-controller-6995fc8679-qwl6n          1/1     Running   0          2d14h
notification-controller-5dbc9fc9c4-bsz26       1/1     Running   0          2d12h
source-controller-79fccc9df4-6jbwp             1/1     Running   0          2d12h

Thanks for confirming, @jinzishuai. Let's monitor for a few more days, and we will run a few regression tests.

We have been running v1.0.7-rc1 for 24h with no issues too 👍🏼

Closing as fixed by v1.0.7

Really strange. Two of our clusters are still experiencing issues.
Version: v1.0.7-eksbuild.1

NAME                                           READY   STATUS    RESTARTS       AGE
helm-controller-6867c97684-kzc66               1/1     Running   22 (18h ago)   11d
image-automation-controller-596bbfdf57-r5rh7   1/1     Running   16 (98s ago)   11d
image-reflector-controller-5c9cb6d8b7-pn4vk    1/1     Running   3 (10h ago)    11d
kustomize-controller-7754fcdf86-tmfgl          1/1     Running   8 (4d2h ago)   11d
notification-controller-77f6d56594-jrdrr       1/1     Running   0              11d
source-controller-794ff95db-szzc7              1/1     Running   6 (9d ago)     11d

@yurii-kryvosheia - Can you email us the node log bundle? You can collect node logs by running the script /opt/cni/bin/aws-cni-support.sh on one of the nodes (ideally the node with the helm-controller-6867c97684-kzc66 pod) and mail it to k8s-awscni-triage@amazon.com. We are releasing v1.0.8 with certain fixes, so we would like to review the logs. The v1.0.8-rc3 tag is available if you would like to verify.
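For reference, the output we are after can be gathered roughly like this (a sketch; policyendpoints is the namespaced custom resource behind the network policies):

# On the node: collect the CNI/node log bundle
sudo /opt/cni/bin/aws-cni-support.sh

# From anywhere with cluster access: dump the policy state to include in the email
kubectl describe policyendpoints -A
kubectl get networkpolicies -A -o yaml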

@jayanthvn we use Bottlerocket, and it seems incompatible with that script; it has a lot of dependencies. I can generate a report, but it throws many errors. Is it worth sending such a report?

Today I've sent the CNI logs to k8s-awscni-triage@amazon.com. I can duplicate them in this issue for posterity.
@jayanthvn