`amd64` binary wrongly copied into `arm64` image, causing Pods to fall into `CrashLoopBackOff` state
guessi opened this issue
What happened:

Seeing `asm_amd64.s` shown in the `arm64` image, which should be `asm_arm64.s`.

Attach logs:

n/a

What you expected to happen:

Pods on Graviton nodes should be `RUNNING`, not `CrashLoopBackOff`.
How to reproduce it (as minimally and precisely as possible):

- Start the agent with `--enable-policy-event-logs=true` set.
- Observe `CrashLoopBackOff` and identify that it ran on a Graviton node:

  ```
  $ kubectl -n kube-system get pods -l k8s-app=aws-node
  kube-system aws-node-qxld8 1/2 CrashLoopBackOff 6 (4m4s ago) ...
  ```

- Log into the node and check the image:

  ```
  # /usr/local/bin/nerdctl -n k8s.io image inspect 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-eksbuild.1 | grep 'Architecture'
  "Architecture": "arm64", # <----------- I can see the image is "arm64".
  ```

- Check the error log for `aws-eks-nodeagent`:

  ```
  $ kubectl -n kube-system logs -f aws-node-qxld8 -c aws-eks-nodeagent
  {"level":"info","ts":"2024-04-07T08:13:07.999Z","caller":"runtime/asm_amd64.s:1650","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
  ^^^^^^^^^
  ```

  But is it normal to see `asm_amd64.s` here?

- Possibly a missing cross-arch file copy in the Dockerfile or Makefile (see the build sketch after this list):
  - https://github.com/aws/aws-network-policy-agent/blob/main/Dockerfile
  - https://github.com/aws/aws-network-policy-agent/blob/main/Makefile#L185-L206
- Removing `--enable-policy-event-logs=true` (or setting it to `false`) brings Pods back to the `RUNNING` state.
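For context, here is a minimal sketch of how a cross-arch Go image build is usually wired; this is an assumption about the build setup, not the project's actual commands, and `example.com/...:dev` is a placeholder tag. With buildx, `TARGETARCH` is exposed to each platform's Dockerfile stage as a build arg, and the Go build step is expected to honor it (`GOARCH=$TARGETARCH`); if the binary is compiled or copied without it, an amd64 binary can land in the arm64 image:

```sh
# Build both platform variants in one pass; each variant's build stage
# should compile with GOARCH=$TARGETARCH rather than the host arch.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t example.com/aws-network-policy-agent:dev \
  --push .

# Confirm the pushed manifest list actually carries both variants.
docker buildx imagetools inspect example.com/aws-network-policy-agent:dev
```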
Anything else we need to know?:
Environment:

- Kubernetes version (use `kubectl version`):
- CNI Version:
- Network Policy Agent Version:
- OS (e.g. `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):
FYI, it was originally showing `asm_arm64.s`, but at some point it just broke! You can see the error log from #135 showing `asm_arm64.s` in its log lines:

```
{"level":"info","ts":"2023-11-09T12:32:26.065Z","caller":"runtime/asm_arm64.s:1197","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
```
@guessi Are you saying the crash only happens if you set `enable-policy-event-logs` to `true`? If the release is incorrectly using an amd64 image on arm64 nodes, it should always fail and shouldn't be tied to one of the custom env variables.

Are you seeing this behavior with the latest VPC CNI version? Was it working fine with prior releases on your setup?
@achevuru Maybe I should provide more details.

TL;DR

It doesn't matter whether the flag is set or not; it's about where it runs, i.e. the architecture of the node.

- The version is not the key to the issue.
- It's about how the `aws-eks-nodeagent` image is built.
- With the flag set, on `x86_64` nodes, Pods reach the `RUNNING` state with no issue.
- With the flag set, on `arm64` nodes, Pods are always stuck in the `CrashLoopBackOff` state (a quick way to correlate Pods with node architecture is sketched after this list).
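For reference, this is the kind of check I'd use to correlate each agent Pod with its node's CPU architecture; a sketch that assumes the standard `k8s-app=aws-node` label on the VPC CNI DaemonSet:

```sh
# List aws-node Pods together with the node each one landed on.
kubectl -n kube-system get pods -l k8s-app=aws-node \
  -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase'

# List each node's reported CPU architecture (amd64 vs arm64).
kubectl get nodes \
  -o custom-columns='NODE:.metadata.name,ARCH:.status.nodeInfo.architecture'
```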
Full story

Tested with the following combinations:

- Amazon EKS 1.25 (Platform version: `eks.15`, `eks.17`)
  - `eks.17` is the latest platform version of Amazon EKS 1.25, and it should meet the minimum requirements stated HERE.
- AMI: retrieved from SSM (AL2); ref: HERE.
- Amazon VPC CNI `v1.15.1-eksbuild.1`, `v1.15.5-eksbuild.1`, `v1.16.4-eksbuild.2`, ...
  - It was originally running `v1.15.1-eksbuild.1` with no flag set; I tried upgrading one minor version at a time.
  - Tested the latest versions of v1.15.x, v1.16.x, v1.17.x, v1.18.x; all the same: with no flag set, everything works fine.
  - With the flag set, only `x86_64` nodes spawned successfully; all `arm64` nodes failed.
- Configuration passed to the Managed Addon (vpc-cni), see the CLI sketch after this list:

  ```
  {"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true","enableCloudWatchLogs":"true"}}
  ```

- Instance families used for testing:
  - x86_64: `t3a`
  - arm64: `t4g`
- No `CNINode` defined yet.
- No `SecurityGroupPolicy` defined yet.
- No `NetworkPolicy` defined yet.
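For completeness, one way to pass such configuration values to the managed addon; a sketch where the cluster name and region are placeholders:

```sh
# Apply the nodeAgent configuration values to the managed vpc-cni addon.
aws eks update-addon \
  --cluster-name my-cluster \
  --region us-east-1 \
  --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true","enableCloudWatchLogs":"true"}}'
```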
Just follow the doc with a Graviton node running and you should see what I mean; I believe you can easily reproduce the `CrashLoopBackOff` loop on `arm64` nodes. What I provided above is the minimal setup needed to reproduce the issue.
@guessi Understood, but my question was more about the statements below from you:

> Tested the latest versions of v1.15.x, v1.16.x, v1.17.x, v1.18.x; all the same: with no flag set, everything works fine.

> With the flag set, on arm64 nodes, Pods are always stuck in the CrashLoopBackOff state.

So, it appears the NP agent is working fine for you if the `enable-policy-event-logs` flag is not set, even on Graviton instances. If true, then this should not be tied to an incorrect-arch binary being used on arm64 nodes. The flag you're setting just enables logs and has nothing to do with Network Policy functionality.

Anyway, we will also try it and let you know.
Synced up internally with @guessi; the above issue is due to missing CloudWatch permissions, as the cluster also had `enable-cloudwatch-logs` set. The issue resolved itself once the relevant permissions were provided. We will look for better ways to expose the error message to the end user. Right now, the NP agent logs will show 403s against CloudWatch APIs.
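A quick way to surface those failures today; a sketch that greps for assumed error patterns, since the exact wording in the agent logs may differ:

```sh
# Scan the node agent container logs for CloudWatch authorization failures.
kubectl -n kube-system logs -l k8s-app=aws-node -c aws-eks-nodeagent \
  | grep -Ei 'accessdenied|403|cloudwatch'
```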
@achevuru Thanks for the update. I could now narrow the issue down to the difference between the setups below.

The working one:

```
{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true"}}
```

The non-working one:

```
{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true","enableCloudWatchLogs":"true"}}
```

Diving deeper into the issue, I found it was an IAM policy setup issue. After adding the missing IAM policies, everything works as expected (a sketch of the permissions involved follows).
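For reference, a minimal sketch of the CloudWatch Logs permissions involved; the policy name and broad `Resource` here are illustrative, and the target role depends on whether the agent uses IRSA or the node instance role:

```sh
# Attach an inline policy granting the node agent CloudWatch Logs access.
aws iam put-role-policy \
  --role-name MyNodeOrIrsaRole \
  --policy-name np-agent-cloudwatch-logs \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogGroups",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }]
  }'
```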
Post-incident suggestions

Following the guidance HERE, the IAM policy setup in the doc currently comes "after" the step that enables `enableCloudWatchLogs`, not "before" (it should be mentioned before the flag is enabled). It's really hard to identify the issue when no logs are emitted.