`amd64` binary wrongly copied into `arm64` image, causing Pods to fall into `CrashLoopBackOff` state
guessi opened this issue
What happened:

Seeing `asm_amd64.s` shown in the `arm64` image, which should be `asm_arm64.s`.

Attach logs:

n/a

What you expected to happen:

Pods on Graviton nodes should be `RUNNING`, not `CrashLoopBackOff`.
How to reproduce it (as minimally and precisely as possible):

- Start the agent with `--enable-policy-event-logs=true` set.
- Observe `CrashLoopBackOff` and identify that it ran on a Graviton node:

  ```
  $ kubectl -n kube-system get pods -l k8s-app=aws-node
  kube-system aws-node-qxld8 1/2 CrashLoopBackOff 6 (4m4s ago) ...
  ```

- Log into the node and check the image:

  ```
  # /usr/local/bin/nerdctl -n k8s.io image inspect 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon/aws-network-policy-agent:v1.0.7-eksbuild.1 | grep 'Architecture'
  "Architecture": "arm64", # <----------- I can see the image is "arm64".
  ```

- Check the error log for `aws-eks-nodeagent`:

  ```
  $ kubectl -n kube-system logs -f aws-node-qxld8 -c aws-eks-nodeagent
  {"level":"info","ts":"2024-04-07T08:13:07.999Z","caller":"runtime/asm_amd64.s:1650","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
  ^^^^^^^^^
  ```

  But is it normal to see `asm_amd64.s` here?

- Possibly a missing cross-arch file copy in the Dockerfile or Makefile (see the build sketch after this list):
  - https://github.com/aws/aws-network-policy-agent/blob/main/Dockerfile
  - https://github.com/aws/aws-network-policy-agent/blob/main/Makefile#L185-L206
- Removing `--enable-policy-event-logs=true` (or setting it to `false`) brings Pods back to the `RUNNING` state.
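For context, here is a minimal sketch of how a cross-arch Go image build is usually wired; this is an assumption about the build setup, not the project's actual commands, and `example.com/...:dev` is a placeholder tag. With buildx, `TARGETARCH` is exposed to each platform's Dockerfile stage as a build arg, and the Go build step is expected to honor it (`GOARCH=$TARGETARCH`); if the binary is compiled or copied without it, an amd64 binary can land in the arm64 image:

```sh
# Build both platform variants in one pass; each variant's build stage
# should compile with GOARCH=$TARGETARCH rather than the host arch.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t example.com/aws-network-policy-agent:dev \
  --push .

# Confirm the pushed manifest list actually carries both variants.
docker buildx imagetools inspect example.com/aws-network-policy-agent:dev
```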
Anything else we need to know?:
Environment:

- Kubernetes version (use `kubectl version`):
- CNI Version:
- Network Policy Agent Version:
- OS (e.g. `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):
FYI, it was originally showing `asm_arm64.s`, but at some point it just broke! You can see the error log from #135 showing `asm_arm64.s` in its log lines:

```
{"level":"info","ts":"2023-11-09T12:32:26.065Z","caller":"runtime/asm_arm64.s:1197","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
```
@guessi Are you saying the crash only happens if you set `enable-policy-event-logs` to `true`? If the release is incorrectly using an amd64 image on arm64 nodes, it should always fail and shouldn't be tied to one of the custom env variables.

Are you seeing this behavior with the latest VPC CNI version? Was it working fine with prior releases on your setup?
@achevuru Maybe I should provide more details.

TL;DR

It doesn't matter whether the flag is set or not; it's about where it runs, i.e. the architecture of the node.

- The version is not the key to the issue.
- It's about how the `aws-eks-nodeagent` image is built.
- With the flag set, on `x86_64` nodes, Pods reach the `RUNNING` state with no issue.
- With the flag set, on `arm64` nodes, Pods are always stuck in the `CrashLoopBackOff` state (a quick way to correlate Pods with node architecture is sketched after this list).
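For reference, this is the kind of check I'd use to correlate each agent Pod with its node's CPU architecture; a sketch that assumes the standard `k8s-app=aws-node` label on the VPC CNI DaemonSet:

```sh
# List aws-node Pods together with the node each one landed on.
kubectl -n kube-system get pods -l k8s-app=aws-node \
  -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase'

# List each node's reported CPU architecture (amd64 vs arm64).
kubectl get nodes \
  -o custom-columns='NODE:.metadata.name,ARCH:.status.nodeInfo.architecture'
```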
Full story

Tested with the following combinations:

- Amazon EKS 1.25 (Platform version: `eks.15`, `eks.17`)
  - `eks.17` is the latest platform version of Amazon EKS 1.25, and it should meet the minimum requirements stated HERE.
- AMI: retrieved from SSM (AL2); ref: HERE.
- Amazon VPC CNI `v1.15.1-eksbuild.1`, `v1.15.5-eksbuild.1`, `v1.16.4-eksbuild.2`, ...
  - It was originally running `v1.15.1-eksbuild.1` with no flag set; I tried upgrading one minor version at a time.
  - Tested the latest versions of v1.15.x, v1.16.x, v1.17.x, v1.18.x; all the same: with no flag set, everything works fine.
  - With the flag set, only `x86_64` nodes spawned successfully; all `arm64` nodes failed.
- Configuration passed to the Managed Addon (vpc-cni), see the CLI sketch after this list:

  ```
  {"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true","enableCloudWatchLogs":"true"}}
  ```

- Instance families used for testing:
  - x86_64: `t3a`
  - arm64: `t4g`
- No `CNINode` defined yet.
- No `SecurityGroupPolicy` defined yet.
- No `NetworkPolicy` defined yet.
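For completeness, one way to pass such configuration values to the managed addon; a sketch where the cluster name and region are placeholders:

```sh
# Apply the nodeAgent configuration values to the managed vpc-cni addon.
aws eks update-addon \
  --cluster-name my-cluster \
  --region us-east-1 \
  --addon-name vpc-cni \
  --configuration-values '{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true","enableCloudWatchLogs":"true"}}'
```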
Just follow the doc with a Graviton node running and you should see what I mean; I believe you can easily reproduce the `CrashLoopBackOff` loop on `arm64` nodes. What I provided above is the minimal setup needed to reproduce the issue.
@guessi Understood, but my question was more about the statements below from you:

> Tested the latest versions of v1.15.x, v1.16.x, v1.17.x, v1.18.x; all the same: with no flag set, everything works fine.

> With the flag set, on arm64 nodes, Pods are always stuck in the CrashLoopBackOff state.

So, it appears the NP agent is working fine for you if the `enable-policy-event-logs` flag is not set, even on Graviton instances. If true, then this should not be tied to an incorrect-arch binary being used on arm64 nodes. The flag you're setting just enables logs and has nothing to do with Network Policy functionality.

Anyway, we will also try it and let you know.
Synced up internally with @guessi; the above issue is due to missing CloudWatch permissions, as the cluster also had `enable-cloudwatch-logs` set. The issue resolved itself once the relevant permissions were provided. We will look for better ways to expose the error message to the end user. Right now, the NP agent logs will show 403s against CloudWatch APIs.
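A quick way to surface those failures today; a sketch that greps for assumed error patterns, since the exact wording in the agent logs may differ:

```sh
# Scan the node agent container logs for CloudWatch authorization failures.
kubectl -n kube-system logs -l k8s-app=aws-node -c aws-eks-nodeagent \
  | grep -Ei 'accessdenied|403|cloudwatch'
```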
@achevuru Thanks for the update. I could now narrow the issue down to the difference between the setups below.

The working one:

```
{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true"}}
```

The non-working one:

```
{"enableNetworkPolicy":"true","nodeAgent":{"enablePolicyEventLogs":"true","enableCloudWatchLogs":"true"}}
```

Diving deeper into the issue, I found it was an IAM policy setup issue. After adding the missing IAM policies, everything works as expected (a sketch of the permissions involved follows).
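For reference, a minimal sketch of the CloudWatch Logs permissions involved; the policy name and broad `Resource` here are illustrative, and the target role depends on whether the agent uses IRSA or the node instance role:

```sh
# Attach an inline policy granting the node agent CloudWatch Logs access.
aws iam put-role-policy \
  --role-name MyNodeOrIrsaRole \
  --policy-name np-agent-cloudwatch-logs \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:DescribeLogGroups",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }]
  }'
```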
Post-incident suggestions

Following the guidance HERE, the IAM policy setup in the doc currently comes "after" the step that enables `enableCloudWatchLogs`, not "before" (it should be mentioned before the flag is enabled). It's really hard to identify the issue when no logs are emitted.