aws / aws-network-policy-agent


aws-eks-nodeagent container logs errors on startup and shutdown

rtomadpg opened this issue

What happened:

After upgrading VPC-CNI from v1.14.1-eksbuild.1 to v1.15.4-eksbuild.1, all the aws-eks-nodeagent containers logged:

aws-node-np4cq aws-eks-nodeagent 2023-12-06 16:14:59.823264484 +0000 UTC Logger.check error: failed to get caller

And, when I delete a random aws-node pod, I see this:

aws-node-sdp94 aws-eks-nodeagent 2023-12-06 16:25:56.131300614 +0000 UTC Logger.check error: failed to get caller
aws-node-sdp94 aws-eks-nodeagent 2023-12-06 16:25:56.131410269 +0000 UTC Logger.check error: failed to get caller
aws-node-sdp94 aws-eks-nodeagent 2023-12-06 16:25:56.131480895 +0000 UTC Logger.check error: failed to get caller
aws-node-sdp94 aws-eks-nodeagent 2023-12-06 16:25:56.131594396 +0000 UTC Logger.check error: failed to get caller
aws-node-sdp94 aws-eks-nodeagent 2023-12-06 16:25:56.131647113 +0000 UTC Logger.check error: failed to get caller
aws-node-sdp94 aws-eks-nodeagent 2023-12-06 16:25:56.131669285 +0000 UTC Logger.check error: failed to get caller
aws-node-sdp94 aws-eks-nodeagent 2023-12-06 16:25:56.131694685 +0000 UTC Logger.check error: failed to get caller
aws-node-sdp94 aws-eks-nodeagent 2023-12-06 16:25:56.13179858 +0000 UTC Logger.check error: failed to get caller

I believe these errors come from the uber-go/zap dependency; see https://github.com/uber-go/zap/blob/5acd569b6a5264d4c7433cbb278a8336d491715c/logger.go#L398
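
For illustration, here is a minimal Go sketch (not the agent's actual logger setup) of how zap can end up printing that message: when caller annotation is enabled but the configured caller skip walks past the top of the call stack, Logger.check cannot resolve the call site and writes the error to its internal error output while still emitting the log entry.

package main

import "go.uber.org/zap"

func main() {
    // Sketch only: AddCallerSkip(100) is a deliberately absurd value used to
    // force the runtime caller lookup to fail; the agent's real configuration
    // differs, and the resulting message is harmless either way.
    logger, err := zap.NewProduction(
        zap.AddCaller(),
        zap.AddCallerSkip(100),
    )
    if err != nil {
        panic(err)
    }
    defer logger.Sync()

    // Prints something like
    //   2023-12-06 16:14:59.823264484 +0000 UTC Logger.check error: failed to get caller
    // to the logger's error output (stderr by default), alongside the normal entry.
    logger.Info("hello")
}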

Since I am unsure whether this error signals something genuinely wrong, and it has not been reported in this project before, I created this bug report.

Attach logs

Let me know if needed.

What you expected to happen:

No errors getting logged.

How to reproduce it (as minimally and precisely as possible):

  • Upgrade to the mentioned version
  • Check the aws-node pod logs
  • Or, delete an aws-node pod; the new pod will log the errors (example commands below).
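
A hedged example of the "check logs" and "delete a pod" steps; the kube-system namespace, the k8s-app=aws-node label, and the container name follow the default EKS install and may differ in customized clusters:

kubectl -n kube-system logs -l k8s-app=aws-node -c aws-eks-nodeagent --tail=50
kubectl -n kube-system delete pod <any aws-node pod>
kubectl -n kube-system logs <replacement aws-node pod> -c aws-eks-nodeagent -f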

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.27.7-eks-4f4795d
  • CNI Version: v1.15.4-eksbuild.1
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a):
Linux <hostname redacted> 5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

@rtomadpg just curious, did you notice the comment with:

For Network Policy issues, please file at https://github.com/aws/aws-network-policy-agent/issues

when you opened this issue? We are trying to improve the experience here with triaging Network Policy agent issues, so I am wondering if you think there is a better way this could have been noticed.

As for this issue, this is the same as #103. This error log is harmless, and a fix is in progress.

Ouch, so sorry! I checked the new bug flow and indeed that comment is there. Very clearly.
I guess I was too eager to file the bug (end of work day here) and I overlooked that part.

@jdn5126 maybe a suggestion: when errors are logged by a container named "aws-eks-nodeagent", it's not immediately clear that they relate to "Network Policy issues" or "aws-network-policy-agent". Perhaps mentioning "aws-eks-nodeagent" in that comment would reduce wrongly filed issues?


Oh no worries, I was just curious if there was a better setup through GitHub. Good call, I can expand the comment.

Hi everyone, sorry for jumping in on a closed thread.

I'm facing the same issue, but without the network policy error mentioned here.
I'm trying to upgrade a managed worker group to 1.25, but the aws-node DaemonSet keeps failing in the aws-eks-nodeagent container, causing the pod to restart.

Any ideas?
The VPC CNI plugin version is v1.15.1-eksbuild.1.

@lsabreu96 the error log from this issue is harmless. If you are seeing the aws-eks-nodeagent container crashing, please file a new issue with the logs from the crash, which you can find in /var/log/aws-routed-eni/network-policy-agent.log on the affected node.
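
If you cannot SSH to the node, one way to read that file is a node debug pod (assumes a reasonably recent kubectl with "kubectl debug node" support; the node name and image are placeholders, and the node's filesystem is mounted under /host inside the debug pod):

kubectl debug node/<node-name> -it --image=busybox -- cat /host/var/log/aws-routed-eni/network-policy-agent.log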

For anyone reaching this thread because the aws-eks-nodeagent container is crashing with "UTC Logger.check error: failed to get caller": for me the issue was mixing EKS Kubernetes version 1.24 with aws-network-policy-agent:v1.0.4-eksbuild.1 and amazon-k8s-cni:v1.15.1-eksbuild.1 (these versions were automatically provisioned by EKS). Upgrading to Kubernetes 1.25 fixed the container crash loop, as mentioned in the README of this repo ("You'll need a Kubernetes cluster version 1.25+ to run against.").

So I'm not commenting to reopen this issue, just providing information in case anyone still running 1.24 lands here!