aws / aws-network-policy-agent

VPC CNI plugin crashing when enabling cloudwatch logs for network policy logs

mahasiva-amazon opened this issue

What happened:

  1. Created a cluster with the VPC CNI plugin and network policy enabled.
  2. Added permissions to the service role to enable CloudWatch logging, as described here: https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html
  3. Then used the aws eks update-addon CLI command to enable CloudWatch logging:

aws eks update-addon --cluster-name ${EKS_CLUSTER_NAME} --addon-name "vpc-cni" --configuration-values '{"env":{"ENABLE_PREFIX_DELEGATION":"true", "ENABLE_POD_ENI":"true", "POD_SECURITY_GROUP_ENFORCING_MODE":"standard"},"enableNetworkPolicy": "true", "nodeAgent": { "enableCloudWatchLogs": "true", "healthProbeBindAddr": "8163", "metricsBindAddr": "8162"}}'

  4. After this command, the aws-node daemonset pods start crashing, and further analysis suggests that the aws-node-agent containers in the pods are the ones crashing. The issue does not go away even if we delete the add-on and install it again.

Attach logs

Normal Scheduled 52s default-scheduler Successfully assigned kube-system/aws-node-45nmc to ip-XXXX.us-west-2.compute.internal
Normal Pulling 52s kubelet Pulling image "XXXX.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.14.1-eksbuild.1"
Normal Pulled 49s kubelet Successfully pulled image "XXXX.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.14.1-eksbuild.1" in 2.696970025s (2.696982854s including waiting)
Normal Created 49s kubelet Created container aws-vpc-cni-init
Normal Started 49s kubelet Started container aws-vpc-cni-init
Normal Pulling 48s kubelet Pulling image "XXXX.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.14.1-eksbuild.1"
Normal Pulled 46s kubelet Successfully pulled image "XXXX.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.14.1-eksbuild.1" in 1.550764534s (1.550796824s including waiting)
Normal Created 46s kubelet Created container aws-node
Normal Started 46s kubelet Started container aws-node
Normal Pulling 46s kubelet Pulling image "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.2-eksbuild.1"
Normal Pulled 33s kubelet Successfully pulled image "XXXX.dkr.ecr.us-west-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.2-eksbuild.1" in 13.02422571s (13.02424398s including waiting)
Normal Created 33s kubelet Created container aws-eks-nodeagent
Normal Started 33s kubelet Started container aws-eks-nodeagent
Warning Unhealthy 28s kubelet Readiness probe failed: {"level":"info","ts":"2023-11-21T18:27:18.910Z","caller":"/root/sdk/go1.20.4/src/runtime/proc.go:250","msg":"timeout: failed to connect service ":50051" within 5s"}
Warning Unhealthy 23s kubelet Readiness probe failed: {"level":"info","ts":"2023-11-21T18:27:23.969Z","caller":"/root/sdk/go1.20.4/src/runtime/proc.go:250","msg":"timeout: failed to connect service ":50051" within 5s"}
Warning Unhealthy 17s kubelet Readiness probe failed: {"level":"info","ts":"2023-11-21T18:27:29.021Z","caller":"/root/sdk/go1.20.4/src/runtime/proc.go:250","msg":"timeout: failed to connect service ":50051" within 5s"}
Warning Unhealthy 12s kubelet Readiness probe failed: {"level":"info","ts":"2023-11-21T18:27:34.077Z","caller":"/root/sdk/go1.20.4/src/runtime/proc.go:250","msg":"timeout: failed to connect service ":50051" within 5s"}
Warning Unhealthy 7s kubelet Readiness probe failed: {"level":"info","ts":"2023-11-21T18:27:39.591Z","caller":"/root/sdk/go1.20.4/src/runtime/proc.go:250","msg":"timeout: failed to connect service ":50051" within 5s"}
What you expected to happen:

  1. The add-on to be updated with correct logging configuration.

How to reproduce it (as minimally and precisely as possible):

Refer to the earlier section.

Anything else we need to know?:

N/A

Environment:

  • Kubernetes version (use kubectl version): 1.27
  • CNI Version - v1.15.3-eksbuild.1
  • Network Policy Agent Version - v1.01
  • OS (e.g: cat /etc/os-release): Amazon Linux
  • Kernel (e.g. uname -a): Linux ..... 5.10.186-179.751.amzn2.x86_64 #1 SMP Tue Aug 1 20:51:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

I've faced the same issue. The documentation is not clear on this point. Looking at /var/log/aws-routed-eni/ipamd.log on the node, it seems to be an authorization issue:

{"level":"error","ts":"2023-12-06T13:51:53.616Z","caller":"ipamd/ipamd.go:457","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-03****** eni-07********]: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: df464bdf-eb18-4b85-*******"}
{"level":"error","ts":"2023-12-06T13:51:53.727Z","caller":"aws-k8s-agent/main.go:32","msg":"Initialization failure: ipamd init: failed to retrieve attached ENIs info: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: df464bdf-****"}

I resolved it by attaching the AmazonEKS_CNI_Policy permissions to my role.
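
For anyone hitting the same AccessDenied on sts:AssumeRoleWithWebIdentity, here is a minimal sketch of two checks that may help narrow it down: which IAM role the aws-node service account is annotated with, and what that role's trust policy allows. The role name is a placeholder, the function names are only illustrative, and the sketch assumes kubectl and the AWS CLI are already configured for the cluster.

```shell
# Placeholder (assumption): the IAM role you expect the VPC CNI to use.
CNI_ROLE_NAME="${CNI_ROLE_NAME:-my-vpc-cni-role}"

# Print the role ARN annotated on the aws-node service account.
# Empty output likely means the pods inherit the node role instead.
show_aws_node_role_annotation() {
  kubectl -n kube-system get serviceaccount aws-node \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}

# Print the trust policy of the role, to confirm that it allows
# sts:AssumeRoleWithWebIdentity for the kube-system/aws-node service account.
show_cni_role_trust_policy() {
  aws iam get-role \
    --role-name "${CNI_ROLE_NAME}" \
    --query 'Role.AssumeRolePolicyDocument' \
    --output json
}
```

Running show_aws_node_role_annotation first and then show_cni_role_trust_policy should show whether the role in the annotation is the one that actually trusts the service account.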

Hi @ariary, can you give more details on how you ended up with the issue? Which role did you end up adding the permission to (the node role or the CNI add-on role)? The issue above happened because the create/update add-on call did not pass the service-role-arn to use for the CNI.

I have created a specific role with permissions for the policy I mentioned above + the one which is defined in the documentation (for cloud watch log)
For this role, I checked that the aws-node service account can assume it (cf. the trust relationship in the UI).
Then you can update your add-on by specifying the add-on role ARN (--service-account-role-arn).

Note also that, to get logs, you also need "enablePolicyEventLogs": "true" in your node agent configuration.
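
Put together, the update could look roughly like the sketch below. The cluster name and role ARN are placeholders, and the function name is only illustrative. The policy-log key is written as enablePolicyEventLogs, which is the spelling used in later replies in this thread.

```shell
# Placeholders (assumptions): replace with your own cluster name and role ARN.
EKS_CLUSTER_NAME="${EKS_CLUSTER_NAME:-my-cluster}"
CNI_ROLE_ARN="${CNI_ROLE_ARN:-arn:aws:iam::111122223333:role/my-vpc-cni-role}"

# Add-on configuration: network policy on, CloudWatch logging on,
# and policy event (accept/deny) logs on.
CONFIG_VALUES='{"enableNetworkPolicy":"true","nodeAgent":{"enableCloudWatchLogs":"true","enablePolicyEventLogs":"true"}}'

# Update the vpc-cni add-on, passing the role explicitly so that the
# aws-node service account does not fall back to the node role.
update_vpc_cni_addon() {
  aws eks update-addon \
    --cluster-name "${EKS_CLUSTER_NAME}" \
    --addon-name vpc-cni \
    --service-account-role-arn "${CNI_ROLE_ARN}" \
    --configuration-values "${CONFIG_VALUES}"
}
```

Call update_vpc_cni_addon once the role carries both AmazonEKS_CNI_Policy and the CloudWatch log permissions.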

I have created a specific role with permissions for the policy I mentioned above + the one which is defined in the documentation

So if I understand this correctly: you created a new role and added the CloudWatch log policy to it for network policy logs. The CNI then complained about not having the right authorization, which is when you added the AmazonEKS_CNI_Policy?

Exactly

Thanks for the details. We recommend adding the CloudWatch log policy to the existing CNI IAM role (which would already have the AmazonEKS_CNI_Policy attached). This is also called out in the prerequisites section of the docs here:

https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html#network-policies-troubleshooting
Add the following permissions as a stanza or separate policy to the IAM role that you are using for the VPC CNI.
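
As a rough sketch of that recommendation, the permissions could be added as an inline policy on the existing CNI role. The role and policy names are placeholders, and the action list is my recollection of the linked documentation, so please verify it against the page before using it.

```shell
# Placeholders (assumptions): the existing VPC CNI role and an inline policy name.
CNI_ROLE_NAME="${CNI_ROLE_NAME:-my-vpc-cni-role}"
POLICY_NAME="${POLICY_NAME:-network-policy-cloudwatch-logs}"

# CloudWatch Logs permissions for the node agent.
# Assumption: check this action list against the linked docs.
POLICY_DOCUMENT='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogGroups",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}'

# Attach the policy inline to the role that already has AmazonEKS_CNI_Policy.
attach_cloudwatch_policy() {
  aws iam put-role-policy \
    --role-name "${CNI_ROLE_NAME}" \
    --policy-name "${POLICY_NAME}" \
    --policy-document "${POLICY_DOCUMENT}"
}
```

Adding the permissions to the existing role, rather than creating a new role with only these actions, avoids the authorization failure described earlier in the thread.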

Let me know if this helps

@jaydeokar indeed! It might be helpful to specify which role is meant: with the "default" configuration, the add-on shows "Service account role: Inherited from node", which can lead users to create a new role with only the policy mentioned.

I am experiencing the same issue: I cannot enable CloudWatch logs. The aws-node-agent container goes into CrashLoopBackOff.
My VPC-CNI config:
{"enableNetworkPolicy":"true","nodeAgent":{"enableCloudWatchLogs":"true"}}
My VPC-CNI version:
v1.15.0-eksbuild.2
My EKS version:
1.28
I tried assigning IAM permissions directly to the add-on and inheriting them from the Kubernetes instances; same result. I used arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
This is the only log message I get in the aws-eks-nodeagent container:
{"level":"info","ts":"2024-01-03T17:16:14Z","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}

When I manually disable CloudWatch by editing the aws-node daemonset and overriding the CloudWatch switch, it starts working:
--enable-cloudwatch-logs=false
Here is the generated manifest for the VPC CNI driver:
aws-node.yaml.txt
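
For reference, a small sketch of how the flag and the crash output could be inspected with kubectl. It assumes the node agent flags are set under args in the daemonset spec, and the function names are only illustrative.

```shell
# Show the arguments passed to the aws-eks-nodeagent container,
# which should include the --enable-cloudwatch-logs flag.
show_nodeagent_args() {
  kubectl -n kube-system get daemonset aws-node \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="aws-eks-nodeagent")].args}'
}

# Show logs from the previous (crashed) run of the node agent in one pod.
# $1 is the name of an aws-node pod on the affected node.
show_nodeagent_crash_logs() {
  kubectl -n kube-system logs "$1" -c aws-eks-nodeagent --previous
}
```

Pass the name of one of the crashing aws-node pods to show_nodeagent_crash_logs.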

Hi @Mihail-blip
The accept/deny logs should be available in the /aws/eks/<cluster-name>/cluster CloudWatch log group. We don't log anything to stdout for the aws-eks-nodeagent container. Also make sure you have { "nodeAgent": {"enablePolicyEventLogs": "true"} } set in order for the agent to start logging the accept/deny logs.
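
A quick way to check whether that log group exists, as a sketch only. The cluster name is a placeholder, the function name is illustrative, and the log group path is the one given in the reply above.

```shell
# Placeholder (assumption): replace with your own cluster name.
EKS_CLUSTER_NAME="${EKS_CLUSTER_NAME:-my-cluster}"
LOG_GROUP="/aws/eks/${EKS_CLUSTER_NAME}/cluster"

# List log groups matching the expected name. Empty output suggests that
# the agent has not written any policy logs yet.
check_policy_log_group() {
  aws logs describe-log-groups \
    --log-group-name-prefix "${LOG_GROUP}" \
    --query 'logGroups[].logGroupName' \
    --output text
}
```

If check_policy_log_group prints nothing, the IAM permissions and the enablePolicyEventLogs setting are the first things to re-check.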

There do not seem to be any open items on this issue, so closing as resolved

@Mihail-blip, you need to include the CloudWatch permissions in your IAM role (https://docs.aws.amazon.com/eks/latest/userguide/cni-iam-role.html#cni-iam-role-create-role) or in the IAM role for EKS nodes. Additionally, make sure to configure { "nodeAgent": {"enablePolicyEventLogs": "true"} } (#129).