aws / aws-network-policy-agent

Observed a panic in reconciler when creating a single policy

badgerspoke opened this issue

What happened:
I applied this policy to a test cluster:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-metadata-access
  namespace: gitlab-ci
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32

Attach logs

...
{"level":"info","ts":"2024-03-08T03:56:22.210Z","logger":"controllers.policyEndpoints","caller":"runtime/proc.go:267","msg":"ConntrackTTL","cleanupPeriod":300}
{"level":"info","ts":"2024-03-08T03:56:22.210Z","logger":"setup","caller":"runtime/asm_amd64.s:1650","msg":"starting manager"}
{"level":"info","ts":"2024-03-08T03:56:22.210Z","logger":"controller-runtime.metrics","caller":"runtime/asm_amd64.s:1650","msg":"Starting metrics server"}
{"level":"info","ts":"2024-03-08T03:56:22.210Z","logger":"setup","msg":"Serving metrics on ","port":61680}
{"level":"info","ts":"2024-03-08T03:56:22.210Z","logger":"controller-runtime.metrics","caller":"runtime/asm_amd64.s:1650","msg":"Serving metrics server","bindAddress":":8162","secure":false}
{"level":"info","ts":"2024-03-08T03:56:22.211Z","caller":"runtime/asm_amd64.s:1650","msg":"starting server","kind":"health probe","addr":"[::]:8163"}
{"level":"info","ts":"2024-03-08T03:56:22.211Z","caller":"manager/runnable_group.go:223","msg":"Starting EventSource","controller":"policyendpoint","controllerGroup":"networking.k8s.aws","controllerKind":"PolicyEndpoint","source":"kind source: *v1alpha1.PolicyEndpoint"}
{"level":"info","ts":"2024-03-08T03:56:22.211Z","caller":"manager/runnable_group.go:223","msg":"Starting Controller","controller":"policyendpoint","controllerGroup":"networking.k8s.aws","controllerKind":"PolicyEndpoint"}
{"level":"info","ts":"2024-03-08T03:56:22.319Z","caller":"manager/runnable_group.go:223","msg":"Starting workers","controller":"policyendpoint","controllerGroup":"networking.k8s.aws","controllerKind":"PolicyEndpoint","worker count":1}
{"level":"info","ts":"2024-03-08T03:56:22.319Z","logger":"controllers.policyEndpoints","caller":"controller/controller.go:316","msg":"Received a new reconcile request","req":{"name":"ingress-nginx-admission-m78zj","namespace":"ingress-nginx"}}
{"level":"info","ts":"2024-03-08T03:56:22.319Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:127","msg":"Processing Policy Endpoint  ","Name: ":"ingress-nginx-admission-m78zj","Namespace ":"ingress-nginx"}
{"level":"info","ts":"2024-03-08T03:56:22.319Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:146","msg":"Parent NP resource:","Name: ":"ingress-nginx-admission"}
{"level":"info","ts":"2024-03-08T03:56:22.319Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:196","msg":"Found another PE resource for the parent NP","name":"ingress-nginx-admission-m78zj"}
{"level":"info","ts":"2024-03-08T03:56:22.319Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:146","msg":"Total PEs for Parent NP:","Count: ":1}
{"level":"info","ts":"2024-03-08T03:56:22.319Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:146","msg":"Derive PE Object ","Name ":"ingress-nginx-admission-m78zj"}
{"level":"info","ts":"2024-03-08T03:56:22.319Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:146","msg":"Processing PE ","Name ":"ingress-nginx-admission-m78zj"}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controller/controller.go:316","msg":"Received a new reconcile request","req":{"name":"deny-metadata-access-7vdfq","namespace":"gitlab-ci"}}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:127","msg":"Processing Policy Endpoint  ","Name: ":"deny-metadata-access-7vdfq","Namespace ":"gitlab-ci"}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:146","msg":"Parent NP resource:","Name: ":"deny-metadata-access"}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:196","msg":"Found another PE resource for the parent NP","name":"deny-metadata-access-7vdfq"}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:146","msg":"Total PEs for Parent NP:","Count: ":1}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:146","msg":"Derive PE Object ","Name ":"deny-metadata-access-7vdfq"}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:146","msg":"Processing PE ","Name ":"deny-metadata-access-7vdfq"}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:196","msg":"Found a matching Pod: ","name: ":"cluster-autoscaler-ci-68594765bc-6ll69","namespace: ":"gitlab-ci"}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:196","msg":"Derived ","Pod identifier: ":"cluster-autoscaler-ci-68594765bc-gitlab-ci"}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:451","msg":"Current PE Count for Parent NP:","Count: ":1}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:196","msg":"Found a matching Pod: ","name: ":"gitlab-runner-cluster-operator-platform-test-cluster-6964f77vs68","namespace: ":"gitlab-ci"}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:196","msg":"Derived ","Pod identifier: ":"gitlab-runner-cluster-operator-platform-test-cluster-gitlab-ci"}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:451","msg":"Current PE Count for Parent NP:","Count: ":1}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:146","msg":"Total number of PolicyEndpoint resources for","podIdentifier ":"cluster-autoscaler-ci-68594765bc-gitlab-ci"," are ":1}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:146","msg":"Deriving Firewall rules for PolicyEndpoint:","Name: ":"deny-metadata-access-7vdfq"}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:146","msg":"Total no.of - ","ingressRules":0,"egressRules":1}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:127","msg":"No Ingress rules and no ingress isolation - Appending catch all entry"}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","logger":"controllers.policyEndpoints","caller":"controllers/policyendpoints_controller.go:146","msg":"Processing Pod: ","name:":"cluster-autoscaler-ci-68594765bc-6ll69","namespace:":"gitlab-ci","podIdentifier: ":"cluster-autoscaler-ci-68594765bc-gitlab-ci"}
{"level":"info","ts":"2024-03-08T03:58:45.670Z","caller":"runtime/panic.go:261","msg":"Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference","controller":"policyendpoint","controllerGroup":"networking.k8s.aws","controllerKind":"PolicyEndpoint","PolicyEndpoint":{"name":"deny-metadata-access-7vdfq","namespace":"gitlab-ci"},"namespace":"gitlab-ci","name":"deny-metadata-access-7vdfq","reconcileID":"6114f7d5-251e-4547-aa96-dbe5b1700379"}

What you expected to happen:

TBH I thought it would apply the policy without much fanfare. The policy does not appear to have been applied: it definitely didn't block IMDS connectivity as I was expecting.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.28.3
  • CNI Version 1.16.0
  • Network Policy Agent Version 1.0.7
  • OS (e.g: cat /etc/os-release): AL2
  • Kernel (e.g. uname -a): 5.10.192-183.736.amzn2.x86_64

It appears that you haven't enabled the network policy feature in the node agent container: https://github.com/aws/aws-network-policy-agent?tab=readme-ov-file#enable-network-policy. The feature is disabled by default.

But I have the VPC CNI installed, from manifest, and this config in the CM:

% kc -n kube-system get cm amazon-vpc-cni -o yaml|yq .data
branch-eni-cooldown: "60"
enable-network-policy-controller: "true"
enable-windows-ipam: "false"
enable-windows-prefix-delegation: "false"
minimum-ip-target: "3"
warm-ip-target: "1"
warm-prefix-target: "0"

Do I need to do something else in addition?

The docs are confusing, both here and at AWS: each says you need to install something, but also states that on an EKS cluster version greater than x that install is done automatically.

As far as I can make out, the NPC comes for 'free' with a recent AWS VPC CNI (which I have), but you need to toggle it via the CM, which I've done. Is there a guide without contradictions that I can follow to enable this in EKS?
Not to mention, surely it shouldn't panic if something isn't fully turned on? This doesn't feel super robust.

All the required components are installed by default but, as I said above, the feature is disabled by default. You need to turn the feature on in two places: in the Network Policy controller running in the EKS control plane (via the ConfigMap option) and in the local node agent (via the flag I linked). Refer to this doc: https://docs.aws.amazon.com/eks/latest/userguide/cni-network-policy.html. If you're using managed add-ons or Helm, you only need to enable the network policy flag provided there, and that will enable it across all the required components.

Now, the above scenario occurs when the feature is enabled in the ConfigMap but disabled in the local node agent. Instead of leaving the node/feature in an undefined state, we decided to fail hard in these unsupported config scenarios. The reason behind that decision is our prior experience with similar scenarios in VPC CNI, where we spent a lot of time debugging weird issues only to realize at the end that the user had an incorrect config. So we decided to adopt this approach for NP.
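For anyone else landing here, the mismatch described above can be sketched as a small check over the two settings. This is only an illustration, not code from the agent or the controller: the ConfigMap key is the one shown earlier in this thread, the flag name comes from the linked README section, and the `--enable-network-policy=true` argument form, the function names, and the idea of passing the agent container's args as a list are assumptions made for the example.

```python
def network_policy_toggles(configmap_data, agent_args):
    """Return (controller_enabled, agent_enabled) for the two settings.

    configmap_data: the `data` section of the amazon-vpc-cni ConfigMap.
    agent_args: the args list of the node agent container in the DaemonSet
                (assumed form: "--enable-network-policy=true").
    """
    # Both settings are treated as disabled when absent, matching the
    # "disabled by default" behaviour described in this thread.
    controller_enabled = (
        configmap_data.get("enable-network-policy-controller", "false").lower() == "true"
    )

    agent_enabled = False
    for arg in agent_args:
        name, _, value = arg.partition("=")
        if name == "--enable-network-policy":
            # Assumption: a bare boolean flag with no value counts as true.
            agent_enabled = value.lower() in ("", "true")
    return controller_enabled, agent_enabled


def describe(configmap_data, agent_args):
    """Summarise the combined state in one human-readable line."""
    controller_enabled, agent_enabled = network_policy_toggles(configmap_data, agent_args)
    if controller_enabled and not agent_enabled:
        return "mismatch: enabled in the ConfigMap but disabled in the node agent"
    if agent_enabled and not controller_enabled:
        return "mismatch: enabled in the node agent but disabled in the ConfigMap"
    return "consistent: enabled" if controller_enabled else "consistent: disabled"
```

Feeding it the ConfigMap data posted above together with an args list that lacks the flag reports the first mismatch, which is the state this issue was in.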

Ah right, ok, thanks. You are of course correct: I'd missed the enable-network-policy setting on the DS. With both toggles enabled it works.

And I get your point about how the plugin responds to misconfiguration, especially as I've seen in other issues how removing/unwinding network policy in the wrong order can create chaos. I'm on board with the approach, but maybe add a log line before the panic stating that the configuration is incorrect or inconsistent? The panic and subsequent boot loop are great for surfacing that cluster admins need to do something, but it's not super clear what that something is if it just falls over constantly. Or else something in the docs (which perhaps I missed) stating that it will fall over if the options/config are incorrect or contradictory?
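To make the suggestion concrete, here is a rough sketch of what "log a clear message, then fail hard" could look like. It is written in Python purely for illustration and is not how the agent is implemented; the logger name, the exception class, and the function name are all hypothetical, and the message wording is just an example.

```python
import logging

logger = logging.getLogger("network-policy-agent-sketch")  # hypothetical logger name


class ConfigMismatchError(RuntimeError):
    """Hypothetical error for contradictory network policy settings."""


def check_config_or_fail(controller_enabled: bool, agent_enabled: bool) -> None:
    """Fail hard on the unsupported combination, but explain why first."""
    if controller_enabled and not agent_enabled:
        message = (
            "network policy is enabled in the ConfigMap "
            "(enable-network-policy-controller) but disabled in the node agent "
            "(enable-network-policy); enable both or disable both"
        )
        # The log line carries the actionable hint; the failure itself is kept.
        logger.error(message)
        raise ConfigMismatchError(message)
```

The fail-hard behaviour stays exactly as it is; the only difference is that the last log line before the crash tells the admin which two settings disagree.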

Anyway this is my fault for failing to read the doc properly, sorry about that.