aws / aws-network-policy-agent

Network Policy agent reconciler crashes with a runtime error when the pod name contains a dot (.)

ibnjunaid opened this issue

What happened:
The Network Policy agent crashes with the following runtime error when a pod name contains a dot (.).

Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference

Attached logs:

NAMESPACE     NAME                              READY   STATUS             RESTARTS      AGE
default       sample-app.app-7c547489fb-gmnj9   1/1     Running            0             3m46s
kube-system   aws-node-nmxkh                    1/2     CrashLoopBackOff   5 (54s ago)   4m23s
kube-system   coredns-664b6f5f5c-9kq9w          1/1     Running            0             46h
kube-system   coredns-664b6f5f5c-jhzmm          1/1     Running            0             46h
kube-system   kube-proxy-7g9qg                  1/1     Running            0             5h1m
 containerStatuses:
  - containerID: containerd://62da97335e19f10f460f78b12dea2a60e0a8e3938a6a7e5f38b7e72e00fff4cf
    image: 602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon/aws-network-policy-agent:v1.0.2-eksbuild.1
    imageID: 602401143452.dkr.ecr.us-east-2.amazonaws.com/amazon/aws-network-policy-agent@sha256:71fbb862ba51217f4c8a22502cba6fa8baa098b80590ea753378694b7c82a4db
    lastState:
      terminated:
        containerID: containerd://62da97335e19f10f460f78b12dea2a60e0a8e3938a6a7e5f38b7e72e00fff4cf
        exitCode: 2
        finishedAt: "2023-10-30T09:55:23Z"
        reason: Error
        startedAt: "2023-10-30T09:55:23Z"
    name: aws-eks-nodeagent
    ready: false
    restartCount: 6
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=aws-eks-nodeagent pod=aws-node-nmxkh_kube-system(ba6898d2-9c89-45e3-8202-2d97012e760f)
        reason: CrashLoopBackOff
{"level":"info","timestamp":"2023-10-30T09:46:22.273Z","msg":"Starting EventSource","controller":"policyendpoint","controllerGroup":"networking.k8s.aws","controllerKind":"PolicyEndpoint","source":"kind source: *v1alpha1.PolicyEndpoint"}
{"level":"info","timestamp":"2023-10-30T09:46:22.273Z","msg":"Starting Controller","controller":"policyendpoint","controllerGroup":"networking.k8s.aws","controllerKind":"PolicyEndpoint"}
{"level":"info","timestamp":"2023-10-30T09:46:22.384Z","msg":"Starting workers","controller":"policyendpoint","controllerGroup":"networking.k8s.aws","controllerKind":"PolicyEndpoint","worker count":1}
{"level":"info","timestamp":"2023-10-30T09:46:22.385Z","logger":"controllers.policyEndpoints","msg":"Received a new reconcile request","req":{"name":"deny-all-ingress-rcsxk","namespace":"default"}}
{"level":"info","timestamp":"2023-10-30T09:46:22.385Z","logger":"controllers.policyEndpoints","msg":"Processing Policy Endpoint  ","Name: ":"deny-all-ingress-rcsxk","Namespace ":"default"}
{"level":"info","timestamp":"2023-10-30T09:46:22.385Z","logger":"controllers.policyEndpoints","msg":"Found a matching Pod: ","name: ":"sample-app.app-7d6fd579dd-tm7gn","namespace: ":"default"}
{"level":"info","timestamp":"2023-10-30T09:46:22.385Z","logger":"controllers.policyEndpoints","msg":"Derived ","Pod identifier: ":"sample-app.app-7d6fd579dd-default"}
{"level":"info","timestamp":"2023-10-30T09:46:22.385Z","logger":"controllers.policyEndpoints","msg":"Total number of PolicyEndpoint resources for","podIdentifier ":"sample-app.app-7d6fd579dd-default"," are ":1}
{"level":"info","timestamp":"2023-10-30T09:46:22.385Z","logger":"controllers.policyEndpoints","msg":"Deriving Firewall rules for PolicyEndpoint:","Name: ":"deny-all-ingress-rcsxk"}
{"level":"info","timestamp":"2023-10-30T09:46:22.385Z","logger":"controllers.policyEndpoints","msg":"Total no.of - ","ingressRules":0,"egressRules":0}
{"level":"info","timestamp":"2023-10-30T09:46:22.385Z","logger":"controllers.policyEndpoints","msg":"Default Deny enabled on Ingress"}
{"level":"info","timestamp":"2023-10-30T09:46:22.385Z","logger":"controllers.policyEndpoints","msg":"No Egress rules and no egress isolation - Appending catch all entry"}
{"level":"info","timestamp":"2023-10-30T09:46:22.385Z","logger":"controllers.policyEndpoints","msg":"Processing Pod: ","name:":"sample-app.app-7d6fd579dd-tm7gn","namespace:":"default","podIdentifier: ":"sample-app.app-7d6fd579dd-default"}
{"level":"info","timestamp":"2023-10-30T09:46:22.385Z","logger":"ebpf-client","msg":"AttacheBPFProbes for","pod":"sample-app.app-7d6fd579dd-tm7gn"," in namespace":"default"," with hostVethName":"enifccdc72df53"}
{"level":"info","timestamp":"2023-10-30T09:46:22.385Z","logger":"ebpf-client","msg":"Load the eBPF program"}
{"level":"info","timestamp":"2023-10-30T09:46:22.386Z","msg":"Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference","controller":"policyendpoint","controllerGroup":"networking.k8s.aws","controllerKind":"PolicyEndpoint","PolicyEndpoint":{"name":"deny-all-ingress-rcsxk","namespace":"default"},"namespace":"default","name":"deny-all-ingress-rcsxk","reconcileID":"988d852f-07fb-4ef2-9dfa-b21f263699ab"}

Sample Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    io.kompose.service: sample-app
  name: sample-app.app
spec:
  replicas: 1
  selector:
    matchLabels:
      io.kompose.service: sample-app
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        io.kompose.service: sample-app
    spec:
      serviceAccountName: default
      containers:
        - name: nginx
          image: nginx
          ports:
            - containerPort: 80
          resources: {}

Sample Policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
spec:
  podSelector: 
    matchLabels: {}
  policyTypes: ["Ingress"] 
  ingress:
    - from:
        - podSelector:
            matchExpressions:
              - key: run
                values: 
                  - ns1
                  - ns2
                  - ns3
                operator: In

I was able to repro this -

{"level":"info","ts":"2023-10-30T20:55:06.503Z","logger":"ebpf-client","caller":"controllers/policyendpoints_controller.go:232","msg":"AttacheBPFProbes for","pod":"hello-udp.app-5d5567585b-s4gx9"," in namespace":"default"," with hostVethName":"eniba5486f0e22"}
{"level":"info","ts":"2023-10-30T20:55:06.503Z","logger":"ebpf-client","caller":"ebpf/bpf_client.go:412","msg":"Load the eBPF program"}
{"level":"info","ts":"2023-10-30T20:55:06.503Z","caller":"runtime/panic.go:261","msg":"Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference","controller":"policyendpoint","controllerGroup":"networking.k8s.aws","controllerKind":"PolicyEndpoint","PolicyEndpoint":{"name":"test-network-policy-block-ingress-cqt2f","namespace":"default"},"namespace":"default","name":"test-network-policy-block-ingress-cqt2f","reconcileID":"6d9c6641-4bc5-48a0-a27f-5b2459fdabff"}

This happens in the SDK's file-exists check on the pin path. The likely cause is the string formatting of the pin path that the agent passes to the SDK.

kubectl logs aws-node-pmjkf -n kube-system -c aws-eks-nodeagent
{"level":"info","ts":"2023-10-30T20:50:01.340Z","caller":"runtime/asm_amd64.s:1650","msg":"version","GitVersion":"","GitCommit":"","BuildDate":""}
2023-10-30 20:50:01.401730295 +0000 UTC Logger.check error: failed to get caller
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x55c628b4f216]

goroutine 58 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.2/pkg/internal/controller/controller.go:116 +0x1e5
panic({0x55c62a690b40?, 0x55c62bae83e0?})
	/root/sdk/go1.21.3/src/runtime/panic.go:914 +0x21f
github.com/aws/aws-ebpf-sdk-go/pkg/utils.IsfileExists({0xc0007a40a0?, 0x0?})
	/go/pkg/mod/github.com/aws/aws-ebpf-sdk-go@v1.0.3/pkg/utils/utils.go:90 +0x56
github.com/aws/aws-ebpf-sdk-go/pkg/maps.(*BpfMap).PinMap(0xc000998c58, {0xc0007a40a0, 0x49}, 0x7ba3d0?)
	/go/pkg/mod/github.com/aws/aws-ebpf-sdk-go@v1.0.3/pkg/maps/loader.go:232 +0x56
github.com/aws/aws-ebpf-sdk-go/pkg/maps.(*BpfMap).CreateBPFMap(0x1?, {0xb, 0x8, 0x120, 0x10000, 0x1, 0x0, 0xc000012c90, 0x0, {0xc0007924f0, ...}})
	/go/pkg/mod/github.com/aws/aws-ebpf-sdk-go@v1.0.3/pkg/maps/loader.go:219 +0x57c
github.com/aws/aws-ebpf-sdk-go/pkg/elfparser.(*elfLoader).loadMap(0xc0009991f8, {0xc00050ebc0, 0x1, 0xc0004bab60?})
	/go/pkg/mod/github.com/aws/aws-ebpf-sdk-go@v1.0.3/pkg/elfparser/elf.go:165 +0x27d
github.com/aws/aws-ebpf-sdk-go/pkg/elfparser.(*elfLoader).doLoadELF(0xc0009991f8)
	/go/pkg/mod/github.com/aws/aws-ebpf-sdk-go@v1.0.3/pkg/elfparser/elf.go:600 +0x65
github.com/aws/aws-ebpf-sdk-go/pkg/elfparser.(*bpfSDKClient).LoadBpfFile(0xc00004e380, {0x55c6299fa71b?, 0xc00078f400?}, {0xc000500220, 0x20})
	/go/pkg/mod/github.com/aws/aws-ebpf-sdk-go@v1.0.3/pkg/elfparser/elf.go:140 +0x1d9
github.com/aws/aws-network-policy-agent/pkg/ebpf.(*bpfClient).loadBPFProgram(0xc000140000, {0x55c6299fa71b, 0x12}, {0x55c6299ea87b, 0x7}, {0xc000500220, 0x20})
	/workspace/pkg/ebpf/bpf_client.go:619 +0xef
github.com/aws/aws-network-policy-agent/pkg/ebpf.(*bpfClient).attachIngressBPFProbe(0xc000140000, {0xc000792240, 0xe}, {0xc000500220, 0x20})
	/workspace/pkg/ebpf/bpf_client.go:504 +0x1f6

Btw, it is not recommended to have a "." in the name since it will interfere with DNS host names. You would see the below warning -

Warning: metadata.name: this is used in Pod names and hostnames, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]

os.Stat fails when the file name contains a dot (.), most probably because the dot is treated as a file type in the /sys/fs/bpf filesystem. touch fails in the same way -

sudo touch test.app_ingress_map
touch: setting times of ‘test.app_ingress_map’: Operation not permitted
{"level":"info","ts":"2023-10-30T22:10:02.515Z","caller":"maps/loader.go:232","msg":"Stat failed"}
panic: stat /sys/fs/bpf/globals/aws/maps/test.app_ingress_map: operation not permitted
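
The stack trace above ends in the SDK's file-exists helper, and the stat call returns "operation not permitted" rather than "no such file". One common way such a helper ends up with a nil pointer dereference is by handling only the not-exist case and then reading the FileInfo even though os.Stat returned a different error. The following is a minimal sketch of a check that surfaces every stat error instead; `fileExists` is a hypothetical name and this is not the SDK's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// fileExists is a hypothetical helper (not the SDK's IsfileExists).
// It reports whether path names an existing non-directory file and
// returns any stat error other than "not exist" to the caller.
func fileExists(path string) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			// A missing file is an expected outcome, not a failure.
			return false, nil
		}
		// For errors such as EPERM, info is nil and must not be dereferenced.
		return false, err
	}
	return !info.IsDir(), nil
}

func main() {
	paths := []string{
		"/sys/fs/bpf/globals/aws/maps/test.app_ingress_map", // path taken from the log above
		"/no/such/file",
	}
	for _, p := range paths {
		ok, err := fileExists(p)
		fmt.Printf("%s: exists=%t err=%v\n", p, ok, err)
	}
}
```

With this shape the caller can log the error and skip the pod instead of crashing the reconciler.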

@jayanthvn

One use case I often see is when someone tries to debug a node with the following command.

kubectl debug node/<IP>.<REGION>.compute.internal -it --image=nicolaka/netshoot

This creates a pod with "." in its name, which causes the network policy agent to crash.

@ibnjunaid - Right now we are limited by what os.Stat allows, because we need to create the pin path in the BPF filesystem. We will explore some options and let you know. It shouldn't crash, though, and we will fix that, but allowing "." needs some alternate handling.
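
One possible form of the "alternate handling" mentioned above is to make the pod identifier safe before it is used in a pin path. The sketch below derives the identifier the way the logs earlier in this issue suggest (pod name minus its trailing suffix, plus the namespace) and then replaces dots with underscores. Both function names are hypothetical, and whether the released fix works this way is an assumption, not something this thread confirms.

```go
package main

import (
	"fmt"
	"strings"
)

// podIdentifier is a hypothetical reconstruction of the pattern seen in the
// logs: "sample-app.app-7d6fd579dd-tm7gn" in namespace "default" becomes
// "sample-app.app-7d6fd579dd-default".
func podIdentifier(podName, namespace string) string {
	prefix := podName
	if i := strings.LastIndex(podName, "-"); i >= 0 {
		prefix = podName[:i] // drop the trailing per-pod suffix
	}
	return prefix + "-" + namespace
}

// pinSafe is a hypothetical helper that replaces dots so the identifier
// can be used as a file name under /sys/fs/bpf.
func pinSafe(id string) string {
	return strings.ReplaceAll(id, ".", "_")
}

func main() {
	id := podIdentifier("sample-app.app-7d6fd579dd-tm7gn", "default")
	fmt.Println(id)
	fmt.Println(pinSafe(id))
}
```

Underscore is a reasonable replacement character in this sketch because Kubernetes object names cannot contain it, so a sanitized identifier should not collide with that of another pod.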

@jayanthvn any updates on this?

Hi @mjnovice, I have a PR open for resolving this issue, the PR will be getting reviewed soon.

Any update on this?

The fix is released with network policy agent v1.1.2 (https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.18.2). Please test and let us know if there are any issues.