Tetragon with ociHookSetup.enabled cannot start in a namespace other than kube-system
f1ko opened this issue
What happened?
Description
When the OCI hook feature is enabled, an exception is added for all Pods in the kube-system
namespace so that Tetragon itself can start, since the hook depends on a running Tetragon agent.
However, this results in a deadlock scenario if Tetragon is being deployed in any other namespace.
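The deadlock can be illustrated with a small model of the behaviour described above. This is only a sketch, not the actual hook implementation: the function name `hook_decision` and its `yes`/`no` argument are hypothetical, and the only assumption taken from this report is that the hook fails when the agent is unreachable unless the Pod is in kube-system.

```shell
# Hypothetical model of the fail-closed behaviour described in this issue.
# Usage: hook_decision NAMESPACE AGENT_REACHABLE   (AGENT_REACHABLE is yes or no)
# Prints "allow" if the container would be created, "deny" otherwise.
hook_decision() {
    namespace="$1"
    agent_reachable="$2"

    # If the agent answers, the hook succeeds and the container is created.
    if [ "$agent_reachable" = "yes" ]; then
        echo "allow"
        return 0
    fi

    # Agent unreachable: only Pods in the excepted namespace may proceed.
    case "$namespace" in
        kube-system) echo "allow" ;;
        *)           echo "deny" ;;  # modelled failure; everything else is blocked
    esac
}

# The Tetragon Pod deployed in the "tetragon" namespace, before the agent is up:
hook_decision tetragon no      # prints "deny"
# The same Pod if it were deployed in kube-system:
hook_decision kube-system no   # prints "allow"
```

In this model, the agent Pod in the tetragon namespace is denied because the agent is not yet reachable, and the agent can never become reachable because its Pod is denied.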
Reproduction
Install Tetragon with the oci-hook feature enabled in a namespace other than kube-system:
kubectl create ns tetragon
helm install --namespace tetragon \
--set tetragonOperator.image.override=localhost/cilium/tetragon-operator:latest \
--set tetragon.image.override=localhost/cilium/tetragon:latest \
--set tetragon.grpc.address="unix:///var/run/cilium/tetragon/tetragon.sock" \
--set tetragon.ociHookSetup.enabled=true \
tetragon ./install/kubernetes/tetragon
The init container starts as expected and configures the oci-hook.
However, the agent can then never start: the oci-hook cannot reach the agent, and the only exception is for Pods in the kube-system namespace:
$ kubectl describe pod -n tetragon tetragon-tctms
[...]
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 54s default-scheduler Successfully assigned tetragon/tetragon-tctms to minikube
Normal Pulling 53s kubelet Pulling image "localhost/cilium/tetragon:latest"
Normal Pulled 53s kubelet Successfully pulled image "localhost/cilium/tetragon:latest" in 11ms (11ms including waiting). Image size: 215976744 bytes.
Normal Created 53s kubelet Created container oci-hook-setup
Normal Started 53s kubelet Started container oci-hook-setup
Warning Failed 43s kubelet Error: container create failed: time="2024-04-29T11:36:18Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: "
Warning Failed 33s kubelet Error: container create failed: time="2024-04-29T11:36:28Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: "
Normal Pulled 21s (x2 over 53s) kubelet Container image "quay.io/cilium/hubble-export-stdout:v1.0.4" already present on machine
Normal Pulled 11s (x2 over 43s) kubelet Container image "localhost/cilium/tetragon:latest" already present on machine
Warning Failed 11s kubelet Error: container create failed: time="2024-04-29T11:36:50Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: "
Warning Failed 1s kubelet Error: container create failed: time="2024-04-29T11:37:00Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: "
Consequence
This leaves the cluster in a bad state, as no Pods (other than those in kube-system) can be created, including Tetragon itself.
The only way to restore cluster functionality at this point is by removing the oci-hook configuration on the node.
Tetragon Version
$ tetra version
CLI version: v1.1.0-pre.0-794-gbdcb413f0
Kernel Version
$ uname -a
Linux minikube 6.4.16 #1 SMP Mon Sep 18 21:45:38 UTC 2023 aarch64 Linux
Kubernetes Version
No response
Bugtool
No response
Relevant log output
No response
Anything else?
No response