Traffic agent fails to start up due to security context

bpfoster opened this issue 6 months ago · comments

Describe the bug
Beginning in telepresence v2.15.1, trying to intercept a service whose container defines a restrictive securityContext leads to the traffic agent failing to start. This looks to be caused by the changes in that version to agentconfig.container, where the security context of the first container is now copied to the traffic agent.

To Reproduce
Steps to reproduce the behavior:

When I run telepresence intercept "${SERVICE_NAME}"
I see

telepresence intercept: error: connector.CreateIntercept: Back-off restarting failed container traffic-agent in pod SERVICE_NAME-cd6f7df8d-mzwsm_default(ca141105-a99b-4f59-aaea-8e0ff9159427)
The logs of Pod SERVICE_NAME-cd6f7df8d-mzwsm might provide more detai

So I look at the logs for the traffic-agent container in that pod and see

exec /usr/local/bin/traffic: operation not permitted

Expected behavior
A successful intercept.

Versions (please complete the following information):

Output of telepresence version

OSS Client         : v2.18.0
OSS Root Daemon    : v2.18.0
OSS User Daemon    : v2.18.0
OSS Traffic Manager: v2.18.0
Traffic Agent      : docker.io/datawire/tel2:2.18.0

Operating system of workstation running telepresence commands

NAME="Red Hat Enterprise Linux"
VERSION="8.8 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.8 (Ootpa)"

Kubernetes environment and Version [e.g. Minikube, bare metal, Google Kubernetes Engine]

kind v0.21.0 go1.20.13 linux/amd64

Additional context
The security context of the main container, which I can see is copied to the traffic-agent container, is:

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
    - ALL
  readOnlyRootFilesystem: true
  runAsGroup: 10001
  runAsNonRoot: true
  runAsUser: 10001
  seccompProfile:
    type: RuntimeDefault

Thomas Hallgren · Answer 1 · Wed Feb 14 2024 06:52:28 GMT+0800 (China Standard Time)

I'm not sure what the best solution would be to circumvent this. If a restrictive securityContext is desired, then I'd imagine that in most cases, it would be considered bad to lessen the security for the injected traffic-agent.

What solution would you envision for this? Would overriding the agent's security context with a pre-defined one that the traffic-manager always uses suffice? Or perhaps an annotation describing it in the workload?

Ben Foster · Answer 2 · Wed Feb 14 2024 20:54:36 GMT+0800 (China Standard Time)

Generally we allow each container's security context to be configured individually. Most of our containers can run in a very locked-down context like this, but some require a less restrictive one, so we configure those contexts as needed, trying to be as restrictive as possible while still allowing the necessary functionality.

It would appear that the agent is one such use case that cannot function in such a context. Looks like it might need some capability that was dropped.
I think the first question is, are there any changes that can be made to the agent that allow it to run without the added capabilities? My guess is this is not a viable route.

Our use case is fairly simple. We don't run telepresence in the production clusters, so don't have any requirement on the agent running in a specific security context. From that perspective, I don't have a concern with it running without a context at all, as it did prior to v2.15.1.

If you want to err on the side of being more restrictive by default, and are able to determine an appropriate context for the agent, I think setting the agent's context to a pre-defined one that the traffic-manager defines would be a great option.

If there is some more complex use case where users need to provide a specific context, allowing this via an annotation (or another configuration mechanism; I don't know, but wouldn't this likely be a global configuration rather than per-workload?) could be another option. But personally I believe this should be an override. The best default behavior (IMHO) would be to set either an empty context or a pre-defined, known-working context, so that the agent works out of the box.

Thomas Hallgren · Answer 3 · Thu Feb 15 2024 00:14:57 GMT+0800 (China Standard Time)

@bpfoster Thanks for your elaborate answer. I'll see what we can do about this.

The reason we're now setting a securityContext is that some of our customers have policies in place that require this, but I see no problem with adding a Helm-chart value that would override it, nor do I see any problem with letting such an override be an empty context.
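
A minimal sketch of the override logic discussed here, assuming a simplified `SecurityContext` struct rather than the real Kubernetes type. The names `agentSecurityContext`, `clone`, and `copyPtr` are hypothetical and are not taken from the Telepresence code base. The idea is only that a configured override, even an empty one, replaces the context copied from the application container:

```go
package main

import "fmt"

// SecurityContext is a simplified, hypothetical stand-in for the Kubernetes
// container security context. Only a few fields are modelled.
type SecurityContext struct {
	RunAsUser              *int64
	RunAsNonRoot           *bool
	ReadOnlyRootFilesystem *bool
	DropCapabilities       []string
}

// copyPtr returns a pointer to a copy of the pointed-to value, or nil.
func copyPtr[T any](p *T) *T {
	if p == nil {
		return nil
	}
	v := *p
	return &v
}

// clone returns a deep copy so the agent never shares state with the source.
func (sc *SecurityContext) clone() *SecurityContext {
	if sc == nil {
		return nil
	}
	return &SecurityContext{
		RunAsUser:              copyPtr(sc.RunAsUser),
		RunAsNonRoot:           copyPtr(sc.RunAsNonRoot),
		ReadOnlyRootFilesystem: copyPtr(sc.ReadOnlyRootFilesystem),
		DropCapabilities:       append([]string(nil), sc.DropCapabilities...),
	}
}

// agentSecurityContext picks the context for the injected traffic-agent.
// A non-nil override always wins, even when it is empty. With no override,
// the application container's context is copied, which is the behavior
// reported in this issue.
func agentSecurityContext(app, override *SecurityContext) *SecurityContext {
	if override != nil {
		return override.clone()
	}
	return app.clone()
}

func main() {
	uid := int64(10001)
	yes := true
	app := &SecurityContext{
		RunAsUser:              &uid,
		RunAsNonRoot:           &yes,
		ReadOnlyRootFilesystem: &yes,
		DropCapabilities:       []string{"ALL"},
	}

	// No override: the agent inherits the restrictive context.
	fmt.Println(agentSecurityContext(app, nil).DropCapabilities)

	// Empty override: the agent runs without any restrictions from the app.
	fmt.Println(agentSecurityContext(app, &SecurityContext{}).DropCapabilities)
}
```

Distinguishing a nil override from an empty one is what allows "no context at all" to be expressed as a deliberate choice rather than as the absence of configuration.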

Ben Foster · Answer 4 · Thu Apr 04 2024 04:03:08 GMT+0800 (China Standard Time)

Hi @thallgren, I hope you don't mind, but I took a stab at this in PR #3563 based on the approach you described. It seemed like a pretty straightforward change.

I do not have an account on the Slack workspace, so I could not post to the channel as requested in the PR template.