xdp-project / xdp-tools

Utilities and example programs for use with XDP

[libxdp] component programs, multiple network namespaces and duplicate netdev indices?

thediveo opened this issue

Section "Locking and pinning" of "Protocol for atomic loading of multi-prog dispatchers" details the bpffs path names used for pinning. According to documentation, these are in the form /sys/fs/bpf/xdp/dispatch-IFINDEX-DID:

  • IFINDEX netdev index
  • DID component program

Is this current scheme safe for use with multiple network namespaces and a single instance of the bpffs? From what I have seen in the past, interface indices aren't unique across network namespaces.
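
To illustrate the concern, here is a minimal sketch (not part of libxdp; "eth0" and the dispatcher ID 42 are made-up examples, the path format is the one quoted above) of how two netdevs in different network namespaces can resolve to the very same pin path on a shared bpffs:

```c
/* Sketch: two netdevs in different network namespaces can share an
 * ifindex, so a pin path keyed only on IFINDEX collides on a shared
 * bpffs instance.  "eth0" and the dispatcher ID 42 are made up. */
#include <limits.h>
#include <net/if.h>
#include <stdio.h>

int main(void)
{
	char path[PATH_MAX];
	unsigned int ifindex = if_nametoindex("eth0"); /* resolved in *this* netns */

	if (!ifindex) {
		perror("if_nametoindex");
		return 1;
	}
	/* The same ifindex value can exist again in another netns,
	 * producing exactly the same pin path. */
	snprintf(path, sizeof(path), "/sys/fs/bpf/xdp/dispatch-%u-%u", ifindex, 42u);
	printf("pin path: %s\n", path);
	return 0;
}
```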

Are there any plans on making libxdp netns-aware?

What would be the recommendation for deploying container workloads that use libxdp?

I don't know that it is fair to say there are no official recommendations about root in containers. Here is Red Hat Podman/container security expert Dan Walsh's article Just Say No to Root in Containers.
https://opensource.com/article/18/3/just-say-no-root-containers

I think the relevant question to ask yourself might be: is there anything about a theoretical XDP container that negates Dan Walsh's arguments? I have a feeling Dan Walsh would probably just say no. But I've not asked him myself.

The discussion about root is mostly moot when it comes to capabilities; the main reason for not using UID 0 is that VFS DAC has not aged too well into the container era. My own open source namespace and mount discovery engine lxkns runs as UID != 0 with a small set of highly useful capabilities and has access to the whole system, in combination with the host PID namespace. The root advice is blown out of proportion instead of being understood for what it actually is: one lesser line of defense in a defense-in-depth design.

Therefore I would rather not go off on a tangent about Dan's work here. The libxdp kernel paths don't require UID 0; they require CAP_NET_RAW for creating an XSK and CAP_BPF on rather new kernels, or alternatively CAP_SYS_ADMIN on the kernels commonly in use.
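
Just as an illustration (assuming libcap is available; CAP_BPF only exists on kernels >= 5.8, hence the fallback define), a sketch of checking whether the current process actually carries those capabilities:

```c
/* Sketch: check the effective capability set for the capabilities
 * discussed above (CAP_NET_RAW, CAP_BPF, CAP_SYS_ADMIN).  Assumes
 * libcap (link with -lcap); CAP_BPF requires Linux >= 5.8. */
#include <stdio.h>
#include <sys/capability.h>

#ifndef CAP_BPF
#define CAP_BPF 39	/* only defined since Linux 5.8 */
#endif

static int has_cap(cap_t caps, cap_value_t cap)
{
	cap_flag_value_t v = CAP_CLEAR;

	if (cap_get_flag(caps, cap, CAP_EFFECTIVE, &v))
		return 0;
	return v == CAP_SET;
}

int main(void)
{
	cap_t caps = cap_get_proc();

	if (!caps)
		return 1;
	printf("CAP_NET_RAW:   %d\n", has_cap(caps, CAP_NET_RAW));
	printf("CAP_BPF:       %d\n", has_cap(caps, CAP_BPF));
	printf("CAP_SYS_ADMIN: %d\n", has_cap(caps, CAP_SYS_ADMIN));
	cap_free(caps);
	return 0;
}
```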

There is the often overlooked pattern of sidecar containers that might be used here for privilege separation.

What worries me instead: if the container gets OOM'd for some reason, or -9'ed for some other, what happens to the things libxdp leaves behind? The netdev won't die but will instead return to the initial network namespace, and by then the pinning is gone!

Looking at an old LWN post "Persistent BPF Objects":

So what we have instead is yet another special kernel virtual filesystem. This one is meant to be mounted at /sys/fs/bpf. It is a singleton filesystem, meaning that it can be mounted multiple times within a single namespace and every mount will see the same directory tree. Each mount namespace will, however, get its own version of this filesystem.

It wouldn't be a proper party without CAP_SYS_ADMIN, it seems.

So the dispatcher survives, as it is still claimed by its netdev.

What about the XDP component programs loaded into the dispatcher? Aren't these now still claimed by the dispatcher? You say so, but then this part wouldn't be properly reference-managed, would it? Or does freplace introduce some new reference/lifetime semantics that allow components to be reclaimed, with the dispatcher not actually reference-counting them?

If not, then I'm obviously lacking a suitable mental model of what is going on here, not that this would surprise me.
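
For context, this is roughly how a component program ends up in the dispatcher through libxdp; a hedged sketch assuming the usual libxdp API, where "my_filter.o", "eth0" and the minimal error handling are illustrative only:

```c
/* Hedged sketch: attaching a component program through libxdp.  The
 * library creates or extends the multi-prog dispatcher on the
 * interface behind the scenes and pins the program references under
 * the bpffs paths discussed above. */
#include <net/if.h>
#include <stdio.h>
#include <xdp/libxdp.h>

int main(void)
{
	int ifindex = if_nametoindex("eth0");	/* example interface */
	struct xdp_program *prog;
	int err;

	prog = xdp_program__open_file("my_filter.o", NULL /* first prog */, NULL);
	if (libxdp_get_error(prog)) {
		fprintf(stderr, "failed to open program\n");
		return 1;
	}

	/* Becomes one component of the (possibly shared) dispatcher. */
	err = xdp_program__attach(prog, ifindex, XDP_MODE_NATIVE, 0);
	if (err) {
		fprintf(stderr, "attach failed: %d\n", err);
		xdp_program__close(prog);
		return 1;
	}

	/* ... do work ... */

	/* Removes only this component; the dispatcher stays around as
	 * long as other components remain attached. */
	xdp_program__detach(prog, ifindex, XDP_MODE_NATIVE, 0);
	xdp_program__close(prog);
	return 0;
}
```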

I would generally advise people to just treat CAP_BPF as root anyway; there have been way too many privilege escalation attacks to assume otherwise...

Without going into the weeds too much: the most obvious use case to me here is load balancing/BGP. In that case, if your load balancing is going to benefit a lot from XDP, then for security's sake maybe just run your load balancers in VMs, separate from the K8s cluster. So it looks like, for the time being, you'll have to weigh that performance tradeoff. Please correct me if you disagree.

With the concerns you raise, those that remain anyway, maybe learning from Facebook's Katran load balancer is the way to go. It uses XDP. It's a project I've been meaning to explore more in depth, and it is officially listed on the ebpf.io/applications page. Even if you're using XDP for something other than load balancing, it seems they have already thought through a lot of things.

So really, what this means is that to use libxdp successfully, the container orchestration system needs to make sure each container has its own mount namespace + bpffs instance, and that it stays around for the lifetime of the container.
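
A minimal sketch of what that per-container "own mount namespace + bpffs instance" boils down to at the syscall level (assuming a single-threaded setup helper with CAP_SYS_ADMIN; /sys/fs/bpf is just the conventional mount point, not a requirement):

```c
/* Sketch: create a new mount namespace and mount a private bpffs
 * instance in it.  Needs CAP_SYS_ADMIN. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* New mount namespace for this (single-threaded) process. */
	if (unshare(CLONE_NEWNS)) {
		perror("unshare");
		return 1;
	}
	/* Keep our mounts from propagating back to the host. */
	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL)) {
		perror("mount MS_PRIVATE");
		return 1;
	}
	/* A fresh bpffs instance, private to this mount namespace. */
	if (mount("bpf", "/sys/fs/bpf", "bpf", 0, NULL)) {
		perror("mount bpffs");
		return 1;
	}
	puts("private bpffs mounted at /sys/fs/bpf");
	return 0;
}
```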

I'm afraid that this doesn't suffice.

First, the lifetime of a container is a really difficult concept. Do we refer to the lifetime of the initial process "inside" the container? Or is it the lifetime of the container as seen only by the container engine (to honor Dan, I'm refraining from using the "d" word here)?

Do we refer to the container's file system, and if so, which view: the VFS part in the host, extending down into the pivoted mount namespace of the container? Or...?

And then we have the network namespace (sandbox), which has an independent life cycle, both from the CRI perspective and from Docker's (oh, that's the other "d" word) libnetwork perspective.

The netdev outlives the network namespace (sandbox) life cycle and roams before and after it, at least when we're talking about "physical" NIC netdevs. If a bpffs instance is tied to the container life cycle (one of them), this seems to make things worse. If it isn't, then the ifindices are moot for unique separation and lookup.

Suffice to say that my use case isn't load balancing. Yet, I still agree with @jontrossbach 😁

I'm just a tidy person who needs containers using libxdp to return the system to a clean state (a dispatcher is okay, dangling XDP progs and umems clearly are not), so that the system does not leak or block resources when a workload exits uncleanly.

Ah, that's where I took the wrong turn, assuming that the components are bound to the XDP dispatcher program. Now that components can be reclaimed automatically when the pinning ceases to exist, that would also clean up referenced XSKs, umems, etc. That's good news to me.

Can libxdp at least "recover" when it next finds a netdev with a dispatcher attached but no pinnings anymore? That is, because of a previous unclean container exit the pinning vanished along with the container's private bpffs instance, and now the netdev gets reused for a new workload?

Not sure about the mount namespace approximation, as this doesn't fit into CRI's sandbox model. Then only one container in a pod should ever use libxdp ... but that's actually already a limitation if using a container-private bpffs.

Somehow this doesn't feel quite "right" yet, and that is because of how orchestrators (the "k" word here) like to work, outsourcing the network commissioning to CNI. At the time the network for a pod gets commissioned there aren't any containers yet ... except for the sandbox, but that is absolutely implementation-dependent. So the orchestrator doesn't even have any pod-specific mount namespace yet.

So that seems to throw us back to the square labelled "IFINDEX ambiguity".

As I wrote above, configuration of a network namespace is completely separate from "spawning the containers": in Kubernetes, these are the CNI plugins, and they get called by the orchestrator with a reference to a network namespace to configure, with only the sandbox being open at that point. At this time, there are no containers yet. Docker with its libnetwork also opens the sandbox first, completely independently of any containers, and then runs either its internal "plugins" or alternatively one of the registered standalone network plugins.

This discussion (which I highly appreciate!) seems to confirm my initial "suspicion" that this is something to be done in the initial mount namespace while configuring the sandbox network namespace. Now, that brings me back to my initial issue: how to deal with the situation where libxdp sees duplicate interface indices. I'm hesitant to create, and then have to maintain, multiple mount namespaces from, say, a CNI plugin.

We already have several resources with cookies, namely network namespaces as well as containers. Why not have cookies for netdevs, too? I don't think we have them at the udev level, where we could otherwise steal them from. Unique cookies would allow libxdp to properly tell netdevs apart wherever it finds them. Would that be something that libxdp and the kernel could get in the future, by any chance?

Why? Thinking about it some more, I definitely don't think it's a good idea for everything to be in the same bpffs: that would mean that pods could trivially access the XDP programs of each other's network devices, which would defeat the whole purpose of the pod separation, no?

Agreed. But what worries me here is that we need to give "highly interesting" capabilities to those workloads so that their libxdp instances can operate correctly. The per-container bpffs instances can, if I'm not mistaken, be requested in the deployment using the general (and more powerful) mount settings of Docker and others.

The topic of privilege and potential separation came up much earlier. PROD OPS would, under these design guidelines, enforce breaking the libxdp functionality out of the workloads. If this becomes a service, then I'm again facing the situation where libxdp runs in the host, working with multiple network namespaces. At the moment this would require dynamically maintaining mount namespaces in the case of separate bpffs instances.

That's rather ugly, because mount namespaces do not go well with multi-threaded programs and services. While there are specific situations (unsharing fs attributes in a task) that work, and which I have already explored in my lxkns "mountineer" objects in Go applications, it's easy to hit the non-working cases and then basically need to run single-threaded child processes, with all the complexity of an additional layer of marshalling data back and forth, including file descriptors.

As you mentioned mount point propagation, or at least I assumed so, I pondered this feature some more. But unfortunately I don't see how it would help us. I usually find it to introduce quite some complexity, even though I specifically developed the mount namespace view in my lxkns tool to give a much better view of mount points (and overmounts) together with their propagation directions and groups than having to deal with plain /proc/PID/mountinfo. So this doesn't look like a help to me at the moment.

We already have several resources with cookies, namely network namespaces as well as containers. Why not have cookies for netdevs, too? I don't think we have them at the udev level, where we could otherwise steal them from. Unique cookies would allow libxdp to properly tell netdevs apart wherever it finds them. Would that be something that libxdp and the kernel could get in the future, by any chance?
Well, ifindexes are unique in a namespace, and interfaces are explicitly namespaced, so I dunno if a global cookie makes sense? With a compelling use case, it might happen, I suppose.
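
For reference, network namespaces already expose such a cookie to userspace on recent kernels (5.14 and later) via getsockopt(SO_NETNS_COOKIE); a small sketch, with the fallback define only there for older userspace headers:

```c
/* Sketch: read the cookie of the caller's current network namespace.
 * SO_NETNS_COOKIE is available from Linux 5.14 onwards. */
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_NETNS_COOKIE
#define SO_NETNS_COOKIE 71	/* fallback for older userspace headers */
#endif

int main(void)
{
	uint64_t cookie = 0;
	socklen_t len = sizeof(cookie);
	int sock = socket(AF_UNIX, SOCK_DGRAM, 0);

	if (sock < 0)
		return 1;
	if (getsockopt(sock, SOL_SOCKET, SO_NETNS_COOKIE, &cookie, &len)) {
		perror("getsockopt(SO_NETNS_COOKIE)");
		close(sock);
		return 1;
	}
	printf("netns cookie: %llu\n", (unsigned long long)cookie);
	close(sock);
	return 0;
}
```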

Would this then be the right place here or is there a mailing list you would recommend for also giving more background information?

Sorry for chiming in so late...

So there's an existing model for using libxdp with an unprivileged container, and it is being used by the CNDP project. The model is defined through the AF_XDP Device Plugin (DP) and CNI... whereby an external entity (the DP) manages the loading and unloading of the BPF program in the host namespace and then shares the map fd with the unprivileged container via UDS...

@kot-begemot-uk and I are currently looking at integrating separate bpffs mount points per container, injected at pod allocation time, as part of an investigation we completed recently. This means that a "unique" mount point per container will be used for pinning BPF objects relevant to that pod/container. These mount points will be managed by the DP. Note: the investigation just used /tmp for POC purposes.

Regarding the ifindex not being unique: the proposed solution in the AF_XDP device plugin is to provision and maintain unique interface names that are used in both the host and the pod; the DP keeps track of these interfaces. That way an interface can be tracked even if its ifindex changes across namespaces. The correct ifindex can be retrieved in any namespace by calling if_nametoindex(ifname). The device plugin is used to load the eBPF program in the host namespace before the pod is launched. The Downward API is used to tell the application which interface has an AF_XDP BPF program loaded on it.

Current behaviour of the device plugin:
After the CNDP pod is launched, the XSK_MAP FD is passed from the device plugin to the CNDP pod (over a shared UNIX domain socket) so that it can create the AF_XDP sockets it needs without any extra privileges. This model requires that the netdev name (ifname) is fixed across the different network namespaces, so that there is a consistent map of interfaces regardless of ifindex clashes.
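
For readers unfamiliar with the mechanism, a minimal sketch (not the device plugin's actual code; the connected socket and the map fd are assumed to already exist on the privileged side) of handing a map fd to another process over a Unix domain socket with SCM_RIGHTS:

```c
/* Sketch: pass a BPF map fd over a Unix domain socket using
 * SCM_RIGHTS ancillary data.  "sock" is a connected AF_UNIX socket,
 * "map_fd" the map fd obtained by the privileged side. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_map_fd(int sock, int map_fd)
{
	char dummy = 'X';	/* sendmsg() needs at least one byte of data */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {			/* ensures proper cmsg alignment */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &map_fd, sizeof(int));

	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}
```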

Work in progress:
After the CNDP pod is launched, the XSK_MAP FD is accessed by CNDP in the pod via a unique mount point: a bpffs for that pod with a pinned xsk_map. That mount point is added to the pod at pod allocation time via the device plugin. CNDP can then go ahead and create the AF_XDP sockets without needing admin privileges. Note: you will need to set appropriate seccomp policies to allow BPF syscalls in your container.
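
On the pod side, picking up the pinned map is then a single libbpf call; a sketch with a hypothetical pin path:

```c
/* Sketch: open the pinned xsk_map from the pod's private bpffs mount.
 * The pin path is made up for illustration. */
#include <bpf/bpf.h>
#include <stdio.h>

int main(void)
{
	int map_fd = bpf_obj_get("/sys/fs/bpf/xsk_map");	/* hypothetical pin path */

	if (map_fd < 0) {
		perror("bpf_obj_get");
		return 1;
	}
	printf("xsk_map fd: %d\n", map_fd);
	/* The fd can now be used to register AF_XDP sockets in the shared
	 * XSK map (e.g. via xsk_socket__update_xskmap()), without any
	 * extra privileges for program loading. */
	return 0;
}
```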

We're also looking at integrating the AF_XDP device plugin with BPFd to manage the loading and unloading of the BPF progs for us.

@maryamtahhan whoa, that sounds very much like what I was hoping for!