xdp-project / xdp-tools

Utilities and example programs for use with XDP

[libxdp] component programs, multiple network namespaces and duplicate netdev indices?

thediveo opened this issue

Section "Locking and pinning" of "Protocol for atomic loading of multi-prog dispatchers" details the bpffs path names used for pinning. According to documentation, these are in the form /sys/fs/bpf/xdp/dispatch-IFINDEX-DID:

  • IFINDEX netdev index
  • DID component program

Is this current scheme safe for use with multiple network namespaces and a single instance of the bpffs? From what I have seen in the past, interface indices aren't unique across network namespaces.
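
To illustrate the concern, here is a minimal sketch (not part of libxdp; "eth0" and the dispatcher ID 42 are made-up examples, the path format is the one quoted above) of how two netdevs in different network namespaces can resolve to the very same pin path on a shared bpffs:

```c
/* Sketch: two netdevs in different network namespaces can share an
 * ifindex, so a pin path keyed only on IFINDEX collides on a shared
 * bpffs instance.  "eth0" and the dispatcher ID 42 are made up. */
#include <limits.h>
#include <net/if.h>
#include <stdio.h>

int main(void)
{
	char path[PATH_MAX];
	unsigned int ifindex = if_nametoindex("eth0"); /* resolved in *this* netns */

	if (!ifindex) {
		perror("if_nametoindex");
		return 1;
	}
	/* The same ifindex value can exist again in another netns,
	 * producing exactly the same pin path. */
	snprintf(path, sizeof(path), "/sys/fs/bpf/xdp/dispatch-%u-%u", ifindex, 42u);
	printf("pin path: %s\n", path);
	return 0;
}
```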

Are there any plans on making libxdp netns-aware?

What would be the recommendation for deploying container workloads that use libxdp?

I don't know that it is fair to say there are no official recommendations about root in containers. Here is Red Hat Podman/container security expert Dan Walsh's article Just Say No to Root in Containers.
https://opensource.com/article/18/3/just-say-no-root-containers

I think the relevant question to ask yourself might be: is there anything about a theoretical XDP container that negates Dan Walsh's arguments? I have a feeling Dan Walsh would probably just say no. But I've not asked him myself.

The discussion about root is mostly moot when it comes to capabilities; the main reason for not using UID 0 is that VFS DAC has not aged too well into the container era. My own open source namespace and mount discovery engine lxkns runs as UID != 0 with a small set of highly useful capabilities and has access to the whole system, in combination with the host PID namespace. The root advice is blown out of proportion instead of being understood for what it actually is: one lesser line of defense in a defense-in-depth design.

Therefore I would rather not go off on a tangent about Dan's work here. The libxdp kernel paths don't require UID 0; they require CAP_NET_RAW for creating an XSK and CAP_BPF on rather new kernels, or alternatively CAP_SYS_ADMIN on the kernels commonly in use.
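
Just as an illustration (assuming libcap is available; CAP_BPF only exists on kernels >= 5.8, hence the fallback define), a sketch of checking whether the current process actually carries those capabilities:

```c
/* Sketch: check the effective capability set for the capabilities
 * discussed above (CAP_NET_RAW, CAP_BPF, CAP_SYS_ADMIN).  Assumes
 * libcap (link with -lcap); CAP_BPF requires Linux >= 5.8. */
#include <stdio.h>
#include <sys/capability.h>

#ifndef CAP_BPF
#define CAP_BPF 39	/* only defined since Linux 5.8 */
#endif

static int has_cap(cap_t caps, cap_value_t cap)
{
	cap_flag_value_t v = CAP_CLEAR;

	if (cap_get_flag(caps, cap, CAP_EFFECTIVE, &v))
		return 0;
	return v == CAP_SET;
}

int main(void)
{
	cap_t caps = cap_get_proc();

	if (!caps)
		return 1;
	printf("CAP_NET_RAW:   %d\n", has_cap(caps, CAP_NET_RAW));
	printf("CAP_BPF:       %d\n", has_cap(caps, CAP_BPF));
	printf("CAP_SYS_ADMIN: %d\n", has_cap(caps, CAP_SYS_ADMIN));
	cap_free(caps);
	return 0;
}
```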

There is the often overlooked pattern of sidecar containers that might be used here for privilege separation.

What worries me instead: if the container gets OOM'd for some reason, or -9'ed for some other, what happens to the things libxdp leaves behind? The netdev won't die but will instead return to the initial network namespace, and by then the pinning is gone!

Looking at an old LWN post "Persistent BPF Objects":

So what we have instead is yet another special kernel virtual filesystem. This one is meant to be mounted at /sys/fs/bpf. It is a singleton filesystem, meaning that it can be mounted multiple times within a single namespace and every mount will see the same directory tree. Each mount namespace will, however, get its own version of this filesystem.

It wouldn't be a proper party without CAP_SYS_ADMIN, it seems.

So the dispatcher survives, as it is still claimed by its netdev.

What about the XDP component programs loaded into the dispatcher? Aren't these now still claimed by the dispatcher? You say so, but then this part wouldn't be properly reference-managed, would it? Or does freplace introduce some new reference/lifetime semantics that allow components to be reclaimed, with the dispatcher not actually reference-counting them?

If not, then I'm obviously lacking a suitable mental model of what is going on here, not that this would surprise me.
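
For context, this is roughly how a component program ends up in the dispatcher through libxdp; a hedged sketch assuming the usual libxdp API, where "my_filter.o", "eth0" and the minimal error handling are illustrative only:

```c
/* Hedged sketch: attaching a component program through libxdp.  The
 * library creates or extends the multi-prog dispatcher on the
 * interface behind the scenes and pins the program references under
 * the bpffs paths discussed above. */
#include <net/if.h>
#include <stdio.h>
#include <xdp/libxdp.h>

int main(void)
{
	int ifindex = if_nametoindex("eth0");	/* example interface */
	struct xdp_program *prog;
	int err;

	prog = xdp_program__open_file("my_filter.o", NULL /* first prog */, NULL);
	if (libxdp_get_error(prog)) {
		fprintf(stderr, "failed to open program\n");
		return 1;
	}

	/* Becomes one component of the (possibly shared) dispatcher. */
	err = xdp_program__attach(prog, ifindex, XDP_MODE_NATIVE, 0);
	if (err) {
		fprintf(stderr, "attach failed: %d\n", err);
		xdp_program__close(prog);
		return 1;
	}

	/* ... do work ... */

	/* Removes only this component; the dispatcher stays around as
	 * long as other components remain attached. */
	xdp_program__detach(prog, ifindex, XDP_MODE_NATIVE, 0);
	xdp_program__close(prog);
	return 0;
}
```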

I would generally advise people to just treat CAP_BPF as root anyway; there have been way too many privilege escalation attacks to assume otherwise...

Without going into the weeds too much: the most obvious use case to me here is load balancing/BGP. In that case, if your load balancing is going to benefit a lot from XDP, then for security's sake maybe just run your load balancers in VMs, separate from the K8s cluster. So it looks like, for the time being, you'll have to weigh that performance tradeoff. Please correct me if you disagree.

With the concerns you raise, those that remain anyway, maybe learning from Facebook's Katran load balancer is the way to go. It uses XDP. It's a project I've been meaning to explore more in depth, and it is officially listed on the ebpf.io/applications page. Even if you're using XDP for something other than load balancing, it seems they have already thought through a lot of things.

So really, what this means is that to use libxdp successfully, the container orchestration system needs to make sure each container has its own mount namespace + bpffs instance, and that it stays around for the lifetime of the container.
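
A minimal sketch of what that per-container "own mount namespace + bpffs instance" boils down to at the syscall level (assuming a single-threaded setup helper with CAP_SYS_ADMIN; /sys/fs/bpf is just the conventional mount point, not a requirement):

```c
/* Sketch: create a new mount namespace and mount a private bpffs
 * instance in it.  Needs CAP_SYS_ADMIN. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* New mount namespace for this (single-threaded) process. */
	if (unshare(CLONE_NEWNS)) {
		perror("unshare");
		return 1;
	}
	/* Keep our mounts from propagating back to the host. */
	if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL)) {
		perror("mount MS_PRIVATE");
		return 1;
	}
	/* A fresh bpffs instance, private to this mount namespace. */
	if (mount("bpf", "/sys/fs/bpf", "bpf", 0, NULL)) {
		perror("mount bpffs");
		return 1;
	}
	puts("private bpffs mounted at /sys/fs/bpf");
	return 0;
}
```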

I'm afraid that this doesn't suffice.

First, the lifetime of a container is a really difficult concept. Do we refer to the lifetime of the initial process "inside" the container? Or is it the lifetime of the container as seen only by the container engine (to honor Dan, I'm refraining from using the "d" word here)?

Do we refer to the container's file system, and if so, which view: the VFS part in the host, extending down into the pivoted mount namespace of the container? Or...?

And then we have the network namespace (sandbox), which has an independent life cycle, both from the CRI perspective and from Docker's (oh, that's the other "d" word) libnetwork perspective.

The netdev outlives the network namespace (sandbox) life cycle and roams before and after it, at least when we're talking about "physical" NIC netdevs. If a bpffs instance is tied to the container life cycle (one of them), this seems to make things worse. If it isn't, then the ifindices are moot for unique separation and lookup.

Suffice to say that my use case isn't load balancing. Yet, I still agree with @jontrossbach 😁

I'm just a tidy person who needs containers using libxdp to return the system to a clean state (a dispatcher is okay, dangling XDP progs and umems clearly are not), so that the system does not leak or block resources when a workload exits uncleanly.

Ah, that's where I took the wrong turn, assuming that the components are bound to the XDP dispatcher program. Now that components can be reclaimed automatically when the pinning ceases to exist, that would also clean up referenced XSKs, umems, etc. That's good news to me.

Can libxdp at least "recover" when it next finds a netdev with a dispatcher attached but no pinnings anymore? That is, because of a previous unclean container exit the pinning vanished along with the container's private bpffs instance, and now the netdev gets reused for a new workload?

Not sure about the mount namespace approximation, as this doesn't fit into CRI's sandbox model. Then only one container in a pod should ever use libxdp ... but that's actually already a limitation if using a container-private bpffs.

Somehow this doesn't feel quite "right" yet, and that is because of how orchestrators (the "k" word here) like to work, outsourcing the network commissioning to CNI. At the time the network for a pod gets commissioned there aren't any containers yet ... except for the sandbox, but that is absolutely implementation-dependent. So the orchestrator doesn't even have any pod-specific mount namespace yet.

So that seems to throw us back to the square labelled "IFINDEX ambiguity".

As I wrote above, configuration of a network namespace is completely separate from "spawning the containers": in Kubernetes, these are the CNI plugins, and they get called by the orchestrator with a reference to a network namespace to configure, with only the sandbox being open at that point. At this time, there are no containers yet. Docker with its libnetwork also opens the sandbox first, completely independently of any containers, and then runs either its internal "plugins" or alternatively one of the registered standalone network plugins.

This discussion (which I highly appreciate!) seems to confirm my initial "suspicion" that this is something to be done in the initial mount namespace while configuring the sandbox network namespace. Now, that brings me back to my initial issue: how to deal with the situation where libxdp sees duplicate interface indices. I'm hesitant to create, and then have to maintain, multiple mount namespaces from, say, a CNI plugin.

We already have several resources with cookies, namely network namespaces as well as containers. Why not have cookies for netdevs, too? I don't think we have them at the udev level, where we could otherwise steal them from. Unique cookies would allow libxdp to properly tell netdevs apart wherever it finds them. Would that be something that libxdp and the kernel could get in the future, by any chance?

Why? Thinking about it some more, I definitely don't think it's a good idea for everything to be in the same bpffs: that would mean that pods could trivially access the XDP programs of each other's network devices, which would defeat the whole purpose of the pod separation, no?

Agreed. But what worries me here is that we need to give "highly interesting" capabilities to those workloads so that their libxdp instances can operate correctly. The per-container bpffs instances can, if I'm not mistaken, be requested in the deployment using the general (and more powerful) mount settings of Docker and others.

The topic of privilege and potential separation came up much earlier. PROD OPS would, under these design guidelines, enforce breaking the libxdp functionality out of the workloads. If this becomes a service, then I'm again facing the situation where libxdp runs in the host, working with multiple network namespaces. At the moment this would require dynamically maintaining mount namespaces in the case of separate bpffs instances.

That's rather ugly, because mount namespaces do not go well with multi-threaded programs and services. While there are specific situations (unsharing fs attributes in a task) that work, and which I have already explored in my lxkns "mountineer" objects in Go applications, it's easy to hit the non-working cases and then basically need to run single-threaded child processes, with all the complexity of an additional layer of marshalling data back and forth, including file descriptors.

As you mentioned mount point propagation, or at least I assumed so, I pondered this feature some more. But unfortunately I don't see how it would help us. I usually find it to introduce quite some complexity, even though I specifically developed the mount namespace view in my lxkns tool to give a much better view of mount points (and overmounts) together with their propagation directions and groups than having to deal with plain /proc/PID/mountinfo. So this doesn't look like a help to me at the moment.

We already have several resources with cookies, namely network namespaces as well as containers. Why not have cookies for netdevs, too? I don't think we have them at the udev level, where we could otherwise steal them from. Unique cookies would allow libxdp to properly tell netdevs apart wherever it finds them. Would that be something that libxdp and the kernel could get in the future, by any chance?
Well, ifindexes are unique in a namespace, and interfaces are explicitly namespaced, so I dunno if a global cookie makes sense? With a compelling use case, it might happen, I suppose.
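
For reference, network namespaces already expose such a cookie to userspace on recent kernels (5.14 and later) via getsockopt(SO_NETNS_COOKIE); a small sketch, with the fallback define only there for older userspace headers:

```c
/* Sketch: read the cookie of the caller's current network namespace.
 * SO_NETNS_COOKIE is available from Linux 5.14 onwards. */
#include <stdint.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_NETNS_COOKIE
#define SO_NETNS_COOKIE 71	/* fallback for older userspace headers */
#endif

int main(void)
{
	uint64_t cookie = 0;
	socklen_t len = sizeof(cookie);
	int sock = socket(AF_UNIX, SOCK_DGRAM, 0);

	if (sock < 0)
		return 1;
	if (getsockopt(sock, SOL_SOCKET, SO_NETNS_COOKIE, &cookie, &len)) {
		perror("getsockopt(SO_NETNS_COOKIE)");
		close(sock);
		return 1;
	}
	printf("netns cookie: %llu\n", (unsigned long long)cookie);
	close(sock);
	return 0;
}
```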

Would this then be the right place here or is there a mailing list you would recommend for also giving more background information?

Sorry for chiming in so late...

So there's an existing model for using libxdp with an unprivileged container, and it is being used by the CNDP project. The model is defined through the AF_XDP Device Plugin (DP) and CNI... whereby an external entity (the DP) manages the loading and unloading of the BPF program in the host namespace and then shares the map fd with the unprivileged container via UDS...

@kot-begemot-uk and I are currently looking at integrating separate bpffs mount points per container, injected at pod allocation time, as part of an investigation we completed recently. This means that a "unique" mount point per container will be used for pinning BPF objects relevant to that pod/container. These mount points will be managed by the DP. Note: the investigation just used /tmp for POC purposes.

Regarding the ifindex not being unique: the proposed solution in the AF_XDP device plugin is to provision and maintain unique interface names that are used in both the host and the pod; the DP keeps track of these interfaces. That way an interface can be tracked even if its ifindex changes across namespaces. The correct ifindex can be retrieved in any namespace by calling if_nametoindex(ifname). The device plugin is used to load the eBPF program in the host namespace before the pod is launched. The Downward API is used to tell the application which interface has an AF_XDP BPF program loaded on it.

Current behaviour of the device plugin:
After the CNDP pod is launched, the XSK_MAP FD is passed from the device plugin to the CNDP pod (over a shared UNIX domain socket) so that it can create the AF_XDP sockets it needs without any extra privileges. This model requires that the netdev name (ifname) is fixed across the different network namespaces, so that there is a consistent map of interfaces regardless of ifindex clashes.
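
For readers unfamiliar with the mechanism, a minimal sketch (not the device plugin's actual code; the connected socket and the map fd are assumed to already exist on the privileged side) of handing a map fd to another process over a Unix domain socket with SCM_RIGHTS:

```c
/* Sketch: pass a BPF map fd over a Unix domain socket using
 * SCM_RIGHTS ancillary data.  "sock" is a connected AF_UNIX socket,
 * "map_fd" the map fd obtained by the privileged side. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int send_map_fd(int sock, int map_fd)
{
	char dummy = 'X';	/* sendmsg() needs at least one byte of data */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {			/* ensures proper cmsg alignment */
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &map_fd, sizeof(int));

	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}
```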

Work in progress:
After the CNDP pod is launched, the XSK_MAP FD is accessed by CNDP in the pod via a unique mount point: a bpffs for that pod with a pinned xsk_map. That mount point is added to the pod at pod allocation time via the device plugin. CNDP can then go ahead and create the AF_XDP sockets without needing admin privileges. Note: you will need to set appropriate seccomp policies to allow BPF syscalls in your container.
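
On the pod side, picking up the pinned map is then a single libbpf call; a sketch with a hypothetical pin path:

```c
/* Sketch: open the pinned xsk_map from the pod's private bpffs mount.
 * The pin path is made up for illustration. */
#include <bpf/bpf.h>
#include <stdio.h>

int main(void)
{
	int map_fd = bpf_obj_get("/sys/fs/bpf/xsk_map");	/* hypothetical pin path */

	if (map_fd < 0) {
		perror("bpf_obj_get");
		return 1;
	}
	printf("xsk_map fd: %d\n", map_fd);
	/* The fd can now be used to register AF_XDP sockets in the shared
	 * XSK map (e.g. via xsk_socket__update_xskmap()), without any
	 * extra privileges for program loading. */
	return 0;
}
```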

We're also looking at integrating the AF_XDP device plugin with BPFd to manage the loading and unloading of the BPF progs for us.

@maryamtahhan whoa, that sounds very much like what I was hoping for!