KhronosGroup / OpenXR-SDK-Source

Sources for OpenXR loader, basic API layers, and example code.

Home Page: https://khronos.org/openxr

on Linux, search order for layers isn't consistent with Vulkan and XDG basedir spec

smcv opened this issue

The search order for layers (and possibly runtimes) on Linux seems to have been copied from Vulkan before KhronosGroup/Vulkan-Loader#246 and KhronosGroup/Vulkan-Loader#245 were fixed in KhronosGroup/Vulkan-Loader#655. Previous discussion of this is mostly in KhronosGroup/Vulkan-Loader#245.

It would be good if OpenXR were consistent with Vulkan, and with the XDG base directory specification that Vulkan now obeys. Concretely, I believe this would mean changing ReadDataFilesInSearchPaths as follows (a sketch of the corrected order appears after the list):

  • the single directory $XDG_CONFIG_HOME (defaulting to ~/.config if unset) should be searched before (higher precedence than) the first item in $XDG_CONFIG_DIRS, but it is not currently searched at all;

  • the single directory $XDG_DATA_HOME (defaulting to ~/.local/share if unset) should be searched before (higher precedence than) the first item in $XDG_DATA_DIRS, but it is currently searched after (lower precedence than) the last item in $XDG_DATA_DIRS.

Note that $XDG_CONFIG_HOME is already searched for active_runtime(.<arch>).json in FindXDGConfigFile, with the correct ordering/precedence relative to $XDG_CONFIG_DIRS.
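
To make the intended ordering concrete, here is a minimal sketch of the search-path construction I'm proposing, highest precedence first. This is illustrative code, not the loader's actual ReadDataFilesInSearchPaths; the helper names and the fallback values for SYSCONFDIR and EXTRASYSCONFDIR are assumptions (in the real loader they are baked in at build time):

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Build-time values in the real loader; these fallbacks are only for illustration.
#ifndef SYSCONFDIR
#define SYSCONFDIR "/etc"
#endif
#ifndef EXTRASYSCONFDIR
#define EXTRASYSCONFDIR "/usr/local/etc"
#endif

static std::string GetEnvOrDefault(const char *name, const std::string &fallback) {
    const char *value = std::getenv(name);
    return (value != nullptr && value[0] != '\0') ? std::string(value) : fallback;
}

static void AppendColonSeparated(const std::string &paths, std::vector<std::string> &out) {
    std::istringstream stream(paths);
    std::string entry;
    while (std::getline(stream, entry, ':')) {
        if (!entry.empty()) out.push_back(entry);
    }
}

// Manifest directories, highest precedence first: the "config stack" followed
// by the "data stack", matching what Vulkan-Loader does since #655.
std::vector<std::string> XdgManifestSearchPaths() {
    const std::string home = GetEnvOrDefault("HOME", "/");
    std::vector<std::string> paths;

    // Config stack.
    paths.push_back(GetEnvOrDefault("XDG_CONFIG_HOME", home + "/.config"));  // missing today
    AppendColonSeparated(GetEnvOrDefault("XDG_CONFIG_DIRS", "/etc/xdg"), paths);
    paths.push_back(SYSCONFDIR);
    paths.push_back(EXTRASYSCONFDIR);

    // Data stack.
    paths.push_back(GetEnvOrDefault("XDG_DATA_HOME", home + "/.local/share"));  // searched last today
    AppendColonSeparated(GetEnvOrDefault("XDG_DATA_DIRS", "/usr/local/share:/usr/share"), paths);

    return paths;
}
```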

The Vulkan loader has a really nice summary of how layers and drivers are located, in https://github.com/KhronosGroup/Vulkan-Loader/blob/main/docs/LoaderLayerInterface.md#linux-layer-discovery and https://github.com/KhronosGroup/Vulkan-Loader/blob/main/docs/LoaderDriverInterface.md#driver-discovery-on-linux (and they are consistent with each other). I'd suggest following similar conventions (but with suffixes like vulkan/implicit_layer.d replaced by an appropriate OpenXR equivalent) rather than having OpenXR do its own thing.

If I'm reading the C++ code correctly, I believe the two changes I suggested above would mean that searching for layers and runtimes is fully consistent with searching for Vulkan layers and drivers (searching "the config stack" followed by "the data stack"), while searching for active_runtime(.<arch>).json would be consistent with Vulkan up until the end of the "config stack", but without proceeding into the "data stack" if it is not found in a config directory.

One difference between Vulkan and OpenXR where OpenXR has the advantage is that OpenXR describes a vocabulary of CPU architecture identifiers to look for active_runtime.<arch>.json.

The reason I'm interested in this is that if we want it to be possible to use OpenXR in Steam Linux Runtime containers (ValveSoftware/steam-runtime#575) with OpenXR layers or runtimes that are installed system-wide from distribution packages like .deb or .rpm, we probably need to teach the Steam Linux Runtime code to locate OpenXR's layers and runtimes, exactly like it already does for Vulkan's layers and drivers. I'd very much prefer to be able to reuse the Vulkan code with a few strings changed, rather than having to construct a whole different search path for OpenXR!

OpenXR describes a vocabulary of CPU architecture identifiers to look for

yeah, I based that in part on Debian 😉 I thought Vulkan was doing that intentionally based on my Debian system, but it appears that Mesa is just decorating its manifests in a similar way, and Vulkan is just trying to load them all, which sometimes fails because it's the wrong arch. OpenXR only has a single active runtime loaded by the loader, unlike Vulkan, which might have multiple ICDs loaded simultaneously, so we can't necessarily do the same thing; that's why we had to make that table.

Anyway...

I'm pretty confident we're searching XDG_CONFIG_HOME for runtimes, but it's entirely possible we've messed it up on the layer part. The code that does this hurts my brain a bit; it's not how I would have architected it, so I tend not to poke at it unless it's clearly broken and nobody else steps up. I'll get this synced into the gitlab and we'll see if anyone who isn't me wants to step up and do it ;)

Here's the OpenXR loader doc in case you hadn't found it. https://registry.khronos.org/OpenXR/specs/1.0/loader.html

I thought Vulkan was doing that intentionally based on my Debian system, but it appears that Mesa is just decorating its manifests in a similar way, and Vulkan is just trying to load them all, which sometimes fails because it's the wrong arch

Yes. There are basically two ways that a Vulkan driver or layer (I'll say "module" as a generic term) can work reliably (a sketch of both follows below):

  • Per architecture, like Mesa: install one JSON manifest per architecture (one per ABI, technically), with a library_path that is an absolute path, like /usr/lib/x86_64-linux-gnu/libvulkan_lvp.so on Debian. Vulkan-Loader tries to pass all of the absolute paths to dlopen(). For the ones that are the matching ABI, it works. The ones that are the wrong ABI (like /usr/lib/i386-linux-gnu/libvulkan_lvp.so for a 32-bit process) fail to load and are ignored.
  • Single shared manifest, like Nvidia: install a single JSON manifest which is shared between all ABIs, with a library_path that is just a basename, like libGLX_nvidia.so.0. Vulkan-Loader passes the basename to dlopen(), and it resolves to a different filename per architecture, typically /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.0 for a 64-bit process on Debian.

For non-x86 or non-Debian, replace /usr/lib/x86_64-linux-gnu and /usr/lib/i386-linux-gnu with whatever are the appropriate ${libdir} values for the architecture/distro pairs you're using.
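
As a sketch of why both styles behave as described, here is roughly what the loading step reduces to. This is illustrative, not Vulkan-Loader's actual code; the paths are the Debian examples from above, and the outcome of each dlopen() depends on the ABI of the running process:

```cpp
#include <dlfcn.h>
#include <cstdio>

int main() {
    // Per-architecture manifest: library_path is absolute. If this process is
    // the wrong ABI for the library (e.g. a 32-bit process and a 64-bit .so),
    // dlopen() fails and the loader skips the module.
    void *per_arch = dlopen("/usr/lib/x86_64-linux-gnu/libvulkan_lvp.so",
                            RTLD_NOW | RTLD_LOCAL);
    if (per_arch == nullptr) {
        std::printf("skipped (wrong ABI or not installed): %s\n", dlerror());
    }

    // Single shared manifest: library_path is a bare basename, so ld.so
    // resolves it against the current process's library search path and finds
    // the right ${libdir} for this ABI automatically.
    void *shared = dlopen("libGLX_nvidia.so.0", RTLD_NOW | RTLD_LOCAL);
    if (shared == nullptr) {
        std::printf("not available for this ABI: %s\n", dlerror());
    }
    return 0;
}
```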

As a variation of the per-architecture manifest, the library_path can be relative to the JSON manifest: that's just an absolute path with extra steps.
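
In code terms, resolving a manifest-relative library_path is just the following (helper name invented for illustration):

```cpp
#include <filesystem>

// Absolute paths are used as-is; relative ones are interpreted relative to
// the directory containing the JSON manifest that named them.
std::filesystem::path ResolveLibraryPath(const std::filesystem::path &manifest,
                                         const std::filesystem::path &library_path) {
    return library_path.is_absolute() ? library_path
                                      : manifest.parent_path() / library_path;
}
```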

As a variation of the single shared manifest, the library_path can be an absolute or relative path that contains dynamic string tokens like ${LIB}. This can work acceptably well if you know what single Linux distribution you're installing onto and what their convention for choosing a ${libdir} is, but upstreams rarely get that luxury.

The per-architecture manifest is the one that can work somewhat reliably if your module might be installed into a non-default prefix like ~/.local or /opt/xr or something. The single shared manifest only really works if you install it into /usr, or maybe sometimes /usr/local, so that your module is reliably found by ld.so without needing special mechanisms. That's fine for Nvidia, because their driver needs core-OS-level integration anyway for the kernel module, and it would be fine for Mesa if users and developers didn't sometimes want to install a custom/patched/bleeding-edge Mesa into a non-default prefix; but I think OpenXR is unlikely to be at that level of OS integration yet.

OpenXR only has a single active runtime loaded by the loader, unlike Vulkan, which might have multiple ICDs loaded simultaneously

Yes: if you want to be able to designate a single runtime (per architecture) and not even dlopen() the others, and your installation mechanism is using per-architecture manifests, then I agree you need the per-architecture config file.

A more Vulkan-like approach to this would be to allow dlopen()ing all the runtimes and querying them for what they support, and then have some higher-level mechanism to select one of them to be actively used and leave all the others idle. But that way you could potentially end up with conflicting support libraries in your address space, so I can see why you might not want this.

If I'm reading correctly, OpenXR runtimes are unique (only one loaded) unlike Vulkan, but OpenXR layers work like Vulkan drivers and layers (all loaded at once).

The ones that are the wrong ABI (like /usr/lib/i386-linux-gnu/libvulkan_lvp.so for a 32-bit process) fail to load and are ignored.

Recent versions of Vulkan-Loader do have a minor optimization for this: you can declare a library_arch in the JSON manifest, which is an arbitrary string but in practice takes values 32 and 64 (possibly 128 for CHERI). If you do, and it doesn't match the sizeof (void *) of the current process, then that JSON manifest is ignored, which optimizes away the step where the loader would waste time trying and failing to dlopen() a different architecture's library.
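
A minimal sketch of that fast path, assuming the check reduces to comparing the declared word size against the pointer size of the running process (the helper name is invented):

```cpp
#include <string>

// Hypothetical helper: decide whether a manifest's optional library_arch
// field matches the running process before bothering to dlopen() anything.
bool ManifestMatchesProcessArch(const std::string &library_arch) {
    if (library_arch.empty()) {
        return true;  // field absent: fall back to try-and-fail dlopen()
    }
    // In practice the field holds "32" or "64"; compare with this process.
    return library_arch == std::to_string(sizeof(void *) * 8);
}
```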

I'm pretty confident we're searching XDG_CONFIG_HOME for runtimes, but it's entirely possible we've messed it up on the layer part. The code that does this hurts my brain a bit; it's not how I would have architected it, so I tend not to poke at it unless it's clearly broken and nobody else steps up. I'll get this synced into the gitlab and we'll see if anyone who isn't me wants to step up and do it ;)

If I'm reading correctly, https://github.com/KhronosGroup/OpenXR-SDK-Source/blob/release-1.0.33/src/loader/manifest_file.cpp#L180 has the bug that I described. In the block starting at https://github.com/KhronosGroup/OpenXR-SDK-Source/blob/release-1.0.33/src/loader/manifest_file.cpp#L203, handling of $XDG_CONFIG_HOME is missing, and xdg_data_home should be moved up so it comes between EXTRASYSCONFDIR and xdg_data_dirs.

I'm unable to test this (I don't have any XR hardware) so this is all from source code inspection.

For runtimes, https://github.com/KhronosGroup/OpenXR-SDK-Source/blob/release-1.0.33/src/loader/manifest_file.cpp#L310 looks as though it correctly searches $XDG_CONFIG_HOME first, followed by $XDG_CONFIG_DIRS, SYSCONFDIR and EXTRASYSCONFDIR. That is the search order I would expect for the "config stack". If you're intentionally only searching the "config stack" for this, and not the "data stack", then it looks right.
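
For reference, a sketch of that "config stack only" behavior as I understand it. This is illustrative, not FindXDGConfigFile itself; the helper name and the openxr/1/ manifest subdirectory are assumptions from my reading, not verified against the loader:

```cpp
#include <fstream>
#include <string>
#include <vector>

// Search the "config stack" (highest precedence first) for an active runtime
// manifest, preferring the architecture-specific name.
std::string FindActiveRuntimeManifest(const std::vector<std::string> &config_stack,
                                      const std::string &arch /* e.g. "x86_64" */) {
    for (const std::string &dir : config_stack) {
        const std::string names[] = {"active_runtime." + arch + ".json",
                                     "active_runtime.json"};
        for (const std::string &name : names) {
            const std::string candidate = dir + "/openxr/1/" + name;
            if (std::ifstream(candidate).good()) {
                return candidate;  // first match wins
            }
        }
    }
    return {};  // not found: no fallback into the "data stack" today
}
```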

The "stateless distro"/"empty /etc" movement (trying to make an unconfigured system have an empty /etc, so that every file in /etc represents an intentional sysadmin change) would probably want to ask you to search for an active runtime in the "data stack" if the active_runtime(.<arch>).json isn't found in the "config stack", so that a distro can configure a system-wide default XR runtime in /usr to be used if not otherwise configured.

For my future reference, is "the gitlab" somewhere that I should have opened this issue instead, or is it private to Khronos members?

"the gitlab" is a Khronos private monorepo, yes. I imagine you should have access but it's fine to file stuff here too. Frankly, there are more people who might help here.

An issue (number 2209) has been filed to correspond to this issue in the internal Khronos GitLab (Khronos members only: KHR:openxr/openxr#2209), to facilitate working group processes.

This GitHub issue will continue to be the main site of discussion.