ValveSoftware / steam-runtime

A runtime environment for Steam applications

Regression with glibc 2.39 prereleases: fails to find shared libraries after ldconfig

gulafaran opened this issue · comments

It's been a few weeks/months since I last played games using Steam, so I'm not entirely sure what the reason for the error is, but pretty much any game I try to launch just errors with

error while loading shared libraries: libdl.so.2: cannot open shared object file: No such file or directory

As a temporary "test/workaround" I added the following at the top of _v2-entry-point:

shift 2
exec "${@}"

simply to test run the games without the runtime, and then they launch again.

Your system information

  • Steam Runtime Version: the Steam tools section says "Steam Linux Runtime 3.0 (sniper)"
  • Distribution (e.g. Ubuntu 18.04): Arch Linux
  • Link to your full system information (Help -> Steam Runtime Diagnostics) in a Gist: https://gist.github.com/gulafaran/af9c91b01f8cd2578ea0e4e1f9267e64
  • Have you checked for system updates?: Yes
  • What compatibility tool are you using?: tried all Proton versions, even the GloriousEggroll ones. Same deal.
  • What versions are listed in steamapps/common/SteamLinuxRuntime/VERSIONS.txt? Doesn't seem to exist.
  • What versions are listed in steamapps/common/SteamLinuxRuntime_soldier/VERSIONS.txt? Doesn't seem to exist.
  • What versions are listed in steamapps/common/SteamLinuxRuntime_sniper/VERSIONS.txt? https://gist.github.com/gulafaran/411fc6f60d546cd5537528709ca0b56b

STEAM_LINUX_RUNTIME_VERBOSE=1 log, in which I launch Steam, try to run a game and exit Steam. The error can be seen at line 8164:
https://gist.github.com/gulafaran/798a234f287fdb040eee6b1249879569

Steps for reproducing this issue:

  1. start steam-native or steam-runtime
  2. try to launch any game and it errors out with "error while loading shared libraries: libdl.so.2: cannot open shared object file: No such file or directory"

Extra

Worth mentioning is that I run quite a few -git packages, as in glibc and other things, so it might just be a locally screwed-up situation. But I thought I'd make a bug report and let you with more knowledge take a look.

It's meant to have picked up the libdl.so.2 from your host system as a replacement for the one originally in the runtime, which is necessary to make newer graphics drivers work:

pressure-vessel-wrap[9153]: D: overrides/lib/i386-linux-gnu/libdl.so.2 points to container-side path /run/host/usr/lib32/libdl.so.2
...
pressure-vessel-wrap[9153]: D: overrides/lib/x86_64-linux-gnu/libdl.so.2 points to container-side path /run/host/usr/lib/libdl.so.2

i run quite a few -git packages as in glibc and other things

First question: is there anything unusual about /usr/lib32/libdl.so.2 and /usr/lib/libdl.so.2 on your system? In particular, are there any symbolic links involved in their paths?

Normally I would expect that /usr, /usr/lib32 and /usr/lib are real directories (not symbolic links), and libdl.so.2 is either a regular file (not a symlink), or a relative symlink to a file with a versioned name like libdl-2.37.so in the same directory.
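One way to answer the symlink question is to walk each path component and report any that are symbolic links. The following is only a minimal sketch: the function name list_symlink_components is made up for illustration and is not part of pressure-vessel or the Steam Runtime.

```shell
#!/bin/sh
# Hypothetical helper: print every component of a path that is a symbolic
# link, together with its target. Prints nothing if no symlinks are involved.
# The body runs in a subshell so its variables do not leak to the caller.
list_symlink_components() (
    rest=${1#/}
    current=
    while [ -n "$rest" ]; do
        # Take the next component and remove it from the remainder.
        part=${rest%%/*}
        case $rest in
            */*) rest=${rest#*/} ;;
            *) rest= ;;
        esac
        # Skip empty components produced by doubled slashes.
        [ -n "$part" ] || continue
        current="$current/$part"
        if [ -L "$current" ]; then
            printf '%s -> %s\n' "$current" "$(readlink "$current")"
        fi
    done
)
```

For example, running `list_symlink_components /usr/lib32/libdl.so.2` and `list_symlink_components /usr/lib/libdl.so.2` on the host should print nothing when /usr, /usr/lib32, /usr/lib and libdl.so.2 itself are all real directories and files.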

In your log, this doesn't seem right:

/usr/lib/pressure-vessel/overrides/lib/x86_64-linux-gnu: (from /run/pressure-vessel/ldso/ld.so.conf:1)
	libGLX_mesa.so.0 -> libGLX_indirect.so.0
/usr/lib/pressure-vessel/overrides/lib/i386-linux-gnu: (from /run/pressure-vessel/ldso/ld.so.conf:2)

... because I would have expected lots of library symlinks in each of those directories inside the container, notably libdl.so.2 (further up the log, we saw that being created).

Because you mentioned you're using glibc from git, I wonder whether there has been a behaviour change in ldconfig?

Another interesting piece of information is that we were able to start a process inside the container:

pressure-vessel-wrap[9519]: D: Replacing self with pv-bwrap...
pressure-vessel-adverb[9692]: N: Enabled profiling
pressure-vessel-adverb[9692]: D: Setting up to exit when parent does
pressure-vessel-adverb[9692]: D: Found _srt_find_myself() in main executable /usr/lib/pressure-vessel/from-host/bin/pressure-vessel-adverb

That early part of the in-container code relies on LD_LIBRARY_PATH to find libraries, because at that point we haven't yet had the opportunity to set up the ld.so.cache correctly. But then, after transitioning from LD_LIBRARY_PATH to relying on ld.so.cache (which we do to avoid trouble with games that reset the LD_LIBRARY_PATH), the next executable that we launch after that can no longer find its required libraries (and the one you get an error for happens to be libdl.so.2, but I think that might just be coincidence).

I can confirm that it's definitely an issue with bleeding edge glibc. I got the same issue on Fedora Rawhide after I updated to glibc-2.38.9000-19.

By the way, how did you disable the runtime? I had to resort to using the Fl*tpak!

Thanks, that narrows this down. Are you able to find out what glibc commit that version corresponds to?

By the way, how did you disable the runtime?

Disabling the runtime is not supportable, so I would prefer it not to be something that is copy/pasted around as a recipe by users who have not worked it out for themselves and do not understand its implications.

Thanks, that narrows this down. Are you able to find out what glibc commit that version corresponds to?

Here you go!

Nothing in there jumps out at me as a particularly likely root cause. Maybe an unintended regression in "elf: ldconfig should skip temporary files created by package managers"? I'm looking into it.

@Nanotwerp, does your verbose log look the same as what @gulafaran reported? I'm particularly interested in the part starting from the first mention of pressure-vessel-adverb[xxx]: D: regenerate_ld_so_cache.

What seems to be going on here is:

Steps to reproduce

  • Have a glibc 2.39 prerelease from git
  • Launch a game that uses any Steam Linux Runtime version, either a Windows game via Proton or a native Linux game that uses SLR (for example I expect that CS2, Dota 2, Endless Sky, Retroarch will be affected)

Expected result

pressure-vessel-adverb starts up successfully, with LD_LIBRARY_PATH temporarily set to: /usr/lib/pressure-vessel/overrides/lib/x86_64-linux-gnu:/usr/lib/pressure-vessel/overrides/lib/x86_64-linux-gnu/aliases:/usr/lib/pressure-vessel/overrides/lib/i386-linux-gnu:/usr/lib/pressure-vessel/overrides/lib/i386-linux-gnu/aliases.

It should run ldconfig to regenerate ld.so.cache, then change LD_LIBRARY_PATH to just /usr/lib/pressure-vessel/overrides/lib/x86_64-linux-gnu/aliases:/usr/lib/pressure-vessel/overrides/lib/i386-linux-gnu/aliases and exec the next program in the chain, normally steam-runtime-launcher-interface-0, which should start successfully.

(These are paths inside the container, it is normal that they don't exist on the host system.)
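The change to LD_LIBRARY_PATH described above amounts to keeping only the entries that end in /aliases once ld.so.cache has been regenerated. The sketch below only illustrates that before/after relationship; it is not how pressure-vessel-adverb is implemented, and the function name keep_alias_dirs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: given a colon-separated search path, print only the
# entries that end in "/aliases", preserving their order.
keep_alias_dirs() (
    rest=$1
    result=
    while [ -n "$rest" ]; do
        # Take the next entry and remove it from the remainder.
        entry=${rest%%:*}
        case $rest in
            *:*) rest=${rest#*:} ;;
            *) rest= ;;
        esac
        case $entry in
            */aliases) result="${result:+$result:}$entry" ;;
        esac
    done
    printf '%s\n' "$result"
)
```

Passing the four-entry temporary LD_LIBRARY_PATH quoted above to keep_alias_dirs yields the two-entry value that is expected after ldconfig has run. The non-alias override directories are then supposed to be found through ld.so.cache instead, which is the step that fails with the affected glibc snapshots.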

Actual result

pressure-vessel-adverb starts up successfully, as above.

After it runs ldconfig, for a reason that is not yet understood, the libraries in /usr/lib/pressure-vessel/overrides/lib/*-linux-gnu are no longer found. It's not yet clear whether this is because they weren't found by ldconfig and added to the new ld.so.cache, or because they weren't successfully read back from the new ld.so.cache.
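To tell those two possibilities apart, it can help to look at what the cache actually contains: `ldconfig -p` prints the entries of the current ld.so.cache. The following is a rough sketch, assuming the usual `SONAME (flags) => path` line format of that output; the function name cache_paths_for is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read `ldconfig -p` output on stdin and print the
# paths recorded for the given SONAME. The exit status is 0 if at least
# one entry was found and 1 otherwise.
cache_paths_for() {
    awk -v soname="$1" '
        # Entry lines look like: "libdl.so.2 (libc6,x86-64) => /usr/lib/libdl.so.2"
        $1 == soname && $(NF-1) == "=>" { print $NF; found = 1 }
        END { exit !found }
    '
}
```

A possible use is `ldconfig -p | cache_paths_for libdl.so.2`, run in the environment whose cache is under suspicion. If no path is printed, the library was never added to the cache. If a path is printed but the dynamic linker still cannot find the library, the problem is more likely on the reading side.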

[Tracked as steamrt/tasks#357 internally.]

First question: is there anything unusual about /usr/lib32/libdl.so.2 and /usr/lib/libdl.so.2 on your system? In particular, are there any symbolic links involved in their paths?

lib32-libdl

 ┌┤acer->tom ~
 └➤ pacman -Qo /usr/lib32/libdl.so.2 
/usr/lib32/libdl.so.2 is owned by lib32-glibc-git 2.38.r234.gf957f47df7-1
 ┌┤acer->tom ~
 └➤ file /usr/lib32/libdl.so.2 
/usr/lib32/libdl.so.2: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), dynamically linked, BuildID[sha1]=f15640e8feae9612e4fbec4a15ec8c4a5dab3739, for GNU/Linux 5.15.0, stripped
 ┌┤acer->tom ~
 └➤ ls -la /usr/lib32/libdl.so.2 
-rwxr-xr-x 1 root root 13572  4 nov 14.04 /usr/lib32/libdl.so.2

libdl

 ┌┤acer->tom ~
 └➤ file /usr/lib/libdl.so.2 
/usr/lib/libdl.so.2: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=0140d0a2298c0bc29a50d217c2c8d339a74d14c7, for GNU/Linux 5.15.0, stripped
 ┌┤acer->tom ~
 └➤ ls -la /usr/lib/libdl.so.2
-rwxr-xr-x 1 root root 14176  4 nov 14.04 /usr/lib/libdl.so.2
 ┌┤acer->tom ~
 └➤ pacman -Qo /usr/lib/libdl.so.2 
/usr/lib/libdl.so.2 is owned by glibc-git 2.38.r234.gf957f47df7-1

And yeah, /usr, /usr/lib and /usr/lib32 are all real directories, and they aren't spread across a multiple-partition layout: it's all one big / "root" mount point.
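The package versions in the pacman output above also hint at which glibc commit was built. This sketch assumes the common convention for Arch -git packages, in which the version ends in `.r<commit count>.g<abbreviated hash>-<pkgrel>`; that convention and the function name pkgver_commit are assumptions, not something stated in this thread.

```shell
#!/bin/sh
# Hypothetical helper: extract the abbreviated git commit hash from an
# Arch-style -git package version such as "2.38.r234.gf957f47df7-1".
# Fails with status 1 if the version has no ".g<hash>" part.
pkgver_commit() (
    case $1 in
        *.g[0-9a-f]*) ;;
        *) exit 1 ;;
    esac
    hash=${1##*.g}    # drop everything up to and including the last ".g"
    hash=${hash%%-*}  # drop the trailing "-<pkgrel>"
    printf '%s\n' "$hash"
)
```

With the version shown above, `pkgver_commit 2.38.r234.gf957f47df7-1` prints f957f47df7, which can then be looked up in a glibc git checkout.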

@Nanotwerp, does your verbose log look the same as what @gulafaran reported? I'm particularly interested in the part starting from the first mention of pressure-vessel-adverb[xxx]: D: regenerate_ld_so_cache.

Mine seems to be pretty close to @gulafaran's.
https://gist.github.com/Nanotwerp/b2b8fdf7aa5e8b9228c35bdd26769d29

@kisak-valve, please could you retitle this to "regression with glibc 2.39 prereleases: fails to find shared libraries after ldconfig" or something, to narrow down its scope a bit?

@Nanotwerp reported that the library that wasn't found is libc.so.6 rather than libdl.so.2, but their log is otherwise similar, which I think means the specific library that we don't find is not the important factor.

Oh, I didn't even see that the missing library was changed! Mine was also libdl.so.2 when I first encountered the error.

ValveSoftware/steam-for-linux#10209 confirms that "elf: ldconfig should skip temporary files created by package managers" is the change in behavior with glibc.

Nothing in there jumps out at me as a particularly likely root cause. Maybe an unintended regression in "elf: ldconfig should skip temporary files created by package managers"? I'm looking into it.

Yes, it is. I found this commit by bisecting,
and I confirmed that after reverting it the Steam Runtime starts working again.

https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=2aa0974d2573441bffd596b07bff8698b1f2f18c

UPD: It is fixed here: https://inbox.sourceware.org/libc-alpha/87v8a83f9t.fsf@oldenburg.str.redhat.com/

Thanks for locating that! This indicates that it's a bug in the version of ldconfig provided by (some) glibc 2.39 prerelease snapshots, which needs fixing in glibc rather than in the Steam Linux Runtime. The most that the Steam Linux Runtime would be able to do about this would be a workaround.

The least-bad workaround I can think of right now would slightly slow down launching all games for all users (because we'd have to add a check in the critical path to make sure ldconfig is not the broken version), and would work around the glibc bug for all Proton games and most (but not all!) native Linux games. I would prefer not to have the unconditional slowdown if we can avoid it - I'm hoping that this will be fixed in glibc soon, so that we can assume that the broken versions basically don't exist.

@Nanotwerp: In Fedora, this should have been fixed by glibc-2.38.9000-21.fc40 (I have not yet confirmed this). Please try that version?

@gulafaran: I don't know where you got your glibc git snapshots from (https://aur.archlinux.org/packages/glibc-git seems to be outdated and https://gitlab.archlinux.org/archlinux/packaging/packages/glibc doesn't seem to have a branch that is newer than the stable release), but please try applying the patch @NTMan linked above, which is also available at https://sourceware.org/pipermail/libc-alpha/2023-November/152711.html.

Replying to #630 (comment)

(https://aur.archlinux.org/packages/glibc-git) The AUR is a bit manual when using Arch: the version listed there is simply whatever commit the PKGBUILD (which is almost a bash build script) was at when it was uploaded. Whenever I run makepkg on it, it fetches glibc directly from git, builds master and updates pkgver locally. That commit is not in the git repo yet (https://sourceware.org/git/?p=glibc.git;a=shortlog), but that doesn't stop me from applying said patch anyhow when rebuilding it :)

And with the patch applied, the games run again.

The glibc bug that caused this issue made some systems unbootable, and apparently the Fedora rawhide update that fixed it also introduced an unrelated qsort regression, so I would suggest staying with a stable release of glibc until the dust has settled.

Or, if you particularly want to be using the latest unreleased development snapshots of glibc at this early stage in their release cycle (for the adventurous only), sorry but it will be necessary to keep track of regressions like this one and apply workarounds/fixes if necessary. There are limits to what pressure-vessel/Steam Linux Runtime can do to insulate you from regressions in core system components like glibc.

Replying to #630 (comment)

Sure. Sometimes running git versions of things like this catches regressions, but sometimes it also catches changes that "downstream" might need to adapt to. This time it was a regression, but I'll continue my adventurous building and keep reporting :D Eventually I'll catch something that isn't a regression, I guess.

While trying to reproduce this on Fedora (with the broken glibc version) I've found that the same bug can also result in pressure-vessel-adverb just crashing on startup. That happens earlier than it's possible for a workaround to take effect, as a result of libdl.so.2 being missing from the host /etc/ld.so.cache (which means we end up trying to use the host libc.so.6 with Debian 11's libdl.so.2, which is incompatible).

I think that confirms that this is something that needs to be fixed in glibc, and can't usefully be worked around in pressure-vessel.

For Fedora users, glibc-2.38.9000-22.fc40 should resolve this.

For AUR or otherwise self-compiled users, the latest glibc git should hopefully resolve this: specifically, today's commit cfb5a97a "ldconfig: Fixes for skipping temporary files" is the one that's needed.

Yep, it does here. Should I close the issue?

Yes, please close it. I think we can consider this to be resolved now.

As a follow-up for this: newer versions of the container runtime (since this week's betas with pressure-vessel 0.20231128.0) have some code to help to diagnose issues like this one. When run with STEAM_LINUX_RUNTIME_VERBOSE=1, the log will now include the full contents of the ldconfig cache, which should help us to identify if required files have gone missing from the cache.