SchrodingerZhu / snmalloc-rs

rust bindings of snmalloc

failing compilation with musl

Licenser opened this issue · comments

Hi,
I'm trying to build an application using snmalloc with a statically linked musl libc instead of glibc, and during that we've been running into a few problems.

I am not 100% sure where the issue originates; from my research it seems to be related to the use of C++ and global destructors.

The error boils down to:

/usr/local/bin/../lib/gcc/x86_64-linux-musl/9.2.0/../../../../x86_64-linux-musl/bin/ld: /target/x86_64-unknown-linux-musl/debug/deps/tremor-17be01d5a5f619ac: hidden symbol `__dso_handle' isn't defined

To make it easy I've created a minimal reproduction with nothing but snmalloc-rs and a mainline function:

https://github.com/Licenser/snmalloc-musl

The error can be triggered by simply running make; it will build a musl Docker image and then try to compile the application in it.

Tagging @mjp41 for visibility, in case similar musl issues occur with snmalloc bindings in other languages.

__dso_handle is a magic symbol that's expected to be present with hidden visibility in every ELF shared library. It's passed to the C++ runtime function that handles destructors so that, when a library is unloaded, all of its destructors can be run (and all of its thread-local destructors can be skipped if the library has been unloaded). It's usually provided by the CSU parts and might be missing with musl for statically linked binaries. I don't know why musl doesn't define it, because the C++ ABI requires that it exists, even for static linking.
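
To make that concrete, here is a minimal illustrative sketch (Logger and run_dtor are made-up names, not anything from snmalloc) of how a global with a destructor ends up referencing __dso_handle: the compiler registers the destructor through __cxa_atexit and tags the registration with &__dso_handle so the runtime knows which DSO it belongs to.

#include <cstdio>

// These symbols normally come from the toolchain: __cxa_atexit from the
// C++ runtime / libc, __dso_handle from the CSU objects (e.g. crtbegin.o).
extern "C" int __cxa_atexit(void (*destructor)(void*), void* arg, void* dso_handle);
extern "C" char __dso_handle;

struct Logger
{
  ~Logger() { std::puts("global destructor runs at program/DSO teardown"); }
};
static Logger logger;

// For `logger`, the compiler effectively emits an initialiser equivalent to:
//
//   static void run_dtor(void* p) { static_cast<Logger*>(p)->~Logger(); }
//   __cxa_atexit(run_dtor, &logger, &__dso_handle);
//
// so that, on unload/exit, only the destructors tagged with this DSO's
// &__dso_handle are run.

int main() {}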

Last time I looked at musl, it didn't correctly handle library unloading even with shared libraries. It also doesn't implement __cxa_thread_atexit_impl or __cxa_thread_atexit at all, so the C++ runtime library has to fall back to using pthread_setspecific or similar for thread-local cleanup. I don't know what the implication of that is for snmalloc: it doesn't look as if musl needs to call free during thread teardown after it runs the thread-local destructors, so it's probably fine.

It is probably fine to just define something like this for static linking:

__attribute__((visibility("hidden")))
char __dso_handle;

You might actually be able to get away with a weak symbol: it doesn't matter if it resolves to 0, because the C++ runtime won't do anything with it other than equality comparisons.
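
One way to read the weak-symbol suggestion is a weak, hidden definition along these lines (a sketch assuming a GCC/Clang toolchain): if the CSU objects do provide __dso_handle, their strong definition wins at link time; if not, the weak one is used, and the runtime only ever compares its address anyway.

// Weak fallback definition: a strong definition from crtbegin.o (if present)
// takes precedence at link time.
__attribute__((weak, visibility("hidden")))
char __dso_handle;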

It would be nice if more libc implementations had the explicit 'destroy malloc state for this thread' hook that FreeBSD libc provides.

So I added

#[no_mangle]
pub static __dso_handle: u8 = 0;

to the mainline; that resolves the compilation issue but causes it to segfault:

(lldb) run
Process 547117 launched: '/home/heinz/Projects/Other/snmalloc-musl/target/x86_64-unknown-linux-musl/debug/snmalloc-musl' (x86_64)
Process 547117 stopped
* thread #1, name = 'snmalloc-musl', stop reason = signal SIGSEGV: invalid address (fault address: 0x7ff6)
    frame #0: 0x0000000000007ff6
error: memory read failed for 0x7e00
(lldb) bt
* thread #1, name = 'snmalloc-musl', stop reason = signal SIGSEGV: invalid address (fault address: 0x7ff6)
  * frame #0: 0x0000000000007ff6
    frame #1: 0x00007fffe7ddf0bf snmalloc-musl`void* snmalloc::Allocator<&(snmalloc::needs_initialisation(void*)), &(snmalloc::init_thread_allocator(snmalloc::function_ref<void* (void*)>)), snmalloc::MemoryProviderStateMixin<snmalloc::PALLinux>, snmalloc::DefaultChunkMap<snmalloc::GlobalPagemapTemplate<snmalloc::FlatPagemap<20ul, unsigned char> > >, true>::small_alloc_first_alloc<(snmalloc::ZeroMem)0, (snmalloc::AllowReserve)1>(unsigned long, unsigned long) at threadalloc.h:248:72
    frame #2: 0x00007fffe7ddf098 snmalloc-musl`void* snmalloc::Allocator<&(snmalloc::needs_initialisation(void*)), &(snmalloc::init_thread_allocator(snmalloc::function_ref<void* (void*)>)), snmalloc::MemoryProviderStateMixin<snmalloc::PALLinux>, snmalloc::DefaultChunkMap<snmalloc::GlobalPagemapTemplate<snmalloc::FlatPagemap<20ul, unsigned char> > >, true>::small_alloc_first_alloc<(snmalloc::ZeroMem)0, (snmalloc::AllowReserve)1>(unsigned long, unsigned long) at threadalloc.h:292
    frame #3: 0x00007fffe7ddf098 snmalloc-musl`void* snmalloc::Allocator<&(snmalloc::needs_initialisation(void*)), &(snmalloc::init_thread_allocator(snmalloc::function_ref<void* (void*)>)), snmalloc::MemoryProviderStateMixin<snmalloc::PALLinux>, snmalloc::DefaultChunkMap<snmalloc::GlobalPagemapTemplate<snmalloc::FlatPagemap<20ul, unsigned char> > >, true>::small_alloc_first_alloc<(this=<unavailable>, sizeclass=0, size=4)0, (snmalloc::AllowReserve)1>(unsigned long, unsigned long) at alloc.h:1142
    frame #4: 0x00007fffe7d8e684 snmalloc-musl`__rg_alloc [inlined] _$LT$snmalloc_rs..SnMalloc$u20$as$u20$core..alloc..global..GlobalAlloc$GT$::alloc::h0b33f8bdc654991c(self=0x00007fffe7de1870, layout=Layout @ 0x00007fffffffd650) at lib.rs:47:9
    frame #5: 0x00007fffe7d8e65d snmalloc-musl`__rg_alloc(arg0=4, arg1=1) at main.rs:2
    frame #6: 0x00007fffe7da3ea5 snmalloc-musl`std::rt::lang_start_internal::h53bf2c2c6c9e97dc [inlined] alloc::alloc::alloc::hc943e1a883a741c4 at alloc.rs:84:14
    frame #7: 0x00007fffe7da3e95 snmalloc-musl`std::rt::lang_start_internal::h53bf2c2c6c9e97dc [inlined] alloc::alloc::Global::alloc_impl::h2bac69f5a7e8918b at alloc.rs:164
    frame #8: 0x00007fffe7da3e95 snmalloc-musl`std::rt::lang_start_internal::h53bf2c2c6c9e97dc [inlined] _$LT$alloc..alloc..Global$u20$as$u20$core..alloc..AllocRef$GT$::alloc::h7513b9caadea9287 at alloc.rs:224
    frame #9: 0x00007fffe7da3e95 snmalloc-musl`std::rt::lang_start_internal::h53bf2c2c6c9e97dc [inlined] alloc::raw_vec::RawVec$LT$T$C$A$GT$::allocate_in::hf472df56c401029c at raw_vec.rs:189
    frame #10: 0x00007fffe7da3e95 snmalloc-musl`std::rt::lang_start_internal::h53bf2c2c6c9e97dc [inlined] alloc::raw_vec::RawVec$LT$T$C$A$GT$::with_capacity_in::ha57ededbd232e7a3 at raw_vec.rs:130
    frame #11: 0x00007fffe7da3e95 snmalloc-musl`std::rt::lang_start_internal::h53bf2c2c6c9e97dc [inlined] alloc::raw_vec::RawVec$LT$T$GT$::with_capacity::h8e52d47795fd752d at raw_vec.rs:93
    frame #12: 0x00007fffe7da3e95 snmalloc-musl`std::rt::lang_start_internal::h53bf2c2c6c9e97dc [inlined] alloc::vec::Vec$LT$T$GT$::with_capacity::h7fca68cb5cfe8e92 at vec.rs:363
    frame #13: 0x00007fffe7da3e95 snmalloc-musl`std::rt::lang_start_internal::h53bf2c2c6c9e97dc [inlined] alloc::slice::hack::to_vec::h118ebe2f38ae973e at slice.rs:159
    frame #14: 0x00007fffe7da3e95 snmalloc-musl`std::rt::lang_start_internal::h53bf2c2c6c9e97dc [inlined] alloc::slice::_$LT$impl$u20$$u5b$T$u5d$$GT$::to_vec::h6bb26fd0925afb64 at slice.rs:395
    frame #15: 0x00007fffe7da3e95 snmalloc-musl`std::rt::lang_start_internal::h53bf2c2c6c9e97dc [inlined] alloc::slice::_$LT$impl$u20$alloc..borrow..ToOwned$u20$for$u20$$u5b$T$u5d$$GT$::to_owned::h206c5471a834f4cd at slice.rs:728
    frame #16: 0x00007fffe7da3e95 snmalloc-musl`std::rt::lang_start_internal::h53bf2c2c6c9e97dc [inlined] alloc::str::_$LT$impl$u20$alloc..borrow..ToOwned$u20$for$u20$str$GT$::to_owned::h9d7e3d52bbc9568d at str.rs:205
    frame #17: 0x00007fffe7da3e95 snmalloc-musl`std::rt::lang_start_internal::h53bf2c2c6c9e97dc at rt.rs:44
    frame #18: 0x00007fffe7d8e597 snmalloc-musl`std::rt::lang_start::h41f2d223c97aece7(main=(snmalloc-musl`snmalloc_musl::main::hbdf7eca5d9ede0a9 at main.rs:7), argc=1, argv=0x00007fffffffd808) at rt.rs:65:5
    frame #19: 0x00007fffe7d8e85a snmalloc-musl`main + 42
    frame #20: 0x00007fffe7dc6afb snmalloc-musl`libc_start_main_stage2 + 39
    frame #21: 0x00007fffe7d8e097 snmalloc-musl`_start + 22

It looks as if this is in the registration of the TLS destructor for OnDestruct. Is it possible for you to break on __cxa_thread_atexit and see what fails? The crash, unfortunately, looks as if something has been tail called and so some of the interesting stack frame is gone. You may need to compile musl and the C++ runtime library (libsupc++ / libc++abi / libcxxrt) in debug mode as well.

I've spent a lot of effort making sure things get tail called ;-)

I think this is inside the C++ runtime, not yours.

It looks like this would be the first place OnDestruct is initialised, as this is the main thread, so nothing else will have hit this yet. But we are probably going to need a debug C++ runtime and musl.

Can we build the snmalloc-rs crate against musl?

(screenshot of the Rust standard library code that handles this)

This is how Rust handles the issue. I don't see anything different being done for musl, or does it just fall back?

It looks as if this is in the registration of the TLS destructor for OnDestruct. Is it possible for you to break on __cxa_thread_atexit and see what fails? The crash, unfortunately, looks as if something has been tail called and so some of the interesting stack frame is gone. You may need to compile musl and the C++ runtime library (libsupc++ / libc++abi / libcxxrt) in debug mode as well.

Out of the box, sadly, that doesn't seem to be defined; lldb doesn't find that breakpoint.

Can we build the snmalloc-rs crate against musl?

Yep, I created a branch that includes the bits needed for musl: https://github.com/Licenser/snmalloc-rs/tree/musl

cargo install cross
cross test --target x86_64-unknown-linux-musl

This runs the tests with musl and gives the same results as the example repo.

Is it possible to link this with a debug build of musl and whatever C++ runtime you need (or can you install packages for these with separate debug info, if they're dynamically linked)? The bug seems to be in the interaction between clang-generated code from standard C++ language constructs, calling standard Itanium C++ runtime functions, and their implementations in the C++ runtime and libc.

Though it may be irrelevant: in my build script I explicitly required libstdc++ to be dynamically linked. I should have that fixed.

It looks as if Rust does this by using pthread_setspecific to register the TLS destructor. We could add an alternative implementation of ThreadAlloc that does this for musl and other pre-2011 C++ stacks. I'm slightly hesitant to do that because some pthread_setspecific implementations call malloc, but I wouldn't have any objection to taking a patch to do it that's sufficiently guarded behind #ifdefs; it's non-invasive.
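
For reference, the pattern in question looks roughly like this (an illustrative sketch, not snmalloc's or Rust's actual code; register_thread_cleanup and teardown are made-up names). One detail worth keeping in mind: pthreads only invokes a key's destructor at thread exit if that thread's value for the key is non-NULL, so something has to call pthread_setspecific with a non-NULL value.

#include <pthread.h>

static pthread_key_t teardown_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

// Runs at thread exit for every thread that set a non-NULL value on the key.
static void teardown(void* per_thread_state)
{
  // Release the calling thread's allocator state here.
  (void)per_thread_state;
}

static void make_key()
{
  pthread_key_create(&teardown_key, &teardown);
}

void register_thread_cleanup(void* per_thread_state)
{
  pthread_once(&key_once, &make_key);
  // Setting a non-NULL value is what makes pthreads call `teardown` on exit.
  pthread_setspecific(teardown_key, per_thread_state);
}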

So we used to have a version that used pthread_setspecific. That should work even with allocation; the call would be inside:

https://github.com/microsoft/snmalloc/blob/a3660c40690252e6bab884108f82e1ca822bc458/src/mem/threadalloc.h#L279-L295

in the register_cleanup call. That is called after the allocator has been allocated, and added to a TLS variable. This means it is actually fine to re-enter the allocator inside the register_cleanup code.

There is an awkward corner case that the destructor code for the thread has cleaned up the allocator, but other destructors will still call the allocator. This is handled by having a thread local that knows whether the destructor has run, and then moves to a very slow path for all allocations and deallocations.
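
In other words, something along these lines (an illustrative sketch with made-up names; std::malloc stands in for snmalloc's real fast and slow paths):

#include <cstdlib>

// Set by the TLS destructor once the per-thread allocator has been released.
static thread_local bool destructor_has_run = false;

static void* fast_path_alloc(std::size_t size)
{
  return std::malloc(size); // stand-in for the per-thread fast path
}

static void* very_slow_path_alloc(std::size_t size)
{
  // Stand-in for the path that still works after thread teardown, e.g. one
  // that services late destructors without the per-thread allocator.
  return std::malloc(size);
}

void* thread_alloc(std::size_t size)
{
  return destructor_has_run ? very_slow_path_alloc(size) : fast_path_alloc(size);
}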

Happy to help if someone wants to do the work.

The pthread version was removed in
microsoft/snmalloc@4ce371f

If we handle this ourselves, I think it will also be a great workaround for the MinGW bug addressed in the open snmalloc PR.

microsoft/snmalloc#217
This would address that. I am not keen to add too many options, as it makes test coverage harder.

So I think you would need to do something like:

void* pthread_make_key()
{
    pthread_key_t key;
    pthread_key_create(&key, ThreadAllocCommon::inner_release);
    return nullptr;
}

and then

    static bool register_cleanup()
    {
#ifndef SNMALLOC_USING_PTHREAD
      static thread_local OnDestruct<ThreadAllocCommon::inner_release> tidier;
#else
      // Don't need pthread_key_t as only want to use singleton to register destructor
      Singleton<void*, pthread_make_key>::get();
#endif

      ThreadAllocCommon::register_cleanup();

      return destructor_has_run;
    }

The other option, of course, is for someone to contribute a __cxa_thread_atexit_impl to musl. It's probably around a dozen lines of code and would fix this for everyone.
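
For a sense of scale, here's a hedged sketch of roughly what such a contribution might look like (not musl's code; a real libc would keep the list head in its thread descriptor, and run_thread_dtors is a hypothetical stand-in for the hook its thread-exit path would call before tearing down TLS storage):

#include <cstdlib>

namespace
{
  struct dtor_entry
  {
    void (*dtor)(void*);
    void* obj;
    void* dso;
    dtor_entry* next;
  };

  // A real libc would store this in its per-thread descriptor.
  thread_local dtor_entry* dtor_list = nullptr;
}

extern "C" int __cxa_thread_atexit_impl(void (*dtor)(void*), void* obj, void* dso)
{
  auto* e = static_cast<dtor_entry*>(std::malloc(sizeof(dtor_entry)));
  if (e == nullptr)
    return -1;
  *e = {dtor, obj, dso, dtor_list};
  dtor_list = e;
  return 0;
}

// Hypothetical hook: called by the thread-exit path, before TLS is torn down,
// to run the registered destructors in reverse order of registration.
extern "C" void run_thread_dtors()
{
  while (dtor_list != nullptr)
  {
    dtor_entry* e = dtor_list;
    dtor_list = e->next;
    e->dtor(e->obj);
    std::free(e);
  }
}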

Would that fix MinGW? But fixing musl would be good.

Would that fix MinGW? But fixing musl would be good.

Yes. Basically, the MinGW problem is caused by being unable to free the thread_local static structs correctly. I believe that if we use the pthread_make_key approach, the problem can be solved.

I believe that if we use the pthread_make_key approach, the problem can be solved.

@mjp41 :( well, it was too early for me to have that belief... After several trials, I still cannot fix MinGW support even with pthread_key. That being said, I think the musl problem should be okay.

@Licenser is the musl problem still an issue for you? We could try this as a workaround until musl is fixed.

Yes, but not with super high priority. It prevents us from making statically linked tremor binaries, which is something we'd love to have. OTOH, musl was significantly slower for us in the past when builds were working, so they'd just be an addition, not 'the' release strategy.

I think it's totally fine to wait for a proper fix and not put a workaround on your plate :) we're already super thankful that you all got to the bottom of this!