Racy test failure: segfault in map::tree_bins::concurrent_tree_bin
jonhoo opened this issue · comments
This is a failure I've factored out of an earlier issue into its own issue. Basically, the map::tree_bins::concurrent_tree_bin
test occasionally segfaults for me without a backtrace. I can reproduce on current nightly on Linux by running this command for a while:
$ while cargo test --lib map::tree_bins::concurrent_tree_bin -- --test-threads=1 --nocapture; do :; done
Using gdb, I managed to capture a stack trace:
Thread 21 "flurry-27205002" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5632700 (LWP 349291)]
std::thread::Thread::unpark () at src/libstd/thread/mod.rs:1191
1191 src/libstd/thread/mod.rs: No such file or directory.
(gdb) bt
#0 std::thread::Thread::unpark () at src/libstd/thread/mod.rs:1191
#1 0x00005555555eadb2 in flurry::node::TreeBin<K,V>::find (bin=..., hash=0, key=0x7ffff56317f8, guard=0x7ffff56317b8) at src/node.rs:472
#2 0x00005555555e3594 in flurry::raw::Table<K,V>::find (self=0x5555557839d0, bin=0x555555784f70, hash=0, key=0x7ffff56317f8, guard=0x7ffff56317b8) at src/raw/mod.rs:174
#3 0x00005555555bc56c in flurry::map::HashMap<K,V,S>::get_node (self=0x555555780b40, key=0x7ffff56317f8, guard=0x7ffff56317b8) at src/map.rs:1314
#4 0x00005555555bcd1e in flurry::map::HashMap<K,V,S>::get (self=0x555555780b40, key=0x7ffff56317f8, guard=0x7ffff56317b8) at src/map.rs:1387
#5 0x000055555559540e in flurry::map::tree_bins::concurrent_tree_bin::{{closure}} () at src/map.rs:3406
#6 0x00005555555abe91 in std::sys_common::backtrace::__rust_begin_short_backtrace (f=...) at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/sys_common/backtrace.rs:130
#7 0x000055555558f4a1 in std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}} () at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/thread/mod.rs:475
#8 0x00005555555d6db1 in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (self=..., _args=()) at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/panic.rs:318
#9 0x00005555555a353a in std::panicking::try::do_call (data=0x7ffff5631998 "0\vxUUU\000") at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/panicking.rs:331
#10 0x00005555555a370d in __rust_try ()
#11 0x00005555555a3393 in std::panicking::try (f=...) at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/panicking.rs:274
#12 0x00005555555d6e31 in std::panic::catch_unwind (f=...) at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/panic.rs:394
#13 0x000055555558ec19 in std::thread::Builder::spawn_unchecked::{{closure}} () at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/thread/mod.rs:474
#14 0x00005555555875ae in core::ops::function::FnOnce::call_once{{vtable-shim}} () at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/ops/function.rs:232
#15 0x00005555556cf61f in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/94d346360da50f159e0dc777dc9bc3c5b6b51a00/src/liballoc/boxed.rs:1008
#16 0x00005555556e25b3 in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/94d346360da50f159e0dc777dc9bc3c5b6b51a00/src/liballoc/boxed.rs:1008
#17 std::sys::unix::thread::Thread::new::thread_start () at src/libstd/sys/unix/thread.rs:87
#18 0x00007ffff7f7746f in start_thread () from /usr/lib/libpthread.so.0
#19 0x00007ffff7e8d3d3 in clone () from /usr/lib/libc.so.6
Also unable to reproduce on Windows stable as of now. The location makes me think of the "race" discussed in #72 (review) as a likely candidate. We talked about how the tokens are required because they might have to be available before the waiting thread parks. However, your concern there may still be valid: the reading thread can observe the stored WAITER bit in lock_state and load the stored waiter as non-null, but in the meantime the writing thread re-checks lock_state and not only swaps out the waiter, but cleans it up immediately with into_owned. That cleanup might need to be defer_destroy instead, to handle the above case.
I need to leave now and will have to come back to this (and maybe setup a Linux nightly for testing). If you have time, maybe try this out in the meantime.
Hmm, I wonder why the Java code does not have to deal with that...
If this ends up being the cause, it would be because the reading thread holds a reference to the Thread handle in question, so it cannot get GC'd (they don't [have to] use atomic pointers)
Ah, that's a good point. Let me try making that a deferred destroy.