Racy test failure: segfault in map::tree_bins::concurrent_tree_bin
jonhoo opened this issue · comments
This is a failure I've factored out of an earlier issue into its own issue. Basically, the map::tree_bins::concurrent_tree_bin
test occasionally segfaults for me without a backtrace. I can reproduce on current nightly on Linux by running this command for a while:
$ while cargo test --lib map::tree_bins::concurrent_tree_bin -- --test-threads=1 --nocapture; do :; done
Using gdb, I managed to capture a stack trace:
Thread 21 "flurry-27205002" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5632700 (LWP 349291)]
std::thread::Thread::unpark () at src/libstd/thread/mod.rs:1191
1191 src/libstd/thread/mod.rs: No such file or directory.
(gdb) bt
#0 std::thread::Thread::unpark () at src/libstd/thread/mod.rs:1191
#1 0x00005555555eadb2 in flurry::node::TreeBin<K,V>::find (bin=..., hash=0, key=0x7ffff56317f8, guard=0x7ffff56317b8) at src/node.rs:472
#2 0x00005555555e3594 in flurry::raw::Table<K,V>::find (self=0x5555557839d0, bin=0x555555784f70, hash=0, key=0x7ffff56317f8, guard=0x7ffff56317b8) at src/raw/mod.rs:174
#3 0x00005555555bc56c in flurry::map::HashMap<K,V,S>::get_node (self=0x555555780b40, key=0x7ffff56317f8, guard=0x7ffff56317b8) at src/map.rs:1314
#4 0x00005555555bcd1e in flurry::map::HashMap<K,V,S>::get (self=0x555555780b40, key=0x7ffff56317f8, guard=0x7ffff56317b8) at src/map.rs:1387
#5 0x000055555559540e in flurry::map::tree_bins::concurrent_tree_bin::{{closure}} () at src/map.rs:3406
#6 0x00005555555abe91 in std::sys_common::backtrace::__rust_begin_short_backtrace (f=...) at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/sys_common/backtrace.rs:130
#7 0x000055555558f4a1 in std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}} () at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/thread/mod.rs:475
#8 0x00005555555d6db1 in <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (self=..., _args=()) at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/panic.rs:318
#9 0x00005555555a353a in std::panicking::try::do_call (data=0x7ffff5631998 "0\vxUUU\000") at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/panicking.rs:331
#10 0x00005555555a370d in __rust_try ()
#11 0x00005555555a3393 in std::panicking::try (f=...) at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/panicking.rs:274
#12 0x00005555555d6e31 in std::panic::catch_unwind (f=...) at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/panic.rs:394
#13 0x000055555558ec19 in std::thread::Builder::spawn_unchecked::{{closure}} () at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/thread/mod.rs:474
#14 0x00005555555875ae in core::ops::function::FnOnce::call_once{{vtable-shim}} () at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/ops/function.rs:232
#15 0x00005555556cf61f in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/94d346360da50f159e0dc777dc9bc3c5b6b51a00/src/liballoc/boxed.rs:1008
#16 0x00005555556e25b3 in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/94d346360da50f159e0dc777dc9bc3c5b6b51a00/src/liballoc/boxed.rs:1008
#17 std::sys::unix::thread::Thread::new::thread_start () at src/libstd/sys/unix/thread.rs:87
#18 0x00007ffff7f7746f in start_thread () from /usr/lib/libpthread.so.0
#19 0x00007ffff7e8d3d3 in clone () from /usr/lib/libc.so.6
Also unable to reproduce on Windows stable as of now. The location makes me think of the "race" discussed in #72 (review) as a likely candidate. We talked about how the tokens are required because they might have to be available before the waiting thread parks. However, your concern there may still be valid: the reading thread can observe the stored WAITER bit in lock_state and load the stored waiter as non-null, but in the meantime the writing thread re-checks lock_state and not only swaps out the waiter, but cleans it up immediately with into_owned. That cleanup might need to be defer_destroy instead, to handle the above case.
I need to leave now and will have to come back to this (and maybe setup a Linux nightly for testing). If you have time, maybe try this out in the meantime.
Hmm, I wonder why the Java code does not have to deal with that...
If this ends up being the cause, it would be because the reading thread holds a reference to the Thread handle in question, so it cannot get GC'd (they don't [have to] use atomic pointers)
Ah, that's a good point. Let me try making that a deferred destroy.