Racy test failure: treeifying a Moved entry
jonhoo opened this issue · comments
$ cargo test --test jdk concurrent_associate::test_concurrent_insert
thread '<unnamed>' panicked at 'internal error: entered unreachable code: treeifying a Moved entry', /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/macros.rs:16:9
stack backtrace:
0: backtrace::backtrace::libunwind::trace
at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
1: backtrace::backtrace::trace_unsynchronized
at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
2: std::sys_common::backtrace::_print_fmt
at src/libstd/sys_common/backtrace.rs:78
3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
at src/libstd/sys_common/backtrace.rs:59
4: core::fmt::write
at src/libcore/fmt/mod.rs:1069
5: std::io::Write::write_fmt
at src/libstd/io/mod.rs:1504
6: std::sys_common::backtrace::_print
at src/libstd/sys_common/backtrace.rs:62
7: std::sys_common::backtrace::print
at src/libstd/sys_common/backtrace.rs:49
8: std::panicking::default_hook::{{closure}}
at src/libstd/panicking.rs:198
9: std::panicking::default_hook
at src/libstd/panicking.rs:218
10: std::panicking::rust_panic_with_hook
at src/libstd/panicking.rs:511
11: rust_begin_unwind
at src/libstd/panicking.rs:419
12: std::panicking::begin_panic_fmt
at src/libstd/panicking.rs:373
13: flurry::map::HashMap<K,V,S>::treeify_bin
at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/macros.rs:16
14: flurry::map::HashMap<K,V,S>::put
at ./src/map.rs:1952
15: flurry::map::HashMap<K,V,S>::insert
at ./src/map.rs:1625
16: jdk::concurrent_associate::insert
at tests/jdk/concurrent_associate.rs:24
17: core::ops::function::Fn::call
at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/ops/function.rs:72
18: jdk::concurrent_associate::test_once::{{closure}}
at tests/jdk/concurrent_associate.rs:52
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test concurrent_associate::test_concurrent_insert ... FAILED
The map::tree_bins::concurrent_tree_bin
test also occasionally segfaults for me without a backtrace. Not sure if that traces back to the same issue, but it's something we should definitely look into! I can reproduce on current nightly on Linux by running this command for a while:
$ while cargo test --lib map::tree_bins::concurrent_tree_bin -- --test-threads=1 --nocapture; do :; done
I think it reproduces more frequently if you are also running other things in the background to cause more scheduler noise (all the more reason for us to get #34 :p)
I factored out the segfault into #84
So far unable to reproduce on Windows stable. The only place where treeify_bin
is called is in put
. I don't think the note there about the difference compared to the Java code matters here, since bin_count
has to satisfy >= TREEIFY_THRESHOLD
.
The Java code also checks this condition inside the case for non-MOVED
nodes, but outside the synchronized
section. So hitting MOVED
specifically is not possible in the Java code. But I think technically we can transfer
before we re-load the bin in treeify_bin
?
Bah, just got another crash that seems related with #85 applied:
thread '<unnamed>' panicked at 'internal error: entered unreachable code: treeifying a Tree', /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/macros.rs:16:9
stack backtrace:
...
13: flurry::map::HashMap<K,V,S>::treeify_bin
at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/macros.rs:16
14: flurry::map::HashMap<K,V,S>::put
at ./src/map.rs:1952
15: flurry::map::HashMap<K,V,S>::insert
at ./src/map.rs:1625
16: jdk::concurrent_associate::insert
at tests/jdk/concurrent_associate.rs:24
17: core::ops::function::Fn::call
at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/ops/function.rs:72
18: jdk::concurrent_associate::test_once::{{closure}}
at tests/jdk/concurrent_associate.rs:52
I'm still on the road, so no tests. But the more I look at this, the more I think there isn't actually a problem. If you look at the original implementation, you'll see that it doesn't have the panics, as the Java code also doesn't trigger a failure there. They were then added during review so we notice if we try to treeify something that is not a tree bin.
Now, due to the resoning from my last comment, in the Java code it must also be possible to call treeifyBin
with something other than a regular Node
(also I think they would have noticed if the synchronization was a problem there). Of course you could force the call to treeifyBin
to happen inside the critical section by executing it and the comparison inside the synchronized
(while holding the lock). But let's think about the cases you encountered:
- One thread inserts a new entry and passes the
TREEIFY_THRESHOLD
, but a different thread executes a move of that bin. Under the assumption that the treefication is always forced, it is very likely that the move will split the bin in question into two smaller bins, both below the threshold, and has to untreeify the bin again. If it is not, and due to scheduling the move happens first, it is fine for the first thread to see aMOVED
upon re-check and just not treeify the bin. - In the same scenario, if there is one insert passing the threshold and then another insert before treeification happens which also gets to treeifying the bin (due to scheduling), then the bin is already treeified for the first thread and it is again fine not to
treefy_bin
. Of course there is a tradeoff here were the second insert would already have happend into a tree bin if we forced treeification under the lock, but the second thread would also have to wait for the lock before the insert.
So I think the semantics of the Java code here are that of a rather subtle else { // pass }
, and the issue is that we made this crash on Moved
and TreeBin
(TreeNode
is fine, since it shouldn't be the head of a bin). I suggest matching in the loaded bin in treeify_bin
with the existing treeify case for Node
, two empty cases for TreeBin
and Moved
with an explanation and keeping the panic only for TreeNode
.
Ah, that's interesting. Yeah, I think I buy it. Would you mind filing a PR that replaces those assertions with comments outlining what you wrote before for why we don't want to assert?
I can confirm that this was fixed by #87 after having run it in a loop for a while.