Racy test failure: treeifying a Moved entry

Question

Racy test failure: treeifying a Moved entry

jonhoo opened this issue 4 years ago · comments

$ cargo test --test jdk concurrent_associate::test_concurrent_insert
thread '<unnamed>' panicked at 'internal error: entered unreachable code: treeifying a Moved entry', /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/macros.rs:16:9
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:78
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1069
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1504
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:198
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:218
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:511
  11: rust_begin_unwind
             at src/libstd/panicking.rs:419
  12: std::panicking::begin_panic_fmt
             at src/libstd/panicking.rs:373
  13: flurry::map::HashMap<K,V,S>::treeify_bin
             at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/macros.rs:16
  14: flurry::map::HashMap<K,V,S>::put
             at ./src/map.rs:1952
  15: flurry::map::HashMap<K,V,S>::insert
             at ./src/map.rs:1625
  16: jdk::concurrent_associate::insert
             at tests/jdk/concurrent_associate.rs:24
  17: core::ops::function::Fn::call
             at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/ops/function.rs:72
  18: jdk::concurrent_associate::test_once::{{closure}}
             at tests/jdk/concurrent_associate.rs:52
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test concurrent_associate::test_concurrent_insert ... FAILED

Jon Gjengset · Answer 1 · Fri Apr 10 2020 23:09:04 GMT+0800 (China Standard Time)

The map::tree_bins::concurrent_tree_bin test also occasionally segfaults for me without a backtrace. Not sure if that traces back to the same issue, but it's something we should definitely look into! I can reproduce on current nightly on Linux by running this command for a while:

$ while cargo test --lib map::tree_bins::concurrent_tree_bin -- --test-threads=1 --nocapture; do :; done

Jon Gjengset · Answer 2 · Fri Apr 10 2020 23:09:34 GMT+0800 (China Standard Time)

I think it reproduces more frequently if you are also running other things in the background to cause more scheduler noise (all the more reason for us to get #34 :p)

Jon Gjengset · Answer 3 · Fri Apr 10 2020 23:25:01 GMT+0800 (China Standard Time)

I factored out the segfault into #84

DQ · Answer 4 · Fri Apr 10 2020 23:42:11 GMT+0800 (China Standard Time)

So far unable to reproduce on Windows stable. The only place where treeify_bin is called is in put. I don't think the note there about the difference compared to the Java code matters here, since bin_count has to satisfy >= TREEIFY_THRESHOLD.

The Java code also checks this condition inside the case for non-MOVED nodes, but outside the synchronized section. So hitting MOVED specifically is not possible in the Java code. But I think technically we can transfer before we re-load the bin in treeify_bin?

Jon Gjengset · Answer 5 · Sat Apr 11 2020 01:18:16 GMT+0800 (China Standard Time)

Bah, just got another crash that seems related with #85 applied:

thread '<unnamed>' panicked at 'internal error: entered unreachable code: treeifying a Tree', /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/macros.rs:16:9
stack backtrace:
...
  13: flurry::map::HashMap<K,V,S>::treeify_bin
             at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/macros.rs:16
  14: flurry::map::HashMap<K,V,S>::put
             at ./src/map.rs:1952
  15: flurry::map::HashMap<K,V,S>::insert
             at ./src/map.rs:1625
  16: jdk::concurrent_associate::insert
             at tests/jdk/concurrent_associate.rs:24
  17: core::ops::function::Fn::call
             at /home/jon/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libcore/ops/function.rs:72
  18: jdk::concurrent_associate::test_once::{{closure}}
             at tests/jdk/concurrent_associate.rs:52

DQ · Answer 6 · Sat Apr 11 2020 18:28:54 GMT+0800 (China Standard Time)

I'm still on the road, so no tests. But the more I look at this, the more I think there isn't actually a problem. If you look at the original implementation, you'll see that it doesn't have the panics, as the Java code also doesn't trigger a failure there. They were then added during review so we notice if we try to treeify something that is not a tree bin.

Now, due to the resoning from my last comment, in the Java code it must also be possible to call treeifyBin with something other than a regular Node (also I think they would have noticed if the synchronization was a problem there). Of course you could force the call to treeifyBin to happen inside the critical section by executing it and the comparison inside the synchronized (while holding the lock). But let's think about the cases you encountered:

One thread inserts a new entry and passes the TREEIFY_THRESHOLD, but a different thread executes a move of that bin. Under the assumption that the treefication is always forced, it is very likely that the move will split the bin in question into two smaller bins, both below the threshold, and has to untreeify the bin again. If it is not, and due to scheduling the move happens first, it is fine for the first thread to see a MOVED upon re-check and just not treeify the bin.
In the same scenario, if there is one insert passing the threshold and then another insert before treeification happens which also gets to treeifying the bin (due to scheduling), then the bin is already treeified for the first thread and it is again fine not to treefy_bin. Of course there is a tradeoff here were the second insert would already have happend into a tree bin if we forced treeification under the lock, but the second thread would also have to wait for the lock before the insert.

So I think the semantics of the Java code here are that of a rather subtle else { // pass }, and the issue is that we made this crash on Moved and TreeBin (TreeNode is fine, since it shouldn't be the head of a bin). I suggest matching in the loaded bin in treeify_bin with the existing treeify case for Node, two empty cases for TreeBin and Moved with an explanation and keeping the panic only for TreeNode.

Jon Gjengset · Answer 7 · Sun Apr 12 2020 06:23:18 GMT+0800 (China Standard Time)

Ah, that's interesting. Yeah, I think I buy it. Would you mind filing a PR that replaces those assertions with comments outlining what you wrote before for why we don't want to assert?

Jon Gjengset · Answer 8 · Tue Apr 14 2020 00:46:30 GMT+0800 (China Standard Time)

I can confirm that this was fixed by #87 after having run it in a loop for a while.