map::tree_bins::concurrent_tree_bin: attempt to subtract with overflow

Question

jonhoo opened this issue 4 years ago

Hit two of these at the same time. This is after #85.

test map::tree_bins::concurrent_tree_bin ...
thread '<unnamed>' panicked at 'attempt to subtract with overflow', src/map.rs:1163:17
stack backtrace:
...
  13: core::panicking::panic
             at src/libcore/panicking.rs:54
  14: flurry::map::HashMap<K,V,S>::add_count
             at src/map.rs:1163
  15: flurry::map::HashMap<K,V,S>::replace_node
             at src/map.rs:2626
  16: flurry::map::HashMap<K,V,S>::remove
             at src/map.rs:2366
  17: flurry::map::tree_bins::concurrent_tree_bin::{{closure}}
             at src/map.rs:3429

thread '<unnamed>' panicked at 'attempt to add with overflow', src/map.rs:1159:17
...
  13: core::panicking::panic
             at src/libcore/panicking.rs:54
  14: flurry::map::HashMap<K,V,S>::add_count
             at src/map.rs:1159
  15: flurry::map::HashMap<K,V,S>::put
             at src/map.rs:1970
  16: flurry::map::HashMap<K,V,S>::insert
             at src/map.rs:1625
  17: flurry::map::tree_bins::concurrent_tree_bin::{{closure}}
             at src/map.rs:3419

DQ · Answer 1 · Sat Apr 11 2020 19:17:57 GMT+0800 (China Standard Time)

Current best guess for the order of events would be as follows:

1. We have a (regular) bin with 1 element, which is the only element in the map (count == 1).
2. Thread 1 removes that element, but is paused before its call to add_count.
3. Thread 2 inserts an element for the same key and is also paused before add_count (note that all the count updates happen outside the critical sections of the corresponding methods).
4. Thread 3 now removes this element again and decrements count to 0.
5. Thread 1 gets to run again and decrements count to usize::MAX.
6. Thread 2 gets to run and increments count to 0.
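
The order of events above can be replayed deterministically on a single thread. The following is only a minimal sketch, not flurry's actual add_count: the names replay_unsigned and naive_decrement are made up for illustration. It also assumes atomic fetch_sub/fetch_add, which wrap silently, whereas the panic in the trace comes from ordinary overflow-checked arithmetic (modeled here with checked_sub).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Replays the three deferred count updates from the list above on an
/// unsigned counter and returns the value seen after each one.
/// (Hypothetical helper, not part of flurry.)
pub fn replay_unsigned() -> [usize; 3] {
    // State after the initial insert: one element in the map.
    let count = AtomicUsize::new(1);

    // Thread 3 removes the re-inserted element and updates the count first.
    count.fetch_sub(1, Ordering::SeqCst);
    let after_thread_3 = count.load(Ordering::SeqCst);

    // Thread 1 resumes and applies its delayed decrement; fetch_sub wraps.
    count.fetch_sub(1, Ordering::SeqCst);
    let after_thread_1 = count.load(Ordering::SeqCst);

    // Thread 2 resumes and applies its delayed increment, wrapping back.
    count.fetch_add(1, Ordering::SeqCst);
    let after_thread_2 = count.load(Ordering::SeqCst);

    [after_thread_3, after_thread_1, after_thread_2]
}

/// Models a decrement done with ordinary arithmetic: `None` marks the case
/// where `current - 1` would panic in a build with overflow checks enabled.
/// (Hypothetical helper, not part of flurry.)
pub fn naive_decrement(current: usize) -> Option<usize> {
    current.checked_sub(1)
}
```

Calling replay_unsigned() yields the sequence 0, usize::MAX, 0 from the list, and naive_decrement(0) returns None, which corresponds to the "attempt to subtract with overflow" panic.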

I'm not yet sure why this is a problem for us but wouldn't be for the Java implementation. There is a validated boolean in Java's replaceNode which is omitted in our implementation due to match/continue, but I don't see how that would be the culprit.

DQ · Answer 2 · Sat Apr 11 2020 20:40:19 GMT+0800 (China Standard Time)

The Java implementation keeps all the size information as long, so it's possible that they simply allow the count to go negative. See also their implementation of size, which essentially clamps the computed value to between 0 and Integer.MAX_VALUE. The tree bin test may just be the first to trigger this for us.
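
To illustrate the clamping described above, here is a rough Rust sketch of the idea. It is an assumption-laden model, not a translation of the Java source: clamped_size is a hypothetical name, and i64/i32 stand in for Java's long/int.

```rust
/// Clamps a signed 64-bit sum into the range 0..=i32::MAX, the way a
/// size() that computes in `long` but reports an `int` could behave.
/// (Hypothetical helper; i64 and i32 stand in for Java's long and int.)
pub fn clamped_size(raw_sum: i64) -> i32 {
    // A transiently negative sum is reported as 0, and anything too
    // large for the narrower type is reported as its maximum.
    raw_sum.clamp(0, i64::from(i32::MAX)) as i32
}
```

With this shape, a raw sum of -1 is reported as 0 instead of causing an error, which matches the guess that a negative intermediate value is simply tolerated.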

It is also possible that the shared counters, which we don't have yet, influence this; there seems to be some kind of contention detection there. There is also this annotation on the counter cells. But I think it is still possible for the computed value to be negative when size is called, and that they use long just so they can perform bounds checks against int.

Jon Gjengset · Answer 3 · Sun Apr 12 2020 06:25:02 GMT+0800 (China Standard Time)

That's fascinating... I mean, I suppose we could just move it to an AtomicIsize instead... I guess they decided it wasn't worth the cost to keep the count accurate at all times. Mind sending a PR?
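
A signed counter along these lines might look like the sketch below. It is only an illustration of the AtomicIsize idea, not the change that was eventually made: SignedCount, its methods, and run_unordered_updates are all hypothetical names.

```rust
use std::sync::atomic::{AtomicIsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the map's element counter (not flurry's type).
pub struct SignedCount(AtomicIsize);

impl SignedCount {
    pub fn new() -> Self {
        SignedCount(AtomicIsize::new(0))
    }

    /// Applies a deferred update; the value may dip below zero transiently.
    pub fn add(&self, delta: isize) {
        self.0.fetch_add(delta, Ordering::SeqCst);
    }

    /// The raw, possibly negative, counter value.
    pub fn raw(&self) -> isize {
        self.0.load(Ordering::SeqCst)
    }

    /// The reported length: negative intermediate values are shown as 0.
    pub fn len(&self) -> usize {
        self.raw().max(0) as usize
    }
}

/// Spawns `pairs` decrementing and `pairs` incrementing threads with no
/// ordering between them, then returns the final raw count.
pub fn run_unordered_updates(pairs: usize) -> isize {
    let count = Arc::new(SignedCount::new());
    let mut handles = Vec::new();
    for _ in 0..pairs {
        let c = Arc::clone(&count);
        handles.push(thread::spawn(move || c.add(-1)));
        let c = Arc::clone(&count);
        handles.push(thread::spawn(move || c.add(1)));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    count.raw()
}
```

Whatever order the updates land in, the final raw count is 0 as long as every decrement is matched by an increment. The trade-off is that len can briefly under-report while updates are in flight.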

DQ · Answer 4 · Sun Apr 12 2020 16:48:36 GMT+0800 (China Standard Time)

Sure. I'll put together a PR for this and one for #83 when I'm back home. I should be able to fit that in tomorrow.

Jon Gjengset · Answer 5 · Tue Apr 14 2020 00:46:29 GMT+0800 (China Standard Time)

I can confirm that this was fixed by #88 (perhaps unsurprisingly) after having run it in a loop for a while.