jonhoo / flurry

A port of Java's ConcurrentHashMap to Rust


Implement the tree-bin optimization

jonhoo opened this issue · comments

The Java code modified large bins to be trees instead of lists. We should do the same. See the TreeNode optimization and the implementation notes on that optimization for much more detail.

commented

I've been looking through the Java code to get a feel for this. Some things I noticed:

  1. Firstly, and most problematic, the Java implementation orders the TreeNodes by

    (i) hash value
    (ii) compareTo order on the keys if (essentially) key1 instanceof Comparable&lt;key2.getClass()&gt;
    (iii) tie-breakers:
      (a) string comparison of the key class names
      (b) comparison of System.identityHashCode(key), which is essentially key.hashCode(), except that it ignores any hashCode implementation the key provides

    I see several potential problems here, namely:

    • (ii) is a runtime check on the Java Comparable interface, which heavily uses reflection. We would need to somehow express this with Ord or probably PartialOrd trait bounds, but these cannot be checked at runtime.
    • I don't know how one would do (a) in Rust
    • (b) isn't really useful for us, since we use a hasher anyways.

    This is somewhat important, since one of the reasons for having the optimization is performance in the case that hash values collide, which is exactly the case where the Java code would use (ii) or (iii).

  2. TreeBins have an additional locking mechanism for when a tree is re-balanced, as re-balancing may necessitate structural changes. The Java code does this by storing a reference to the current thread in the TreeBin, which is set when waiting for this lock. This is paired with LockSupport#park()ing the waiting thread. That thread is then unparked by a different thread when it finds a node. We'll need to find a way to model this behaviour.

I think adding a requirement that K: Ord + PartialOrd is fine given the gravity of the improvements this optimization buys us.

As for storing thread references and parking/unparking threads, that should work just fine with std::thread::current. Alternatively we could use the parking primitive exposed by parking_lot — either will do.
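As a minimal sketch of that suggestion (illustrative only, not flurry's actual code): a thread that must wait for the bin lock parks itself, and the thread currently holding the lock unparks it once its critical section is done. The names here are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical waiter/lock-holder pair. The waiter parks until signalled;
// the lock holder sets the flag and unparks it. An unpark() issued before
// the corresponding park() stores a token, so the ordering is safe.
fn park_until_signalled() -> bool {
    let done = Arc::new(AtomicBool::new(false));
    let done2 = Arc::clone(&done);

    let waiter = thread::spawn(move || {
        // park() may wake spuriously, so always re-check the condition.
        while !done2.load(Ordering::Acquire) {
            thread::park();
        }
        true
    });

    // The "lock holder": finish work, set the flag, then unpark the waiter.
    done.store(true, Ordering::Release);
    waiter.thread().unpark();
    waiter.join().unwrap()
}

fn main() {
    assert!(park_until_signalled());
}
```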

commented

Thanks for the suggestions concerning the thread lock! std::thread in particular looks very straightforward as a replacement for the Java mechanism. One of the reasons I flagged this is that I feel it's worth at least thinking about no_std in case we decide to pursue it (I didn't follow the respective threads in detail, but understand that it is on hold until actual use-cases arise). Using a std-only solution here would affect basically all of the map, since all methods may perform operations on TreeBins. Of course we could (and I think we should) start with one of the above and defer action on this to a potential future PR aimed at no_std.

Having an ordered key type would certainly be helpful. If we require Ord, there would be no need for tie-breakers anymore, since either we get a direction or we have found the node (Rust assumes Ord/PartialOrd/Eq agree, and Eq is how we compare keys at the moment). Only having PartialOrd would leave the tie-breaker question open. The only immediate use-case I can come up with for keys that are PartialOrd but not Ord is floating point numbers. At work we often have mappings from some form of time step to a measured value. I'm not sure how relevant excluding this is, though.

I don't think we actually have a tie-breaker in Rust if we don't require Ord. Since Rust values aren't generally heap-allocated, the address is not a good proxy to use. I think we should just straight-up require Ord, and then it's up to the user how to bridge PartialOrd if they need to.
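As a sketch of what requiring Ord buys here (not flurry's actual comparison code, just an illustration): compare hashes first, and on a collision fall back to the keys' Ord; no tie-breakers are needed, because equal hash plus equal keys means the node was found.

```rust
use std::cmp::Ordering as CmpOrdering;

// Hypothetical tree-node comparison with `K: Ord`: hash first,
// then key order on a hash collision.
fn tree_cmp<K: Ord>(hash1: u64, key1: &K, hash2: u64, key2: &K) -> CmpOrdering {
    hash1.cmp(&hash2).then_with(|| key1.cmp(key2))
}

fn main() {
    // Different hashes: decided by hash alone.
    assert_eq!(tree_cmp(1, &"a", 2, &"a"), CmpOrdering::Less);
    // Hash collision: decided by the keys' Ord.
    assert_eq!(tree_cmp(7, &"a", 7, &"b"), CmpOrdering::Less);
    // Same hash, equal keys: this is the node we were looking for.
    assert_eq!(tree_cmp(7, &"a", 7, &"a"), CmpOrdering::Equal);
}
```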

For no_std, I don't think it's worth keeping in mind for this change. We should make the change in the best way we can. For no_std, the API will have to change anyway, so we shouldn't try to design for it ahead of time.

commented

I agree, as indicated, but I think it was worth discussing. I'll probably not get to work on any implementation in this or the coming week, so if anyone comes across this and wants to start implementing TreeBins please feel free 👍

commented

Alright, I'm back! It seems no one worked on this so far, so I started with the implementation. I'll report back when I have something working. So far I implemented the tree nodes themselves and the red-black tree methods. After insertion/deletion of tree nodes I should at least be able to construct some TreeBins.

Also, I'm currently on 111 derefs/deref_muts for all the tree handling methods (they show up 'cause I saved safety arguments for later) 😅

commented

As I'm working through this, I think there might be a problem with implementing the additional locking mechanism. The Java code for finding a node (reading a value) uses the approach of following the tree pointers (left/right) if there is no tree operation in progress, and following the standard next pointers of any Node if there is. The next pointers are linear and do not change. This allows the assumption that only one thread ever needs to wait for an operation, since readers never wait. Only writers wait for other things to finish.

What is tricky is that all of these pointers are accessed via the same node. So a writing thread might deref_mut a Shared to a TreeNode to reorder the tree while a reading thread wants to deref the same node to follow its next pointer. I would assume this causes trouble, even if none of the mutated fields are accessed?

Yeah, you certainly can't give out a & and a &mut to one thing at the same time. But why does the writing thread need deref_mut? Shouldn't it only need & as well, and then use atomic operations (which only require &) to modify left and right?
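A minimal sketch of that suggestion, assuming an illustrative TreeNode and using std's AtomicPtr for self-containedness (flurry actually uses crossbeam-epoch's Atomic): with atomic child pointers, the writer re-links the tree through a shared reference, so it never needs `&mut` while readers hold `&` to the same node.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical node with an atomic child pointer.
struct TreeNode {
    left: AtomicPtr<TreeNode>,
}

// Both the "writer" and the "reader" only need `&TreeNode`:
// atomic store/load provide the mutation and the read.
fn relink_and_read(root: &TreeNode, child: *mut TreeNode) -> bool {
    // Writer: swing the left pointer with an atomic store.
    root.left.store(child, Ordering::Release);
    // Reader: follow the pointer, again only through `&`.
    !root.left.load(Ordering::Acquire).is_null()
}

fn main() {
    let child = Box::into_raw(Box::new(TreeNode {
        left: AtomicPtr::new(ptr::null_mut()),
    }));
    let root = TreeNode {
        left: AtomicPtr::new(ptr::null_mut()),
    };
    assert!(relink_and_read(&root, child));
    // Reclaim the demo allocation (the real code uses epoch-based reclamation).
    unsafe { drop(Box::from_raw(child)) };
}
```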

commented

It may be best to stick to that. The left and right pointers are already atomic, but in the Java code the flag indicating whether a TreeNode is red or black is not specially protected. So my first (directly translated) implementation modifies this property directly (hence the mut). The flag does not technically need to be atomic, since all writing operations are synchronized via the bin lock, so I was wondering whether this could be handled with less overhead (which I probably should have put in my comment, as you can't know what's in my head...).

I know Cell is the basic thing to use for interior mutability, do you know if that's actually better than atomics if we guarantee synchronized access or if there is a better way to do this? Maybe I'm thinking about this too much, it's just a situation where I feel I lack the experience to instinctively have a good answer.

@domenicquirl I would use AtomicBool and then use Ordering::Relaxed — performance-wise, that should be fine :)
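A sketch of that suggestion (illustrative names, not flurry's actual TreeNode): the color flag becomes an AtomicBool mutated through `&self`, and since all writers are serialized by the bin lock, Relaxed ordering is enough; the lock's own acquire/release fences order the writes.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical tree node carrying only the red/black color flag.
struct TreeNode {
    red: AtomicBool,
}

impl TreeNode {
    fn new() -> Self {
        // New nodes in a red-black tree are inserted red.
        TreeNode { red: AtomicBool::new(true) }
    }
    fn is_red(&self) -> bool {
        self.red.load(Ordering::Relaxed)
    }
    // Note `&self`, not `&mut self`: interior mutability via the atomic.
    fn set_black(&self) {
        self.red.store(false, Ordering::Relaxed);
    }
}

fn main() {
    let node = TreeNode::new();
    assert!(node.is_red());
    node.set_black();
    assert!(!node.is_red());
}
```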

commented

Any operation that takes the additional TreeBin lock takes the precaution of having the critical section run inside a try with an associated finally that releases the lock even if an error occurred. What would be a good way to handle this in Rust?

Edit: an example of this behaviour in the Java code can be seen here:

```java
lockRoot();
try {
    root = balanceInsertion(root, x);
} finally {
    unlockRoot();
}
```

I think that's just in case of exceptions, which would be equivalent to panics in Rust. They already drop on unwind, so I don't think you need the equivalent of that in the Rust code?
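For reference, the idiomatic Rust counterpart of that try/finally is a guard whose Drop releases the lock, so it runs on normal exit and on unwind alike. A toy sketch with hypothetical names (TreeBin, lock_root):

```rust
use std::cell::Cell;

// Hypothetical bin with a toy lock flag; a real implementation would use
// an atomic lock state, but Cell keeps the Drop mechanics in focus.
struct TreeBin {
    locked: Cell<bool>,
}

// RAII guard: constructing it takes the lock, dropping it releases the lock.
struct RootGuard<'a>(&'a TreeBin);

impl TreeBin {
    fn lock_root(&self) -> RootGuard<'_> {
        self.locked.set(true);
        RootGuard(self)
    }
}

impl Drop for RootGuard<'_> {
    fn drop(&mut self) {
        // Runs on normal exit *and* during a panic's unwind, like `finally`.
        self.0.locked.set(false);
    }
}

fn main() {
    let bin = TreeBin { locked: Cell::new(false) };
    {
        let _guard = bin.lock_root();
        assert!(bin.locked.get());
        // ... the balancing work would happen here ...
    } // guard dropped: lock released even if the body had panicked
    assert!(!bin.locked.get());
}
```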

commented

I'm not sure. The Java code throws no explicit exceptions here, and I don't see where any would obviously come from; it's just pointer swaps. Maybe they're just being extra careful. I've left it out for now, but wanted to double-check.

commented

A small update on this: I'm back from my holiday and have finished porting the code. Some tests crashed at first, but while working through the resulting errors I could verify that the TreeBin cases are being hit. I still need to add some info about TreeBins to the global documentation, then this should be ready for review.

With the TreeBin code added, clippy now complains about cognitive complexity in some functions. For the time being, I just #![allow(...)]ed this - I don't think moving code around helps a lot if we lose the correspondence to the Java code. But I'm open to different approaches.

I also ran our benchmarks against the new code, more as a heavy-weight test than for the actual numbers. Nevertheless, the rough direction for performance looks very nice for insert, where the improvement scales visibly with the number of threads. Iterator-related benches also went up significantly; get and friends went down locally (though by less than the others went up, and constant rather than scaling with threads). The latter makes some sense given that lookups in TreeBins are slower per step, but it still confuses me, because it also happens in environments without writing threads (so all reads should be able to make use of the ordered trees). Maybe the number of elements in the benchmarks is still too low, or something. Anyhow, I wouldn't put too much value on the numbers from this one run; this is mostly intended as informative ^^