contain-rs / discuss

A center for discussion and planning of the contain-rs organization.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Extension to BinaryHeap for efficiently modifying the greatest element

bsteinb opened this issue · comments

Hello container enthusiasts.

I have recently been working on a K-Way merge for the itertools crate which internally uses BinaryHeap from std (rust-itertools/itertools#97). Using BinaryHeap's interface as it is, the merge algorithm goes through a lot of pop() -> modify -> optional push() sequences. This is similar to the pop() -> modify -> push() sequence that a decrease-key/increase-key operation can be mapped to. Both sequences can be optimised by allowing (safe) access to the first element of the heap for inspection/modification/removal and only restoring the heap invariant once the operation is complete (unlike pop() -> modify -> push() which has to sift twice).

Inspired by @frankmcsherry's optimised implementation of the algorithm using an ad-hoc heap, I have set out to capture the essence of the optimised access pattern in an interface that does not require access to the internals of the heap data structure. I have added them to a clone of the binary_heap module.

The experiment currently contains two sets of interfaces. One that uses closures for the modify part (or the modify -> optional push() part) and is similar to the poke() interface proposed by @llogiq. I have named the functions pop_push() and pop_and_then_push() which offer two variants of the optimised pop() -> modify -> optional push() sequence and peek_mut() for the optimised pop() -> modify -> push() sequence by way of a mutable reference.

The other interface should cover the same operations but is based on wrapper types that provide access to the greatest element and restore the heap invariant in drop() (similar to Vecs Drain wrapper). The wrapper types are Head which allows pop() -> modify -> optional push() and HeadRef for pop() -> modify -> push().

While the closure based interface is safer (no wrappers that might be leaked), the wrapper interface feels more rustic (just like Iterators and Drain, I guess). The implementations make use of unsafe code for a performance improvement of roughly 2 x in my experiments based on the K-way merge which operate on small heaps.

At the moment, neither interface has the ability to modify any element but the greatest as was discussed in #11.

I now turn to you for feedback on my design. Does the interface seem sensible? Is this something that might be considered for inclusion in std? I know it significantly adds to the surface of BinaryHeaps interface for just an optimisation, but on the other hand it could be used to either re-implement some of the methods (e.g. pop() and replace()) or altogether subsume them.

Just a random programmer from the interwebs, so take my 0.02 with as much salt as you want.

The Head api sounds to me kind of like the Entry api, and seems good to me.

Ok, now for my crazy thoughts. Add the ability for the heap to be in a broken state.
If the heap is in correct form, then pop returns the max item and leaves a hole in the heap.
If the heap has a hole, then pop fixes the hole and then returns the max item and leaves a hole in the heap.
If the heap is in correct form, then push dose the usual.
If the heap has a hole, then push replaces the hole with the new item and fixes the invariant.

I think this naturally optimizes pop then push without new syntax. Is this crazy?

I think it could work with a kind of Entry-like API that mutably borrows the heap, mediating all accesses to it while in an inconsistent state. However, this would hinge on drop() being called which is not guaranteed, so we'd need to poison the heap (e.g. setting size=0 or better some invalid value) while in an invalid state, and restore it on return.

We've added this to the standard library's BinaryHeap: https://doc.rust-lang.org/beta/std/collections/struct.BinaryHeap.html#method.peek_mut. It's up for stabilization in 1.12.

@sfackler apparently this does neither poison nor leak-amplify.

It doesn't need to. The worst that happens if the guard is leaked is that the heap is improperly ordered, which can already happen if an Ord method panics. Things accessing the heap after that happens may see improperly ordered values, but there are no safety issues.

That is true. However, it means the operation does not fail fast, but leaves the heap in an inconsistent state, which means subsequent operations can silently return wrong values. This is a quite unfortunate error state, if you ask me. Poisoning – perhaps at least in debug mode – would be a preferable solution if you ask me.

I'm not aware of a context in which someone would end up accidentally leaking the guard, so it doesn't really seem like something worth investing all that much effort in. I don't believe any other guards poison themselves after a leak (e.g. Vec::drain). They in general do the easiest thing that avoids memory corruption etc.