Amanieu / intrusive-rs

Intrusive collections for Rust

Feature: HashSet

Diggsey opened this issue · comments

I imagine this could work in much the same way as std::collections::HashSet (in that it uses open addressing), with the "link" storing the index of the bucket an item belongs to.

What is the advantage over just using a std::collections::HashSet directly?

Well, without an intrusive data structure it would have to be a HashMap.

Two benefits:

  1. Speed: the hashing process is slow, particularly when using the default hasher. Storing the bucket index means the actual lookup process is completely free.
  2. Space: sometimes you are using a large key. With a HashMap, that key needs to be stored twice: once in the map itself, and once outside the map (in your intrusive data structure) so that the item can be removed from the map efficiently.

Also, large keys slow down the hashing process even further.
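To make the space point concrete, here is a minimal sketch (the Object type and field names are invented for illustration) of how a non-intrusive map forces the key to exist twice:

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Invented example type: `name` plays the role of the large key.
struct Object {
    name: String,
    value: u32,
}

// Build a map keyed by name. The key string must be cloned into the
// map, so it is stored twice: once in the Object, once in the map.
fn lookup_value(name: &str) -> Option<u32> {
    let obj = Rc::new(Object {
        name: "a-rather-long-unique-key".to_string(),
        value: 7,
    });
    let mut by_name: HashMap<String, Rc<Object>> = HashMap::new();
    by_name.insert(obj.name.clone(), obj.clone());
    by_name.get(name).map(|o| o.value)
}

fn main() {
    assert_eq!(lookup_value("a-rather-long-unique-key"), Some(7));
}
```

An intrusive set would instead link the Object itself into the table, so the key lives in exactly one place.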

I mean, wouldn't this be exactly the same as HashSet<Box<Object>>?

  1. I don't understand what you mean by this. If you already have a reference to an object then why would you need to look it up?
  2. I think that what you really want here is a better HashSet where you can customize how the key is extracted from a value.

Maybe a small code example here could help make things clearer.

You can get some of the way there by using a custom newtype around an Rc<Object>. However, you still can't remove from a HashSet by reference: it has to go through the rigmarole of extracting the key from the object, hashing the key, finding the key in the set based on the hash and the key value, and only then removing it.
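A sketch of that newtype workaround (the Object type is invented for illustration) shows why removal still has to go through the key:

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::rc::Rc;

// Invented example type; `id` is the key.
struct Object {
    id: u64,
}

// Newtype so the set hashes and compares by the object's key only.
struct ById(Rc<Object>);

impl PartialEq for ById {
    fn eq(&self, other: &Self) -> bool {
        self.0.id == other.0.id
    }
}
impl Eq for ById {}
impl Hash for ById {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.id.hash(state);
    }
}

// Even though we already hold a reference to the object, removal still
// re-hashes the key and probes the table to find the matching entry.
fn remove_obj(set: &mut HashSet<ById>, obj: &Rc<Object>) -> bool {
    set.remove(&ById(obj.clone()))
}

fn demo() -> (bool, bool) {
    let obj = Rc::new(Object { id: 42 });
    let mut set = HashSet::new();
    set.insert(ById(obj.clone()));
    // First removal succeeds; the second finds nothing.
    (remove_obj(&mut set, &obj), remove_obj(&mut set, &obj))
}

fn main() {
    assert_eq!(demo(), (true, false));
}
```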

With an intrusive hash set, it doesn't have to do any of that: the link can directly store the bucket index, and removing the element can be done safely without having to use the key at all.

As an example, I may have an Object struct that has a position (x, y), and an ID. With an intrusive hash set, I can keep two sets:

objects_by_id: IntrusiveHashSet<IdAdapter>,
objects_by_pos: IntrusiveHashSet<PosAdapter>,

Now I can create and delete objects, and it's easy to update both sets in the process. If I index more properties, removing objects gets progressively slower, as I have to hash several different properties of the same object, and not all of those keys may even be cheap to compute.
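The non-intrusive version of this two-index scheme, sketched below with std::collections::HashMap and invented types, makes the removal cost visible: every index has to hash its own key.

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Invented example type, indexed two ways as in the example above.
struct Object {
    id: u64,
    pos: (i32, i32),
}

struct World {
    by_id: HashMap<u64, Rc<Object>>,
    by_pos: HashMap<(i32, i32), Rc<Object>>,
}

impl World {
    fn new() -> Self {
        World { by_id: HashMap::new(), by_pos: HashMap::new() }
    }

    fn insert(&mut self, obj: Rc<Object>) {
        self.by_id.insert(obj.id, obj.clone());
        self.by_pos.insert(obj.pos, obj);
    }

    // Each index hashes and probes with its own key, so a removal
    // costs N hash computations for N indexes. An intrusive set that
    // cached its bucket index could skip all of them.
    fn remove(&mut self, obj: &Object) {
        self.by_id.remove(&obj.id);
        self.by_pos.remove(&obj.pos);
    }
}

fn demo() -> (usize, usize) {
    let mut world = World::new();
    let obj = Rc::new(Object { id: 1, pos: (3, 4) });
    world.insert(obj.clone());
    assert_eq!(world.by_id.len(), 1);
    world.remove(&obj);
    (world.by_id.len(), world.by_pos.len())
}

fn main() {
    assert_eq!(demo(), (0, 0));
}
```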

I personally don't think that simply caching the bucket index is worth the complexity of implementing yet another hash table. The cost of looking up an item in the map is generally dwarfed by that of removing an element anyways.

As a side note, I just released a new hash table implementation which is much faster than the standard library one. You might want to use it if you feel the standard HashMap is too slow.

The cost of looking up an item in the map is generally dwarfed by that of removing an element anyways.

I'm not sure that's true? There might be the occasional degenerate case where you have to do a lot of back-shifting of later elements or resize the hash table, but hashing the key is also costly, especially with SipHasher.

Ideally it would be possible to share or reuse a base hash table implementation. But even if you think that using a standard HashMap or your hashbrown library is preferable, the ability to use a hash table in the same way as the other intrusive collections would be much more convenient. Maybe you could consider adding an "intrusive hash table" that still looks up elements by key, but is more convenient to use than a HashSet<NewType<Rc<Foo>>>, and at least leaves open the possibility of a faster implementation in the future (such as caching which bucket an item is in).

I think what you really want here is a way to customize hash tables with a way to extract a key, sort of like what KeyAdapter does.
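A rough std-only sketch of that idea (the trait and all names here are hypothetical, loosely modelled on the crate's KeyAdapter): the adapter trait decides what the key is, and a wrapper makes the set hash and compare by that key.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Hypothetical KeyAdapter-style trait: the element type declares how
// its key is extracted, instead of implementing Hash/Eq directly.
trait KeyExtract {
    type Key: Hash + Eq;
    fn key(&self) -> &Self::Key;
}

// Wrapper that hashes and compares elements by their extracted key.
struct ByKey<T: KeyExtract>(T);

impl<T: KeyExtract> PartialEq for ByKey<T> {
    fn eq(&self, other: &Self) -> bool {
        self.0.key() == other.0.key()
    }
}
impl<T: KeyExtract> Eq for ByKey<T> {}
impl<T: KeyExtract> Hash for ByKey<T> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.key().hash(state);
    }
}

// Invented example type keyed by `id`.
struct Object {
    id: u64,
    name: String,
}

impl KeyExtract for Object {
    type Key = u64;
    fn key(&self) -> &u64 {
        &self.id
    }
}

fn demo() -> bool {
    let mut set: HashSet<ByKey<Object>> = HashSet::new();
    set.insert(ByKey(Object { id: 7, name: "seven".to_string() }));
    // Lookup only needs a probe value with an equal key.
    set.contains(&ByKey(Object { id: 7, name: String::new() }))
}

fn main() {
    assert!(demo());
}
```

A real API would presumably let lookups take just the key rather than a whole probe object; that is roughly what HashSet's Borrow-based methods or hashbrown's raw entry API allow.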