NVIDIA / cuCollections


[FEATURE]: Support fast insert operations for single set/map

PointKernel opened this issue · comments

Is your feature request related to a problem? Please describe.

Currently, cuco hash tables always use atomic CAS for table insertions.

This could introduce unnecessary costs in the case of perfect hashing where we know the hash values of all inputs are distinct and hash collisions would never happen.

Describe the solution you'd like

Inspired by @esoha-nvidia's comments on perfect hashing, we could pass an is_perfect/is_perfect_hashing argument to the insert functions: when is_perfect == true, the input element is copied directly to the target bucket. Otherwise, it is atomically CASed into the desired bucket.
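
A minimal host-side sketch of the distinction (hypothetical names; `std::atomic` stands in for cuco's device-side atomics): when the caller vouches for perfect hashing, the slot is written with a plain store, otherwise it is claimed via compare-and-swap.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel marking an unoccupied slot (hypothetical convention).
constexpr std::int64_t empty_key = -1;

// Sketch, not the actual cuco implementation: an insert that skips the CAS
// when the caller guarantees perfect hashing.
bool insert(std::vector<std::atomic<std::int64_t>>& slots,
            std::int64_t key,
            std::size_t bucket,
            bool is_perfect)
{
  if (is_perfect) {
    // Perfect hashing: no other thread can target this bucket, so a plain
    // (relaxed) store suffices.
    slots[bucket].store(key, std::memory_order_relaxed);
    return true;
  }
  // General case: claim the bucket atomically via compare-and-swap.
  auto expected = empty_key;
  return slots[bucket].compare_exchange_strong(expected, key);
}
```

The branch here is shown at runtime only for brevity; as discussed further down in this thread, the real decision should be a compile-time template parameter.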

Describe alternatives you've considered

Apart from adding an extra parameter is_perfect to existing APIs, we could also introduce a new API instead.

The host-bulk API could look like:

  /**
   * @brief Asynchronously inserts all values in the range `[first, last)`.
   *
   * @note: All input values must be distinct. Behavior is undefined if duplicates are present in the input.
   * 
   * @tparam InputIt Device accessible random access input iterator where
   * <tt>std::is_convertible<std::iterator_traits<InputIt>::value_type,
   * static_map<K, V>::value_type></tt> is `true`
   *
   * @param first Beginning of the sequence of values
   * @param last End of the sequence of values
   * @param stream CUDA stream used for insert
   */
  template <typename InputIt>
  void fast_insert_async(InputIt first, InputIt last, cuda_stream_ref stream = {}) noexcept;

The device API:

  /**
   * @brief Copies the given element to its corresponding bucket
   *
   * @note: This function performs a non-atomic hash table insertion by copying the given element
   * directly to its designated bucket. Behavior is undefined if the given input is not distinct
   * from all other inserted elements.
   * 
   * @tparam Value Input type which is convertible to 'value_type'
   *
   * @param value The element to insert
   */
  template <typename Value>
  __device__ void fast_insert(Value const& value) noexcept;

Additional context

Would perfect hashing be a valid use case for multiset/multimap as well? To avoid jumping into the black hole directly, we should probably just focus on static_set/static_map for now.

Not sure if this is too hacky but couldn't we just instantiate the map with thread_scope_thread (and then of course implement that code path)?

This could introduce unnecessary costs in the case of perfect hashing where we know the hash values of all inputs are distinct and hash collisions would never happen.

Even if all the hash values are distinct, you could still end up with collisions when you do a modulo with the size of the storage, right?

@jrhemstad is correct, which is why the test involves checking for the max of the hash.

In my vision, the test for whether it is okay to skip using atomics is that the hash function's is_perfect routine returns true and also that the hash function's max function, which returns the maximum possible value of the hash, is less than or equal to the number of slots in the hash.

I think that it's not a good idea to introduce new functions as above because they will not be easily found. I think that a better way could be to add the max and is_perfect functions to the hasher and allow the hasher to select which one will run. The kernel should be making the distinction between using atomics and not using atomics at compile time, with a template parameter. This makes it run faster than if we introduce a branch into the kernel. The whole point is performance!

For hashers that do not implement max and is_perfect, just assume that we must use atomics. Again, this is already known at compile time because the hasher is a template param. We can use SFINAE to detect this.
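
A sketch of what such SFINAE-based detection could look like (all names here are hypothetical, not the actual cuco interface): the detection idiom checks whether a hasher exposes static `is_perfect()` and `max()` members, and a constexpr helper combines that with the range check described above.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Detection idiom: does the hasher expose static is_perfect() and max()?
template <typename Hash, typename = void>
struct has_perfect_interface : std::false_type {};

template <typename Hash>
struct has_perfect_interface<
  Hash,
  std::void_t<decltype(Hash::is_perfect()), decltype(Hash::max())>>
  : std::true_type {};

// Compile-time decision: atomics can be skipped only if the hasher advertises
// perfection AND its maximum hash value fits inside the table.
template <typename Hash>
constexpr bool can_skip_atomics(std::size_t num_slots)
{
  if constexpr (has_perfect_interface<Hash>::value) {
    return Hash::is_perfect() && Hash::max() < num_slots;
  } else {
    return false;  // unknown hasher: stay on the safe (atomic) path
  }
}

// Example hashers (hypothetical):
struct identity_hash {
  static constexpr bool is_perfect() { return true; }
  static constexpr std::uint64_t max() { return 1023; }
  std::uint64_t operator()(std::uint64_t k) const { return k; }
};

struct opaque_hash {  // implements neither is_perfect() nor max()
  std::uint64_t operator()(std::uint64_t k) const
  {
    return k * 0x9E3779B97F4A7C15ull;
  }
};
```

Because everything is resolved at compile time, the kernel can be instantiated on the result without any runtime branch.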

instantiate the map with thread_scope_thread

That's a brilliant idea. I don't see any big issues for now, apart from the "hacky" fact that using the thread scope this way is equivalent to having no atomics involved in the hash table internals.

Using the scope to eliminate atomics requires the caller to know whether or not atomics can be eliminated. I don't think that is a good idea. I don't think that we should require the caller to know about the internals of how the hashing is implemented. It will not be discoverable by users and, if we should change how hashing works in the future, it will break clients.

I would prefer for the hash function to know whether or not it is perfect and what its maximum is and then let the hash table determine whether or not atomics should be used. This is a better separation of concerns.

Using the scope to eliminate atomics requires the caller to know whether or not atomics can be eliminated.

Yes, that's my point. The philosophy is that cuco provides only the core functionality needed for doing hash table building/probing but exposes a set of customization points for users to tune their implementation. Like in your case you can introspect the data, which gives you the ability to define a custom hasher (customization point object), and also lets you determine whether or not atomic updates are required (thread_scope_device vs. thread_scope_thread).

I totally see the benefit of your approach, and I'm still impressed by the implementation. The question I'm asking myself is if this should go directly into cuco or rather a separate utility/helper repo.

I also don't know. But maybe we don't have to solve everything all at once. cuCo already provides multiple hash functions. What about providing the "significant bits" hasher?

If we can get it working and it proves to be pretty fast, maybe add it to the rest of the hash functions?

I'm interested in this idea. This hasher is also introspecting the data in order to determine the significant bits, right? If we can press this implementation into a hasher that is consumable by cuco we can give it a try. Maybe open a GH issue so we can track this discussion separately.

@esoha-nvidia Daniel and I discussed this offline and we realized that in the case of perfect hashing, a normal hash table like cuco::static_map is somewhat bulky and a device vector of payloads probably works the best, e.g. with identity hash:

vector[identity_hash(key)] = payload; // insert
auto const found = vector[identity_hash(query_key)]; // find

On top of non-atomic insertions, it also gets rid of the expensive probing mechanism and thus should be more efficient than any hash table.
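
The vector-of-payloads idea can be sketched on the host like this (hypothetical names; `std::optional` stands in for an empty-slot sentinel): with a perfect identity hash whose range fits the vector, insert and find are plain indexed accesses with no probing, no atomics, and no key comparison.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Sketch of the "device vector of payloads" idea with an identity hash.
struct perfect_table {
  std::vector<std::optional<std::int32_t>> payloads;

  explicit perfect_table(std::size_t capacity) : payloads(capacity) {}

  void insert(std::uint32_t key, std::int32_t payload)
  {
    payloads[key] = payload;  // identity hash: bucket == key
  }

  std::optional<std::int32_t> find(std::uint32_t key) const
  {
    return payloads[key];  // empty optional == not found
  }
};
```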

It would be, in my opinion, nicer if cuCo could determine for itself if the hashing is perfect and then do all that behind the scenes. The advantage would be that the user could swap out a regular hash for a perfect hash with ease, by just selecting a different hash function, instead of needing to rewrite so much. Also, users might still want things like insert_and_find that's already implemented in cuCo.

In the same way that cuCo has optimizations when the key is of a certain bitwidth, it could also have optimizations for when the hash has certain properties (like being perfect, for example).

It would be, in my opinion, nicer if cuCo could determine for itself if the hashing is perfect

This would be nice to have if we could figure out a way other than doing data introspection during hash construction.

A perfect hash function not only affects the hasher itself but also changes the way the container is probed. With that in mind, instead of fiddling around with the hasher and doing STL-unfriendly operations, I'm considering exposing a new probing scheme called, e.g., perfect_probing. It's the users' responsibility to make sure the hasher passed to perfect_probing is a perfect hash function; on the cuco side, we can do what's necessary to guarantee that no modulo or atomic operations are involved.

users might still want things like insert_and_find that's already implemented in cuCo.

Totally makes sense

It would be valuable if you could benchmark the speedups brought by the device vector lookup table (#475 (comment)) in your case. It represents the SOL performance of perfect hashing IMO

I'm considering exposing a new probing scheme called, e.g., perfect_probing

That's a great idea! The signature could look like this:

template <class Hash>
class perfect_probing {
  static constexpr int32_t cg_size = 1; // There's no point in using a CG size larger than 1
  
  // ...
};

The probing iterator returned by the scheme's operator() will simply return storage[hash(key)] at the initial probing position and return end() (not found) after the first increment.

A user must ensure that they pass a perfect hash function for the given storage capacity; otherwise, behavior is undefined.

With this probing scheme, we eliminate the unnecessarily complex probing code path and also omit the expensive modulo operator.
This, plus the cuda::thread_scope_thread trick to eliminate any unnecessary atomic ops, should yield the same performance as a simple vector access, as described by @PointKernel.
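
A host-side sketch of such a single-probe iterator (hypothetical, not the actual cuco probing-iterator interface): the initial position is hash(key), and any increment moves straight to end(), so the probe sequence has length one and needs no modulo.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of a probing iterator for a perfect hash: exactly one probe.
template <typename Hash>
class perfect_probing_iterator {
 public:
  static constexpr std::size_t end_pos = static_cast<std::size_t>(-1);

  perfect_probing_iterator(std::uint64_t key, Hash hash)
    : pos_{static_cast<std::size_t>(hash(key))}
  {
  }

  // Current slot index; no modulo since the hash range fits the storage.
  std::size_t operator*() const { return pos_; }

  perfect_probing_iterator& operator++()
  {
    pos_ = end_pos;  // a perfect hash probes exactly one slot
    return *this;
  }

  bool at_end() const { return pos_ == end_pos; }

 private:
  std::size_t pos_;
};
```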

This would be nice to have if we could figure out a way other than doing data introspection during hash construction.

Perfect hashing without data introspection? That should be impossible unless your hash function is the identity function, right? Anyway, it is not the concern of cuCo which hash function the user chooses. cuCo should be able to work with any valid hash function!

A new probing scheme seems like a good solution. Should it also affect the equality operation? Under perfect hashing, you don't need to check for equality, right?

The three things that the user must do in order to elide the use of atomics:

  • hash must be perfect (hash(x)==hash(y) => x==y)
  • number of slots in hash table must be larger than the range of the hash (hash(x) < number_of_slots for all x)
  • range of hash can be written lock-free

The user can ensure the first two but what about the third? Maybe cuCo can provide a function for this? is_lock_free<key_type>(). It's a constexpr function. I know that this might be identical to cuda::atomic::is_lock_free or whatever function that cuda/std provides but cuCo ought to provide its own, in case the implementation of the hash table changes how it works.
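
A sketch of such a cuco-owned trait, assuming it simply forwards to the standard library's compile-time guarantee for now (the name mirrors the `is_lock_free<key_type>()` suggestion above; the wide-key struct is a made-up example):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical wrapper: expose the lock-free guarantee under a cuco-owned
// name, so clients don't depend on how the table is implemented today.
template <typename Key>
constexpr bool is_lock_free()
{
  return std::atomic<Key>::is_always_lock_free;
}

// A 64-byte trivially copyable key: too wide for lock-free atomics on
// mainstream targets.
struct wide_key {
  char bytes[64];
};
```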

Benchmarks

Here we compare a groupby operation with and without perfect hashing:


The groupby operation in SQL, which uses hashing, has improved from 7.448ms to 5.351ms, an improvement of 2.097ms which is 39% faster overall. Examining just the insert, we see that it has gone from 5.060ms to 3.072ms, which is basically the vast majority of that 2.097ms. (The rest is just noise in measurement.) This is a 64.7% improvement in hash insertion time.

Here is it for lookup:


The version with atomics has perfect hashing disabled and takes 943.444us. The one without atomics has perfect hashing enabled and takes 817.397us. Here the advantage comes not from eliding the atomics, which are not used during lookup, but from the ability to elide the equality operator. The equality operator requires a dereference and a read of additional memory when cuCo is used in "row hasher" mode, which is what my code does and also what cuDF does. This is a more modest gain of just 15%.

cuda::thread_scope_thread trick

BTW, one problem with this approach is that it is a trick. The need for it is going to be non-obvious to users. And you're depending on the fact that thread_scope_thread does not use atomics, which is true for current architectures but perhaps not in the future. This is overloading the meaning of thread_scope_thread. Not lovely

Just put in each probing scheme a method private: constexpr bool is_perfect() const;. Let it be true for the perfect probing and false for the others. Behind the scenes, do what you want to implement it, including using thread_scope_thread. But don't make that something that the user has to know about.

It goes without saying that all this must be a branch at compile time, right? The insert kernel should be templated on whether or not there is perfect hashing. You can't do that branch in the kernel or it will rob you of the performance gains!

BTW, one problem with this approach is that it is a trick.

Agreed. cuco should take care of the thread scope internally for perfect hashing.

Also, considering double_hashing is one of the probing schemes, perfect_hashing is a better name compared to perfect_probing IMO.

Should it also affect the equality operation? Under perfect hashing, you don't need to check for equality, right?

Right, that's another thing we have to figure out. In cuco we pass the KeyEqual separately from the ProbingScheme (which contains the Hasher). If the Hasher is a perfect hasher, then we don't need to perform actual key/slot comparisons. Instead, we assume (or better require) that all keys that hash to the same slot are bitwise identical. Otherwise we run into a race condition if we're not using atomics. So how do we enforce that? It seems weird if we add something like this to the docs: "If the hasher is a perfect hasher, then the KeyEqual type passed by the user will be fully ignored and instead bitwise equality will be assumed".

Maybe cuCo can provide a function for this? is_lock_free<key_type>()

Currently, cuco requires (cuda::)std::atomic<key_type>::is_lock_free() == true for all key types. That's just a remnant from us switching from std::atomic to std::atomic_ref. Since the _ref means it's only a lightweight wrapper around an existing (unsynchronized) object that then behaves like an atomic, there's no way to place a mutex into the object like it's the case with std::atomic<T> for when std::atomic<T>::is_lock_free() == false.

There's one more problem which I'm trying to wrap my head around. If the user specifies thread_scope_block but the container will instead implicitly use thread_scope_thread (since it's a perfect hash function), would that be a foot gun in terms of memory ordering? thread_scope_thread implies a plain STG instruction during insert, which (without any additional syncs or fences) will only be visible to the current thread. If the user however assumes thread_scope_block this might lead to some confusing errors.
For example: Let's say we have two threads t1 and t2 both calling set.insert_and_find(42) on the same set in global memory. The correct behavior in this scenario is that only one of the threads wins the slot, while the other just returns the key that is already materialized in the slot, i.e. the return values are either t1->{42, true} and t2->{42, false} or t1->{42, false} and t2->{42, true}. However, due to the aforementioned memory ordering shenanigans we might end up with t1->{42, true} and t2->{42, true}, since t2 might not be able to observe the key written by t1 and vice-versa.

Oh, one more thing I forgot to mention: How do we handle heterogeneous insert/lookup (blog post) in this case? We should disable it altogether, right? There are so many things to look out for to get this feature going -.-

  • key equal is not needed during insertion but is still desired during the query
  • heterogeneous insert should be fine as long as the slot type can be constructed from iterator's value_type
  • what are the potential issues supporting heterogeneous lookup for perfect hashing?

I don't think that you need key equal during the lookup. If the slot is not empty, it's a match.

Heterogeneous lookup should work fine. The data is unaffected by the removal of the use of atomics, right? We still read and write the same bits, we just do it without atomics because we have confidence that there will be no collisions.

If the slot is not empty, it's a match.

💡 Oh, perfect hashing is even more perfect than I thought

Yes it is!

Perfect hashing means that hash(x) == hash(y) => x == y. That is, if two elements hash to the same result, they are sure to be the same value. If the hash table is larger than the range of the hash, that is hash_table_size > hash(a) for all a, then the insertion into the hash, which writes to hash(x) % hash_table_size, will write to hash(x) itself. So if hash(y) is occupied when we look up y, it must be that it was occupied by something for which hash(x) == hash(y). And from the definition of perfect hashing above, that implies x == y. So you don't have to do a comparison; the fact that the slot is not empty is enough.
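
The argument above can be turned into a tiny worked example (hypothetical names; identity hash, which is trivially perfect): lookup only checks occupancy and never compares keys.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel marking an unoccupied slot (hypothetical convention).
constexpr std::uint64_t empty_sentinel = ~0ull;

// "Slot occupied" already implies "key matches", so contains() needs no
// key_equal at all.
struct no_compare_set {
  std::vector<std::uint64_t> slots;

  explicit no_compare_set(std::size_t n) : slots(n, empty_sentinel) {}

  // hash(x) == x here; the table must be larger than the largest key.
  void insert(std::uint64_t key) { slots[key] = key; }

  bool contains(std::uint64_t key) const
  {
    // Occupancy alone decides membership: no equality check.
    return slots[key] != empty_sentinel;
  }
};
```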

I added a proposal for the perfect hashing probing scheme in #487. Let me know if I missed anything.

Closing this as superseded by #487