[FEATURE]: Support fast insert operations for single set/map
PointKernel opened this issue
**Is your feature request related to a problem? Please describe.**
Currently, cuco hash tables always use atomic CAS for table insertions.
This could introduce unnecessary costs in the case of perfect hashing where we know the hash values of all inputs are distinct and hash collisions would never happen.
**Describe the solution you'd like**
Inspired by @esoha-nvidia's comments on perfect hashing, we could pass an `is_perfect`/`is_perfect_hashing` argument to the `insert` functions. When `is_perfect == true`, the input element is copied to the target bucket directly; otherwise, it is atomically CASed into the desired bucket.
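The two code paths can be sketched on the host for illustration. This is a toy, not cuco's actual device code: the names (`insert`, `empty_sentinel`) are assumptions from this discussion, and a plain slot check stands in for the real atomic CAS.

```cpp
#include <array>
#include <cstddef>

constexpr int empty_sentinel = -1;  // assumed empty-slot marker

// Hypothetical insert with an `is_perfect` flag: perfect hashing means no
// collision is possible, so a plain copy suffices; otherwise we emulate a
// CAS against the empty sentinel.
template <std::size_t N>
constexpr bool insert(std::array<int, N>& slots,
                      std::size_t bucket,
                      int value,
                      bool is_perfect)
{
  if (is_perfect) {  // fast path: direct copy, no atomics needed
    slots[bucket] = value;
    return true;
  }
  // slow path: stand-in for an atomic CAS against the empty sentinel
  if (slots[bucket] == empty_sentinel) {
    slots[bucket] = value;
    return true;
  }
  return false;  // slot already occupied
}

// Compile-time demo: fast path stores directly; CAS path refuses a taken slot.
constexpr bool demo()
{
  std::array<int, 4> slots{empty_sentinel, empty_sentinel, empty_sentinel, empty_sentinel};
  bool ok = insert(slots, 2, 42, true);
  ok      = ok && (slots[2] == 42);
  ok      = ok && !insert(slots, 2, 7, false);
  return ok;
}
```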
**Describe alternatives you've considered**
Apart from adding an extra parameter `is_perfect` to existing APIs, we could also introduce a new API instead.
The host-bulk API could look like:
```cpp
/**
 * @brief Asynchronously inserts all values in the range `[first, last)`.
 *
 * @note All input values must be distinct. Behavior is undefined if duplicates are present in the input.
 *
 * @tparam InputIt Device accessible random access input iterator where
 * <tt>std::is_convertible<std::iterator_traits<InputIt>::value_type,
 * static_map<K, V>::value_type></tt> is `true`
 *
 * @param first Beginning of the sequence of values
 * @param last End of the sequence of values
 * @param stream CUDA stream used for insert
 */
template <typename InputIt>
void fast_insert_async(InputIt first, InputIt last, cuda_stream_ref stream = {}) noexcept;
```
The device API:
```cpp
/**
 * @brief Copies the given element to its corresponding bucket
 *
 * @note This function performs a non-atomic hash table insertion by copying the given element
 * directly to its desired bucket. Behavior is undefined if the given input is not distinct.
 *
 * @tparam Value Input type which is convertible to `value_type`
 *
 * @param value The element to insert
 */
template <typename Value>
__device__ void fast_insert(Value const& value) noexcept;
```
**Additional context**
Would perfect hashing be a valid use case for multiset/multimap as well? To avoid jumping into the black hole directly, we should probably just focus on `static_set`/`static_map` for now.
Not sure if this is too hacky, but couldn't we just instantiate the map with `thread_scope_thread` (and then of course implement that code path)?
> This could introduce unnecessary costs in the case of perfect hashing where we know the hash values of all inputs are distinct and hash collisions would never happen.
Even if all the hash values are distinct, you could still end up with collisions when you do a modulo with the size of the storage, right?
@jrhemstad is correct, which is why the test involves checking for the max of the hash.
In my vision, the test for whether it is okay to skip using atomics is that the hash function's `is_perfect` routine returns `true` and also that the hash function's `max` function, which returns the maximum possible value of the hash, is less than or equal to the number of slots in the hash table.
I think that it's not a good idea to introduce new functions as above because they will not be easily found. A better way could be to add the `max` and `is_perfect` functions to the hasher and allow the hasher to select which path runs. The kernel should make the distinction between using atomics and not using atomics at compile time, with a template parameter. This makes it run faster than if we introduce a branch into the kernel. The whole point is performance!
For hashers which do not implement `max` and `is_perfect`, just assume that we must use atomics. Again, this is already known at compile time because the hasher is a template parameter. We can use SFINAE to detect this.
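A minimal sketch of the detection idea, assuming the member names `is_perfect()` and `max()` from this discussion (not cuco's actual API). A `void_t`-based trait detects the optional interface at compile time; hashers without it fall back to the atomic path.

```cpp
#include <cstddef>
#include <type_traits>

// Detect whether a hasher exposes `is_perfect()` and `max()` (assumed names).
template <class Hasher, class = void>
struct has_perfect_interface : std::false_type {};

template <class Hasher>
struct has_perfect_interface<
  Hasher,
  std::void_t<decltype(std::declval<Hasher const&>().is_perfect()),
              decltype(std::declval<Hasher const&>().max())>> : std::true_type {};

// Atomics can be elided only if the hasher advertises perfection and its
// maximum hash value fits into the table's slot count.
template <class Hasher>
constexpr bool can_elide_atomics(Hasher const& hasher, std::size_t num_slots)
{
  if constexpr (has_perfect_interface<Hasher>::value) {
    return hasher.is_perfect() && hasher.max() < num_slots;
  } else {
    return false;  // unknown hasher: assume atomics are required
  }
}

// Example hashers for illustration only
struct identity_hash {
  constexpr std::size_t operator()(std::size_t k) const { return k; }
  constexpr bool is_perfect() const { return true; }
  constexpr std::size_t max() const { return 1023; }  // largest possible hash
};
struct opaque_hash {  // no is_perfect/max: must use atomics
  constexpr std::size_t operator()(std::size_t k) const { return k * 2654435761u; }
};
```

Because the hasher is a template parameter, `can_elide_atomics` resolves entirely at compile time and can drive a template parameter of the insert kernel, avoiding any runtime branch.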
> instantiate the map with `thread_scope_thread`
That's a brilliant idea. I don't see any big issues for now, apart from the "hacky" fact that using the thread scope effectively means no atomics are involved in the hash table internals.
Using the scope to eliminate atomics requires the caller to know whether or not atomics can be eliminated. I don't think that is a good idea. I don't think that we should require the caller to know about the internals of how the hashing is implemented. It will not be discoverable by users and, if we should change how hashing works in the future, it will break clients.
I would prefer for the hash function to know whether or not it is perfect and what its maximum is and then let the hash table determine whether or not atomics should be used. This is a better separation of concerns.
> Using the scope to eliminate atomics requires the caller to know whether or not atomics can be eliminated.
Yes, that's my point. The philosophy is that cuco provides only the core functionality needed for hash table building/probing but exposes a set of customization points for users to tune their implementation. In your case, you can introspect the data, which gives you the ability to define a custom hasher (a customization point object), and also lets you determine whether or not atomic updates are required (`thread_scope_device` vs. `thread_scope_thread`).
I totally see the benefit of your approach, and I'm still impressed by the implementation. The question I'm asking myself is if this should go directly into cuco or rather a separate utility/helper repo.
I also don't know. But maybe we don't have to solve everything all at once. cuCo already provides multiple hash functions. What about providing the "significant bits" hasher?
If we can get it working and it proves to be pretty fast, maybe add it to the rest of the hash functions?
I'm interested in this idea. This hasher is also introspecting the data in order to determine the significant bits, right? If we can press this implementation into a hasher that is consumable by cuco, we can give it a try. Maybe open a GH issue so we can track this discussion separately.
@esoha-nvidia Daniel and I discussed this offline and we realized that in the case of perfect hashing, a normal hash table like `cuco::static_map` is somewhat bulky and a device vector of payloads probably works best, e.g. with an identity hash:

```cpp
vector[identity_hash(key)] = payload;                 // insert
auto const found = vector[identity_hash(query_key)];  // find
```
On top of non-atomic insertions, it also gets rid of the expensive probing mechanism and thus should be more effective than any hash table.
It would be, in my opinion, nicer if cuCo could determine for itself if the hashing is perfect and then do all that behind the scenes. The advantage would be that the user could swap out a regular hash for a perfect hash with ease, by just selecting a different hash function, instead of needing to rewrite so much. Also, users might still want things like `insert_and_find`, which is already implemented in cuCo.
In the same way that cuCo has optimizations when the key is of a certain bitwidth, it could also have optimizations for when the hash has certain properties (like being perfect, for example).
> It would be, in my opinion, nicer if cuCo could determine for itself if the hashing is perfect
This would be nice to have if we could figure out a way other than doing data introspection during hash construction.
A perfect hash function not only affects the hasher itself but also changes the way the container is probed. With that in mind, instead of fiddling around with the hasher and doing STL-unfriendly operations, I'm considering exposing a new probing scheme called, e.g., `perfect_probing`. It's the users' responsibility to make sure the hasher passed to `perfect_probing` is a perfect hash function, and on the cuco side, we could do what's necessary to guarantee that there are no modulo or atomic operations involved.
> users might still want things like `insert_and_find` that's already implemented in cuCo.
Totally makes sense
It would be valuable if you could benchmark the speedups brought by the device vector lookup table (#475 (comment)) in your case. It represents the SOL performance of perfect hashing IMO
> I'm considering exposing a new probing scheme called, e.g., `perfect_probing`
That's a great idea! The signature could look like this:
```cpp
template <class Hash>
class perfect_probing {
  static constexpr int32_t cg_size = 1;  // There's no point in using a CG size larger than 1
  // ...
};
```
The probing iterator returned by the scheme's `operator()` will simply return `storage[hash(key)]` at the initial probing position and return `end()` (not found) after the first increment.
A user must ensure that they pass a perfect hash function for the given storage capacity; otherwise, behavior is undefined.
With this probing scheme, we eliminate the unnecessarily complex probing code path and also omit the expensive modulo operation.
This, plus the `cuda::thread_scope_thread` trick to eliminate any unnecessary atomic ops, should yield the same performance as a simple vector access, as described by @PointKernel.
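The described iterator semantics can be sketched as a small host-side model (names and structure are assumptions based on this thread, not cuco's implementation): the iterator visits exactly one candidate slot, `hash(key)`, and becomes `end()` after a single increment. No modulo is needed, since a valid perfect hasher already satisfies `hash(key) < capacity`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical perfect_probing scheme: yields one slot, then ends.
template <class Hash>
class perfect_probing {
 public:
  static constexpr std::int32_t cg_size = 1;  // a CG size > 1 has no benefit here

  class iterator {
   public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);  // "end" marker
    constexpr explicit iterator(std::size_t slot) : slot_{slot} {}
    constexpr std::size_t operator*() const { return slot_; }  // candidate slot index
    constexpr iterator& operator++()                           // single probe only
    {
      slot_ = npos;
      return *this;
    }
    constexpr bool operator==(iterator other) const { return slot_ == other.slot_; }

   private:
    std::size_t slot_;
  };

  constexpr explicit perfect_probing(Hash hash) : hash_{hash} {}

  template <class Key>
  constexpr iterator operator()(Key const& key) const
  {
    return iterator{hash_(key)};  // no modulo: hash(key) < capacity is required
  }
  constexpr iterator end() const { return iterator{iterator::npos}; }

 private:
  Hash hash_;
};

// Example hasher and a compile-time check of the one-probe contract.
struct trivial_hash {
  constexpr std::size_t operator()(std::size_t k) const { return k; }
};

constexpr bool probes_single_slot(std::size_t key)
{
  perfect_probing<trivial_hash> probing{trivial_hash{}};
  auto it        = probing(key);
  bool const hit = (*it == key);  // first probe lands on hash(key)
  ++it;
  return hit && (it == probing.end());  // second probe terminates the search
}
```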
> This would be nice to have if we could figure out a way other than doing data introspection during hash construction.
Perfect hashing without data introspection? That should be impossible unless your hash function is the identity function, right? Anyway, it is not the concern of cuCo which hash function the user chooses. cuCo should be able to work with any valid hash function!
A new probing scheme seems like a good solution. Should it also affect the equality operation? Under perfect hashing, you don't need to check for equality, right?
The three things that the user must do in order to elide the use of atomics:
- hash must be perfect (`hash(x) == hash(y) => x == y`)
- number of slots in the hash table must be larger than the range of the hash (`hash(x) < number_of_slots` for all x)
- range of hash can be written lock-free
The user can ensure the first two, but what about the third? Maybe cuCo can provide a function for this? `is_lock_free<key_type>()`. It's a `constexpr` function. I know that this might be identical to `cuda::atomic::is_lock_free` or whatever function cuda/std provides, but cuCo ought to provide its own, in case the implementation of the hash table changes how it works.
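The suggested helper could be as small as the following host-side sketch (a hypothetical API, not something cuco provides today): cuco owns the name, and for now it simply delegates to the standard library's compile-time lock-freedom query.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical cuco-owned query: today it forwards to std::atomic, but
// clients calling cuco's name won't break if the internals change later.
template <class Key>
constexpr bool is_lock_free()
{
  return std::atomic<Key>::is_always_lock_free;
}
```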
**Benchmarks**
Here we compare a groupby operation with and without perfect hashing:
The groupby operation in SQL, which uses hashing, has improved from 7.448ms to 5.351ms, an improvement of 2.097ms which is 39% faster overall. Examining just the insert, we see that it has gone from 5.060ms to 3.072ms, which is basically the vast majority of that 2.097ms. (The rest is just noise in measurement.) This is a 64.7% improvement in hash insertion time.
Here it is for lookup:
The version with atomics has perfect hashing disabled and takes 943.444us. The one without atomics has perfect hashing enabled and takes 817.397us. Here the advantage comes not from eliding the atomics, which are not used during lookup, but from the ability to elide the equality operator. The equality operator requires a dereference and reading of additional memory when cuCo is used in "row hasher" mode, which is what my code does and also what cuDF does. This is a more modest gain of just 15%.
> `cuda::thread_scope_thread` trick
BTW, one problem with this approach is that it is a trick. The need for this is going to be non-obvious to users. And you're depending on the fact that `thread_scope_thread` does not use atomics, which is perhaps true for current architectures but maybe not in the future. This is overloading the meaning of `thread_scope_thread`. Not lovely.
Just put in each probing scheme a method `constexpr bool is_perfect();` (private, if you like). Let it be true for the perfect probing and false for the others. Behind the scenes, do what you want to implement it, including using `thread_scope_thread`. But don't make that something that the user has to know about.
It goes without saying that all this must be a branch at compile time, right? The insert kernel should be templated on whether or not there is perfect hashing. You can't do that branch in the kernel or it will rob you of the performance gains!
> BTW, one problem with this approach is that it is a trick.
Agreed. cuco should take care of the thread scope internally for perfect hashing.
Also, considering `double_hashing` is one of the probing schemes, `perfect_hashing` is a better name than `perfect_probing` IMO.
> Should it also affect the equality operation? Under perfect hashing, you don't need to check for equality, right?
Right, that's another thing we have to figure out. In cuco we pass the `KeyEqual` separately from the `ProbingScheme` (which contains the `Hasher`). If the `Hasher` is a perfect hasher, then we don't need to perform actual key/slot comparisons. Instead, we assume (or better, require) that all keys that hash to the same slot are bitwise identical. Otherwise we run into a race condition if we're not using atomics. So how do we enforce that? It seems weird if we add something like this to the docs: "If the hasher is a perfect hasher, then the `KeyEqual` type passed by the user will be fully ignored and instead bitwise equality will be assumed".
> Maybe cuCo can provide a function for this? `is_lock_free<key_type>()`
Currently, cuco requires `(cuda::)std::atomic<key_type>::is_lock_free() == true` for all key types. That's just a remnant from us switching from `std::atomic` to `std::atomic_ref`. Since the `_ref` means it's only a lightweight wrapper around an existing (unsynchronized) object that then behaves like an atomic, there's no way to place a mutex into the object like is the case with `std::atomic<T>` when `std::atomic<T>::is_lock_free() == false`.
There's one more problem which I'm trying to wrap my head around. If the user specifies `thread_scope_block` but the container instead implicitly uses `thread_scope_thread` (since it's a perfect hash function), would that be a foot gun in terms of memory ordering? `thread_scope_thread` implies a plain `STG` instruction during insert, which (without any additional syncs or fences) is only guaranteed to be visible to the current thread. If the user however assumes `thread_scope_block`, this might lead to some confusing errors.
For example: let's say we have two threads `t1` and `t2`, both calling `set.insert_and_find(42)` on the same `set` in global memory. The correct behavior in this scenario is that only one of the threads wins the slot, while the other just returns the key that is already materialized in the slot, i.e., the return values are either `t1->{42, true}` and `t2->{42, false}` or `t1->{42, false}` and `t2->{42, true}`. However, due to the aforementioned memory ordering shenanigans, we might end up with `t1->{42, true}` and `t2->{42, true}`, since `t2` might not be able to observe the key written by `t1` and vice versa.
Oh, one more thing I forgot to mention: How do we handle heterogeneous insert/lookup (blog post) in this case? We should disable it altogether, right? There are so many things to look out for to get this feature going -.-
- key equal is not needed during insertion but is still desired during the query
- heterogeneous insert should be fine as long as the slot type can be constructed from the iterator's `value_type`
- what are the potential issues supporting heterogeneous lookup for perfect hashing?
I don't think that you need key equal during the lookup. If the slot is not empty, it's a match.
Heterogeneous lookup should work fine. The data is unaffected by the removal of the use of atomics, right? We still read and write the same bits, we just do it without atomics because we have confidence that there will be no collisions.
> If the slot is not empty, it's a match.
💡 Oh, perfect hashing is even more perfect than I thought
Yes it is!
Perfect hashing means that `hash(x) == hash(y) => x == y`. That is, if two elements hash to the same result, they are sure to be the same value. If the hash table is larger than the range of the hash, that is, `hash_table_size > hash(a)` for all `a`, then the insertion into the table, which writes to `hash(x) % hash_table_size`, will write to `hash(x)` itself. So if `hash(y)` is occupied when we look up `y`, it must be that it was occupied by something for which `hash(x) == hash(y)`. And from the definition of perfect hashing above, that implies `x == y`. So you don't have to do a comparison; the fact that the slot is not empty is enough.
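This argument can be checked with a toy direct-indexed table (illustrative code only, not cuco): with a perfect hash and `hash(x) < table size`, occupancy of slot `hash(y)` alone proves the match, so `find` never touches a `KeyEqual`.

```cpp
#include <array>
#include <cstddef>

constexpr int empty_sentinel = -1;  // assumed empty-slot marker

// Direct-indexed table: slot index IS the hash value, no modulo, no probing.
template <std::size_t N>
struct direct_table {
  std::array<int, N> slots{};
  constexpr direct_table()
  {
    for (auto& s : slots) { s = empty_sentinel; }
  }
  constexpr void insert(std::size_t hash_value, int key) { slots[hash_value] = key; }
  // Lookup needs no key comparison: a non-empty slot can only hold the
  // unique key that hashes there.
  constexpr bool contains(std::size_t hash_value) const
  {
    return slots[hash_value] != empty_sentinel;
  }
};

// Compile-time demo with a hypothetical perfect hash mapping 42 -> slot 3.
constexpr bool demo()
{
  direct_table<8> table{};
  table.insert(3, 42);
  return table.contains(3) && !table.contains(5);
}
```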
I added a proposal for the perfect hashing probing scheme in #487. Let me know if I missed anything.
Closing this as superseded by #487.