NVIDIA / cuCollections


[ENHANCEMENT]: Get rid of custom atomic operations once CCCL 2.4 is ready

PointKernel opened this issue · comments

Is your feature request related to a problem? Please describe.

The current cuco implementation uses custom atomic functions, e.g.:

template <typename T>
__device__ constexpr auto compare_and_swap(T* address, T expected, T desired)
{
  // temporary workaround due to performance regression
  // https://github.com/NVIDIA/libcudacxx/issues/366
  if constexpr (sizeof(T) == sizeof(unsigned int)) {
    auto* const slot_ptr           = reinterpret_cast<unsigned int*>(address);
    auto const* const expected_ptr = reinterpret_cast<unsigned int*>(&expected);
    auto const* const desired_ptr  = reinterpret_cast<unsigned int*>(&desired);
    if constexpr (Scope == cuda::thread_scope_system) {
      return atomicCAS_system(slot_ptr, *expected_ptr, *desired_ptr);
    } else if constexpr (Scope == cuda::thread_scope_device) {
      return atomicCAS(slot_ptr, *expected_ptr, *desired_ptr);
    } else if constexpr (Scope == cuda::thread_scope_block) {
      return atomicCAS_block(slot_ptr, *expected_ptr, *desired_ptr);
    } else {
      static_assert(cuco::dependent_false<decltype(Scope)>, "Unsupported thread scope");
    }
  } else if constexpr (sizeof(T) == sizeof(unsigned long long int)) {
    auto* const slot_ptr           = reinterpret_cast<unsigned long long int*>(address);
    auto const* const expected_ptr = reinterpret_cast<unsigned long long int*>(&expected);
    auto const* const desired_ptr  = reinterpret_cast<unsigned long long int*>(&desired);
    if constexpr (Scope == cuda::thread_scope_system) {
      return atomicCAS_system(slot_ptr, *expected_ptr, *desired_ptr);
    } else if constexpr (Scope == cuda::thread_scope_device) {
      return atomicCAS(slot_ptr, *expected_ptr, *desired_ptr);
    } else if constexpr (Scope == cuda::thread_scope_block) {
      return atomicCAS_block(slot_ptr, *expected_ptr, *desired_ptr);
    } else {
      static_assert(cuco::dependent_false<decltype(Scope)>, "Unsupported thread scope");
    }
  }
}
These exist due to a performance regression with cuda::atomic_ref (NVIDIA/cccl#1008). With the fix merged into the main branch, we can get rid of these custom functions once CCCL 2.4 is fetched by rapids-cmake.

Describe the solution you'd like

Replace

__device__ constexpr auto compare_and_swap(T* address, T expected, T desired)

__device__ constexpr void atomic_store(T* address, T value)

__device__ constexpr void update_max(int i, register_type value) noexcept

with corresponding atomic_ref operations.
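For illustration, the replacements might look like the following sketch built on `cuda::atomic_ref` from libcu++. The `compare_exchange_strong`, `store`, and `fetch_max` members are the real libcu++ API, but the surrounding signatures are simplified and the class context (e.g. how `Scope` is supplied, the original `update_max(int i, ...)` indexing) is assumed, not taken from cuco:

```cuda
#include <cuda/atomic>

// Sketch only: Scope is assumed to be a cuda::thread_scope template parameter,
// mirroring how the custom functions currently dispatch on it.
template <typename T, cuda::thread_scope Scope>
__device__ T compare_and_swap(T* address, T expected, T desired)
{
  cuda::atomic_ref<T, Scope> ref{*address};
  // On failure, compare_exchange_strong writes the observed value back into
  // `expected`, so returning it matches the atomicCAS return convention.
  ref.compare_exchange_strong(expected, desired, cuda::memory_order_relaxed);
  return expected;
}

template <typename T, cuda::thread_scope Scope>
__device__ void atomic_store(T* address, T value)
{
  cuda::atomic_ref<T, Scope>{*address}.store(value, cuda::memory_order_relaxed);
}

template <typename T, cuda::thread_scope Scope>
__device__ void update_max(T* address, T value)
{
  // fetch_max is a libcu++ extension available on integral atomics.
  cuda::atomic_ref<T, Scope>{*address}.fetch_max(value, cuda::memory_order_relaxed);
}
```

Note that `constexpr` is dropped here: the `atomic_ref` member functions are runtime device operations, unlike the `if constexpr` dispatch in the current workaround.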

I think #502 unblocks this issue.


Right. But #502 is blocked by some CCCL issues.

#502 is merged now. I went through a lot of trouble to use cuco in my library: I had to create a separate target and constrain its CUDA architecture list to Volta+ only, while the rest of the sources are compiled for all given archs.

If this moves forward, I can get rid of the workaround in my work.


This work is on my radar but I'm not following the issue you brought up here. The current cuco should work fine on Pascal or newer GPUs.


We compile our library for older GPUs as well, such as sm_52; that is what I meant by filtering out these older architectures and creating a separate target for the code that uses cuco. https://github.com/dmlc/dgl/blob/713ffb5714dbd0bd92b26d6412f5040092179ee3/graphbolt/CMakeLists.txt#L61-L69
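The workaround described above boils down to something like the following CMake sketch. Target, file, and variable names here are illustrative, not the actual dgl/graphbolt build code, and it assumes plain numeric entries in `CMAKE_CUDA_ARCHITECTURES` (suffixed forms like `70-real` would need extra string handling):

```cmake
# Sketch: build cuco-dependent sources as a separate target restricted to
# Volta+ (sm_70 and newer), while the rest keeps the full arch list.
set(CUCO_ARCHS "")
foreach(arch IN LISTS CMAKE_CUDA_ARCHITECTURES)
  if(arch GREATER_EQUAL 70)
    list(APPEND CUCO_ARCHS ${arch})
  endif()
endforeach()

# Hypothetical target holding only the .cu files that include cuco headers.
add_library(mylib_cuco OBJECT cuco_ops.cu)
set_target_properties(mylib_cuco PROPERTIES CUDA_ARCHITECTURES "${CUCO_ARCHS}")
```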

Thank you for your work!

Or can I already start using cuco when compiling for sm_52, sm_60, sm_70, sm_80, etc., especially including sm_52, since it does not have the required support? Previously, I was getting a compilation error because the cuda/atomic header could not be included when the minimum architecture was older than Pascal.

Note: it is ensured that the cuco code path is not taken when running on an old GPU; the problem is with the compilation flow, not runtime behavior.
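A runtime guard like the one mentioned here can be as simple as checking the device's compute capability before taking the cuco-backed path. This is a hypothetical sketch (the function name and the fallback policy are assumptions, not dgl code):

```cuda
#include <cuda_runtime.h>

// Hypothetical dispatch helper: only take the cuco-backed code path on
// Volta+ devices, since that object code is compiled for sm_70 and newer.
bool use_cuco_path(int device)
{
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) { return false; }
  return prop.major >= 7;  // Volta is compute capability 7.0
}
```

The compilation problem is separate: even unreachable code paths must still compile for every architecture in the target's arch list, which is why the header inclusion fails for sm_52 regardless of any runtime guard.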


Yeah, I remember this, but I'm not sure whether the new cuda/atomic that comes with the CCCL upgrade solves the issue. Will get back to you.

@mfbalin FYI, the corresponding PR gets merged but I think you still need a separate target for old GPUs.


Is it because NVIDIA/cccl#1083 is still open?


Yes