NVIDIA / cuCollections


[ENHANCEMENT]: Get rid of custom atomic operations once CCCL 2.4 is ready

PointKernel opened this issue · comments

Is your feature request related to a problem? Please describe.

The current cuco implementation uses custom atomic functions, e.g.:

template <typename T>
__device__ constexpr auto compare_and_swap(T* address, T expected, T desired)
{
  // temporary workaround due to performance regression
  // https://github.com/NVIDIA/libcudacxx/issues/366
  if constexpr (sizeof(T) == sizeof(unsigned int)) {
    auto* const slot_ptr           = reinterpret_cast<unsigned int*>(address);
    auto const* const expected_ptr = reinterpret_cast<unsigned int*>(&expected);
    auto const* const desired_ptr  = reinterpret_cast<unsigned int*>(&desired);
    if constexpr (Scope == cuda::thread_scope_system) {
      return atomicCAS_system(slot_ptr, *expected_ptr, *desired_ptr);
    } else if constexpr (Scope == cuda::thread_scope_device) {
      return atomicCAS(slot_ptr, *expected_ptr, *desired_ptr);
    } else if constexpr (Scope == cuda::thread_scope_block) {
      return atomicCAS_block(slot_ptr, *expected_ptr, *desired_ptr);
    } else {
      static_assert(cuco::dependent_false<decltype(Scope)>, "Unsupported thread scope");
    }
  } else if constexpr (sizeof(T) == sizeof(unsigned long long int)) {
    auto* const slot_ptr           = reinterpret_cast<unsigned long long int*>(address);
    auto const* const expected_ptr = reinterpret_cast<unsigned long long int*>(&expected);
    auto const* const desired_ptr  = reinterpret_cast<unsigned long long int*>(&desired);
    if constexpr (Scope == cuda::thread_scope_system) {
      return atomicCAS_system(slot_ptr, *expected_ptr, *desired_ptr);
    } else if constexpr (Scope == cuda::thread_scope_device) {
      return atomicCAS(slot_ptr, *expected_ptr, *desired_ptr);
    } else if constexpr (Scope == cuda::thread_scope_block) {
      return atomicCAS_block(slot_ptr, *expected_ptr, *desired_ptr);
    } else {
      static_assert(cuco::dependent_false<decltype(Scope)>, "Unsupported thread scope");
    }
  }
}
These exist due to a performance regression with cuda::atomic_ref (NVIDIA/cccl#1008). With the fix merged into the main branch, we can get rid of these custom functions once CCCL 2.4 is fetched by rapids-cmake.

Describe the solution you'd like

Replace

__device__ constexpr auto compare_and_swap(T* address, T expected, T desired)

__device__ constexpr void atomic_store(T* address, T value)

__device__ constexpr void update_max(int i, register_type value) noexcept

with corresponding atomic_ref operations.
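For illustration, the replacements might look like the following sketch built on `cuda::atomic_ref` from libcu++. The `compare_exchange_strong`, `store`, and `fetch_max` members are the real libcu++ API, but the surrounding signatures are simplified and the class context (e.g. how `Scope` is supplied, the original `update_max(int i, ...)` indexing) is assumed, not taken from cuco:

```cuda
#include <cuda/atomic>

// Sketch only: Scope is assumed to be a cuda::thread_scope template parameter,
// mirroring how the custom functions currently dispatch on it.
template <typename T, cuda::thread_scope Scope>
__device__ T compare_and_swap(T* address, T expected, T desired)
{
  cuda::atomic_ref<T, Scope> ref{*address};
  // On failure, compare_exchange_strong writes the observed value back into
  // `expected`, so returning it matches the atomicCAS return convention.
  ref.compare_exchange_strong(expected, desired, cuda::memory_order_relaxed);
  return expected;
}

template <typename T, cuda::thread_scope Scope>
__device__ void atomic_store(T* address, T value)
{
  cuda::atomic_ref<T, Scope>{*address}.store(value, cuda::memory_order_relaxed);
}

template <typename T, cuda::thread_scope Scope>
__device__ void update_max(T* address, T value)
{
  // fetch_max is a libcu++ extension available on integral atomics.
  cuda::atomic_ref<T, Scope>{*address}.fetch_max(value, cuda::memory_order_relaxed);
}
```

Note that `constexpr` is dropped here: the `atomic_ref` member functions are runtime device operations, unlike the `if constexpr` dispatch in the current workaround.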

I think #502 unblocks this issue.


Right. But #502 is blocked by some CCCL issues.

#502 is merged now. I went through a lot of trouble to use cuco in my library: I had to create a separate target and constrain its CUDA architecture list to Volta+ only, while the rest of the sources are compiled for all given archs.

If this moves forward, I can get rid of the workaround in my work.


This work is on my radar but I'm not following the issue you brought up here. The current cuco should work fine on Pascal or newer GPUs.


We compile our library for older GPUs as well, such as sm_52; that is what I meant by filtering out these older architectures and creating a separate target for the code that uses cuco. https://github.com/dmlc/dgl/blob/713ffb5714dbd0bd92b26d6412f5040092179ee3/graphbolt/CMakeLists.txt#L61-L69
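The workaround described above boils down to something like the following CMake sketch. Target, file, and variable names here are illustrative, not the actual dgl/graphbolt build code, and it assumes plain numeric entries in `CMAKE_CUDA_ARCHITECTURES` (suffixed forms like `70-real` would need extra string handling):

```cmake
# Sketch: build cuco-dependent sources as a separate target restricted to
# Volta+ (sm_70 and newer), while the rest keeps the full arch list.
set(CUCO_ARCHS "")
foreach(arch IN LISTS CMAKE_CUDA_ARCHITECTURES)
  if(arch GREATER_EQUAL 70)
    list(APPEND CUCO_ARCHS ${arch})
  endif()
endforeach()

# Hypothetical target holding only the .cu files that include cuco headers.
add_library(mylib_cuco OBJECT cuco_ops.cu)
set_target_properties(mylib_cuco PROPERTIES CUDA_ARCHITECTURES "${CUCO_ARCHS}")
```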

Thank you for your work!

Or can I already start using cuco when compiling for sm_52, sm_60, sm_70, sm_80, etc., especially including sm_52, since it does not have the required support? Previously, I was getting a compilation error because the cuda/atomic header could not be included when the minimum architecture was older than Pascal.

Note: it is ensured that the cuco code path is not taken when running on an old GPU; the problem is with the compilation flow, not runtime behavior.
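A runtime guard like the one mentioned here can be as simple as checking the device's compute capability before taking the cuco-backed path. This is a hypothetical sketch (the function name and the fallback policy are assumptions, not dgl code):

```cuda
#include <cuda_runtime.h>

// Hypothetical dispatch helper: only take the cuco-backed code path on
// Volta+ devices, since that object code is compiled for sm_70 and newer.
bool use_cuco_path(int device)
{
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) { return false; }
  return prop.major >= 7;  // Volta is compute capability 7.0
}
```

The compilation problem is separate: even unreachable code paths must still compile for every architecture in the target's arch list, which is why the header inclusion fails for sm_52 regardless of any runtime guard.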


Yeah, I remember this, but I'm not sure whether the new cuda/atomic that comes with the CCCL upgrade solves the issue. Will get back to you.

@mfbalin FYI, the corresponding PR gets merged but I think you still need a separate target for old GPUs.


Is it because NVIDIA/cccl#1083 is still open?


Yes