NVIDIA / cub

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

DeviceMemcpy::Batched supports only memory buffers

mfbalin opened this issue

I am implementing a CSR2COO conversion function, similar in functionality to cusparse::csr2coo, using DeviceMemcpy::Batched by passing a thrust iterator as the input buffer, but I am getting the following error:

static assertion failed with "The batched memcpy only supports copying of memory buffers"

Is such a compile-time check really necessary? The following code snippet shows my use case; the error occurs because input_buffer is a transform iterator, and the conversion from thrust::constant_iterator to void* is not possible:

// Produces the "source buffer" for row i: a constant iterator that yields i,
// so copying buffer_sizes[i] elements from it repeats i that many times.
struct RepeatIndex {
  template <typename IdType>
  __host__ __device__ auto operator()(IdType i) const {
    return thrust::make_constant_iterator(i);
  }
};

// Maps row i to the start of its output range, buffer + indptr[i].
template <typename IdType>
struct OutputBufferIndexer {
  const IdType *indptr;
  IdType *buffer;
  __host__ __device__ auto operator()(IdType i) const {
    return buffer + indptr[i];
  }
};

// The size of row i's buffer is the difference of consecutive row offsets.
template <typename IdType>
struct AdjacentDifference {
  const IdType *indptr;
  __host__ __device__ auto operator()(IdType i) const {
    return indptr[i + 1] - indptr[i];
  }
};

// One logical "buffer" per CSR row; sources, destinations, and sizes are all
// generated on the fly from the row offsets.
thrust::counting_iterator<int64_t> iota(0);

auto input_buffer = thrust::make_transform_iterator(iota, RepeatIndex{});
auto output_buffer = thrust::make_transform_iterator(
    iota, OutputBufferIndexer<int64_t>
    {csr.indptr.Ptr<int64_t>(), ret_row.Ptr<int64_t>()});
auto buffer_sizes = thrust::make_transform_iterator(
    iota, AdjacentDifference<int64_t>{csr.indptr.Ptr<int64_t>()});

// First call with a null temp-storage pointer only queries the required size.
std::size_t temp_storage_bytes = 0;
CUDA_CALL(cub::DeviceMemcpy::Batched(
    nullptr, temp_storage_bytes, input_buffer, output_buffer,
    buffer_sizes, csr.num_rows, stream));

auto temp = allocator.alloc_unique<char>(temp_storage_bytes);

// Second call performs the actual batched copy.
CUDA_CALL(cub::DeviceMemcpy::Batched(
    temp.get(), temp_storage_bytes, input_buffer, output_buffer,
    buffer_sizes, csr.num_rows, stream));

I am also wondering whether using uint32_t for the number of buffers is crucial for performance; as written, this code cannot handle graphs with more than 2^32 vertices.
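
Setting aside the static assertion above, one hedged workaround sketch for the 2^32 limit would be to issue the copy in chunks (this assumes the buffer iterators are random access, reuses the names from the snippet, and needs <algorithm> and <limits>):

// Workaround sketch: split the batch so each call stays within the uint32_t
// buffer-count limit; each chunk re-queries its own temp-storage size.
const std::size_t num_buffers = static_cast<std::size_t>(csr.num_rows);
constexpr std::size_t kMaxBatch = std::numeric_limits<uint32_t>::max();
for (std::size_t offset = 0; offset < num_buffers; offset += kMaxBatch) {
  const auto n = static_cast<uint32_t>(std::min(kMaxBatch, num_buffers - offset));
  std::size_t bytes = 0;
  CUDA_CALL(cub::DeviceMemcpy::Batched(
      nullptr, bytes, input_buffer + offset, output_buffer + offset,
      buffer_sizes + offset, n, stream));
  auto chunk_temp = allocator.alloc_unique<char>(bytes);
  CUDA_CALL(cub::DeviceMemcpy::Batched(
      chunk_temp.get(), bytes, input_buffer + offset, output_buffer + offset,
      buffer_sizes + offset, n, stream));
}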

Given buffer sizes [3 2 1 2], the goal is to produce the output [0 0 0 1 1 2 3 3], essentially implementing a run-length decode as the inverse of run_length_encode.
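
For comparison, the same expansion can also be written as a vectorized binary search over the row offsets; a minimal sketch (the function name is mine), assuming indptr (num_rows + 1 offsets) and row (nnz entries) are device pointers:

#include <thrust/binary_search.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>

// For each nonzero position j, its row index is the number of offsets in
// indptr[1..num_rows] that are <= j, which vectorized upper_bound computes.
void Csr2CooViaSearch(const int64_t *indptr, int64_t num_rows, int64_t nnz,
                      int64_t *row, cudaStream_t stream) {
  thrust::upper_bound(thrust::cuda::par.on(stream),
                      indptr + 1, indptr + num_rows + 1,
                      thrust::make_counting_iterator<int64_t>(0),
                      thrust::make_counting_iterator<int64_t>(nnz),
                      row);
}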

Upon further investigation, the line making the contiguous-buffer assumption is:

using BlevBufferSrcsOutItT = void **;

If the type of the source pointers there could be generalized from void** to some templated type buffer_t*, so that buffer_t could be a thrust::constant_iterator<T>, this could work.
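
For illustration, the suggested generalization might look roughly like this (a hypothetical sketch, not actual CUB code):

// Hypothetical: make the queued source handle a template parameter instead of
// hard-coding void*, so a fancy iterator can be enqueued as a source.
template <typename BufferSrcT>             // BufferSrcT is void* today
using BlevBufferSrcsOutItT = BufferSrcT *; // e.g. thrust::constant_iterator<int64_t> *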

Thank you for your question and for providing more information about your use case, @mfbalin!

DeviceMemcpy::Batched was originally designed with memory buffers in mind, as that was the immediate use case. The algorithm uses optimizations that are applicable only to memory buffers. Specifically, the implementations for copying medium and large buffers (referred to as wlev and blev buffers, respectively) use aliased loads and vectorized stores.
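
For intuition, a vectorized copy amounts to reinterpreting the buffer as wider machine words, which presupposes a real address; a simplified illustration (not the actual CUB code; assumes 16-byte alignment and a byte count divisible by sizeof(uint4)):

// Simplified illustration of a vectorized copy; it requires raw pointers and
// therefore cannot be expressed through a fancy iterator.
__device__ void CopyVectorized(const char *src, char *dst, std::size_t num_bytes) {
  const uint4 *in = reinterpret_cast<const uint4 *>(src);
  uint4 *out = reinterpret_cast<uint4 *>(dst);
  for (std::size_t i = threadIdx.x; i < num_bytes / sizeof(uint4); i += blockDim.x) {
    out[i] = in[i];  // one 16-byte load and one 16-byte store per iteration
  }
}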

That said, we understand that there is demand to extend DeviceMemcpy::Batched to support fancy iterators that "provide" a buffer's data.

I think the best path forward is to specialize a few related methods, such as BatchMemcpyWLEVBuffers and EnqueueBLEVBuffers, for the case where an iterator rather than a pointer is provided for a buffer. BatchMemcpyWLEVBuffers would need to use element-wise copies, as sketched below. EnqueueBLEVBuffers would write out the buffer indexes of the affected buffers instead of raw pointers, and its "consumer", the MultiBlockBatchMemcpyKernel, would also need to be specialized.
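
For illustration, the element-wise path for iterator sources might look roughly like this (hypothetical names and shape, not the actual kernels):

// Hypothetical element-wise fallback: no aliased loads or vectorized stores,
// so any readable input iterator and writable output iterator will do.
template <typename InputIt, typename OutputIt, typename OffsetT>
__device__ void CopyItems(InputIt in, OutputIt out, OffsetT num_items) {
  for (OffsetT i = threadIdx.x; i < num_items; i += blockDim.x) {
    out[i] = in[i];
  }
}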

I will open an issue for it. Please let us know if you are interested in contributing such an extension.