Trove

Trove is a CUDA library that provides efficient vector loads and stores. It works for CUDA architectures 3.0 and above, and uses no CUDA shared memory, making it easy to integrate. It is useful for working with data in an Array of Structures format, and also when writing code that consumes or produces an array of data per CUDA thread.
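
As a point of reference, the sketch below shows the kind of Array of Structures layout this targets, where each thread reads or writes a whole structure at once. The particle struct is purely illustrative and not part of trove.

// Illustrative only: an Array of Structures layout. Loading one particle
// per thread with plain compiler-generated code produces several strided,
// poorly coalesced memory transactions, which is the pattern trove speeds up.
struct particle {
    float pos[3];
    float vel[3];
    int   id;
    int   flags;
};
// particle* particles;  // array of structures in global memory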

How it Works

This functionality is built out of a transposition routine that uses the warp shuffle intrinsic to redistribute data amongst threads in the CUDA warp. For example, when every thread in the warp is loading contiguous structures from an array, the threads collaboratively load all the data needed by the warp, using coalesced memory accesses, then transpose the data to redistribute it to the correct thread.

The following cartoon illustrates how this works, for a notional warp with eight threads.

[Figure: Transpose]
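
To make the transposition concrete, here is a rough sketch, not taken from trove's sources, for a structure of two 32-bit words: two coalesced reads cover the warp's 64 words, and shuffles then hand each lane the two words belonging to its structure. The names pair and load_pair_sketch are illustrative only.

struct pair { int a, b; };

// Illustrative only: lane i of the warp loads pair number i from `words`,
// which points at the first of the warp's 64 consecutive 32-bit words.
__device__ pair load_pair_sketch(const int* words) {
    const unsigned mask = 0xffffffff;
    int lane = threadIdx.x & 31;

    // Coalesced phase: lane i reads words[i] and words[32 + i].
    int lo = words[lane];
    int hi = words[32 + lane];

    // Shuffle phase: lane i actually needs words[2*i] and words[2*i + 1].
    int src0 = 2 * lane;
    int src1 = 2 * lane + 1;

    // Every lane executes every shuffle (all lanes in the mask must
    // participate), then selects the result that holds its word.
    int a_lo = __shfl_sync(mask, lo, src0 & 31);
    int a_hi = __shfl_sync(mask, hi, src0 & 31);
    int b_lo = __shfl_sync(mask, lo, src1 & 31);
    int b_hi = __shfl_sync(mask, hi, src1 & 31);

    pair out;
    out.a = (src0 < 32) ? a_lo : a_hi;
    out.b = (src1 < 32) ? b_lo : b_hi;
    return out;
}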

Performance

Accesses to arrays of structures can be 6X faster than direct memory accesses using compiler-generated loads and stores. The following benchmarks were taken on a Tesla K20c. The structure T being loaded and stored in these benchmarks is composed of one to sixteen 32-bit integers.

[Figures: Contiguous and Random access benchmarks]

High-level Interface

#include <trove/ptr.h>

template<typename T>
__global__ void
trove_gather(const int length, const int* indices,
             trove::coalesced_ptr<T> src, //Wrapped pointers
             trove::coalesced_ptr<T> dst) {
    int global_index = threadIdx.x + blockDim.x * blockIdx.x;
    if (global_index < length) {
        int index = indices[global_index];
        T data = src[index];
        dst[global_index] = data;
    }
}

The high-level interface allows you to load and store structures directly: just wrap your array pointers in trove::coalesced_ptr<T>. You don't need to worry about whether the warp is converged, or whether the addresses you're accessing are contiguous. This interface loses some performance, since it must dynamically check whether the warp is converged and broadcast every thread's pointer to all other threads in the warp, but it is simple to use.

There are two restrictions on T:

  • sizeof(T) must be divisible by 4
  • T must have a default constructor
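
For completeness, a host-side launch of trove_gather might look like the sketch below. The struct item and the helper run_gather are invented for illustration, and the sketch assumes trove::coalesced_ptr<T> can be constructed from a raw device pointer.

// Illustration only: item satisfies both restrictions above
// (sizeof(item) == 16, and it is default constructible).
struct item { int a, b, c, d; };

void run_gather(const int* d_indices, item* d_src, item* d_dst, int n) {
    int block = 256;
    int grid = (n + block - 1) / block;
    // Assumption: coalesced_ptr<T> is constructible from a raw device pointer.
    trove_gather<item><<<grid, block>>>(n, d_indices,
                                        trove::coalesced_ptr<item>(d_src),
                                        trove::coalesced_ptr<item>(d_dst));
}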

Block Interface

It's common for CUDA code to process or produce several values per thread. For example, a merge operation may process 7 values per thread, to increase the amount of serial work. For these cases, we provide a blocked interface that enables efficient block-wise vector loads and stores.

This interface relies on an array type trove::array<T, s>, where T is the type of each element of the array, and s is an integer that statically determines the length of the array. trove::array types can be converted to and from standard C arrays (see array.h), but they have value semantics rather than reference semantics, and they can only be indexed statically.

With this interface, each thread is assumed to read or write contiguous locations in the input or output array. The user is responsible for checking that the warp is converged, which they likely need to do anyway for functional reasons. If the warp is not converged, we provide fallback functions that load and store arrays using compiler-generated code.

#include <trove/block.h>

template<typename T, int s>
__global__ void test_block_copy(const T* x, T* r, int l) {
    typedef trove::array<T, s> s_ary;
    int global_index = threadIdx.x + blockIdx.x * blockDim.x;

    for(int index = global_index; index < l; index += gridDim.x * blockDim.x) {

        //The block memory accesses only function
        //correctly if the warp is converged. Here we check.
        if (trove::warp_converged()) {
            //Warp converged, indices are contiguous, call the fast
            //load and store
            s_ary d = trove::load_array_warp_contiguous<s>(x, index);
            trove::store_array_warp_contiguous(r, index, d);
        } else {
            //Warp not converged, call the slow load and store
            s_ary d = trove::load_array<s>(x, index);
            trove::store_array(r, index, d);
        }
    }
}

Low-level Interface

If you know your warp is converged, and that your threads are all accessing contiguous locations in your array, you can use the low-level interface for maximum performance. By contiguous, we mean that if threads i and j are in the same warp, the pointers p_i and p_j they pass to the library must obey the relation p_j - p_i == j - i. The low-level interface provides the following functions in <trove/aos.h>:

template<typename T> __device__ T load_warp_contiguous(const T* src);

template<typename T> __device__ void store_warp_contiguous(const T& data, T* dest);
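
As a usage sketch (the kernel below is not part of trove), a straight copy shows the contiguity requirement in practice; it assumes the array length is a multiple of the block size so every launched warp is fully converged at the load and store.

#include <trove/aos.h>

// Illustration only: thread i handles element i, so within any warp the
// accessed pointers satisfy p_j - p_i == j - i. `length` is assumed to be
// a multiple of the block size, keeping every launched warp converged.
template<typename T>
__global__ void copy_warp_contiguous(const T* src, T* dst, int length) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < length) {
        T data = trove::load_warp_contiguous(src + i);
        trove::store_warp_contiguous(data, dst + i);
    }
}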

License

BSD 3-Clause "New" or "Revised" License

