It is good to understand the basic ideas of GPU parallelism and to get familiar with some parallel algorithms, and I have coded some simple CUDA C examples. However, from my experience, it is quite difficult to master most CUDA parallel algorithms because of their complexity; for example, sorting in CUDA is far more complicated than sorting serially. To write large-scale code, using CUDA libraries is a must.
Some useful libraries in CUDA (by Nvidia or third parties); a small cuBLAS sketch follows the list:
- cuBLAS -- BLAS
- cuFFT -- 1D, 2D, 3D FFT
- cuSPARSE -- BLAS-like routines for sparse matrices
- cuRAND -- Pseudo- and quasi-random number generation routines
- NPP -- Low-level image processing primitives
- Magma -- GPU + multicore CPU LAPACK routines
- CULA -- Eigensolvers, matrix factorizations and solvers
- ArrayFire -- Framework for data-parallel array manipulation
- cuDNN -- Deep Neural Networks
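To give a flavor of how these libraries are called, here is a minimal cuBLAS sketch: it computes y = 3x + y with SAXPY. Managed memory keeps the host code short (an illustration choice, not a requirement); compile with nvcc -lcublas.
#include <cublas_v2.h>
#include <cstdio>
int main() {
    const int N = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));   // accessible from host and device
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 3.0f;
    cublasSaxpy(handle, N, &alpha, x, 1, y, 1); // y = alpha*x + y, on the GPU
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);                // expect 5.0
    cublasDestroy(handle);
    cudaFree(x); cudaFree(y);
}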
Lower-level libraries:
- thrust -- like the C++ STL -- host-side interface, no kernels to write, but you cannot set launch parameters (e.g. number of blocks, number of threads, shared memory); see the sketch after this list
- CUB -- warp-, block-, and device-level primitives; more control than thrust
- CUDPP -- CUDA Data Parallel Primitives Library
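As an example of the thrust style, this sketch sums a device vector in one host-side call; notice there is no kernel, grid, or block configuration anywhere:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>
int main() {
    thrust::device_vector<int> d(1 << 20, 1);         // a million ones on the device
    int sum = thrust::reduce(d.begin(), d.end(), 0);  // runs on the GPU, result returned to host
    printf("sum = %d\n", sum);                        // expect 1048576
}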
Simple Code Examples:
- Hello world
- Vector Add v1 -- one block, v2 -- several blocks
- Matrix Multiply, global mem and shared mem
- Reduce, 1 block vs arbitrary block
- Scan, 1 block vs arbitrary block
- Histogram, atomic add
- Unified memory
- Stencil 1D
- Radix Sort
I also worked on the homework assignments of the Udacity GPU class cs344. Check my Solutions here. Udacity GPU class projects:
- Map
- 2D Stencil (2D convolution)
- Histogram, Reduce, small Scan
- Histogram, Compact, large Scan, Radix sort
Code:
hello.cu
First program in CUDA: say hello. This is to understand the block-thread structure; note in the output below that the blocks may execute in any order.
#include <cstdio>
__global__ void hello() {   // each thread prints its own coordinates
    printf("Hello world! block ID %d, thread ID %d\n", blockIdx.x, threadIdx.x);
}
int main() {
    hello<<<2, 3>>>();        // <<<number of blocks, threads per block>>>
    cudaDeviceSynchronize();  // flush device-side printf before exit
}
output:
Hello world! block ID 1, thread ID 0
Hello world! block ID 1, thread ID 1
Hello world! block ID 1, thread ID 2
Hello world! block ID 0, thread ID 0
Hello world! block ID 0, thread ID 1
Hello world! block ID 0, thread ID 2
Do some simple calculation with CUDA. The key idea is to let each thread do one job if possible: spread the work across all the threads. Note: allocating memory on the host and device and copying between them is really verbose... (a full host-side driver is sketched after the second kernel below).
Code:
vectorAdd.cu
Allows 1 block
// Kernel: called from the host, runs on the device.
// Assumes a single block, so threadIdx.x is the global element index.
__global__
void vectoradd(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
Code:
vectorAdd2.cu
Allows many blocks
__global__
void vectoradd2(int *a, int *b, int *c, int N) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;  // global index across blocks
    if (idx < N)                                      // guard the tail block
        c[idx] = a[idx] + b[idx];
}
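For completeness, here is a minimal host-side driver for vectoradd2 above, showing the allocate/copy/launch/copy-back boilerplate complained about earlier. The array size and the 256-thread block size are arbitrary illustration choices, and error checking is omitted.
#include <cstdio>
#include <cstdlib>
int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(int);
    int *a = (int*)malloc(bytes), *b = (int*)malloc(bytes), *c = (int*)malloc(bytes);
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }
    int *d_a, *d_b, *d_c;                              // device copies
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice); // host -> device
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
    int threads = 256, blocks = (N + threads - 1) / threads;  // enough blocks to cover N
    vectoradd2<<<blocks, threads>>>(d_a, d_b, d_c, N);
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost); // device -> host
    printf("c[7] = %d\n", c[7]);                       // expect 21
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
}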
My code allows an arbitrary matrix size. The code is modified from the example in the Nvidia CUDA guide.
- Using global memory
Code: Matrix_multiply_gmem.cu
- Using shared memory
Code: Matrix_multiply_shmem.cu
The number of threads in a block needs to be a power of 2 (automatically taken care of in the code); otherwise the result is not correct. Check the Udacity GPU class notes.
Use shared memory for performance. Dynamic shared memory is assigned when calling the kernel: kernel<<<numBlocks, threadsPerBlock, sharedMemBytes>>>().
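For reference, here is a tiled shared-memory matrix multiply in the same spirit. It is a sketch, not the exact Matrix_multiply_shmem.cu: it uses statically sized TILE x TILE tiles rather than the dynamic shared-memory launch parameter, and zero-padding at the edges is what lets it handle arbitrary N.
#define TILE 16
// C = A * B for N x N row-major matrices. Each block computes one TILE x TILE
// tile of C, staging tiles of A and B through shared memory.
// Launch with dim3 block(TILE, TILE), grid((N+TILE-1)/TILE, (N+TILE-1)/TILE).
__global__ void matmul_shmem(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x, bRow = t * TILE + threadIdx.y;
        // Stage one tile of each input; zero-pad past the matrix edge.
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();                     // tile fully loaded before use
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                     // done with this tile before reloading
    }
    if (row < N && col < N)
        C[row * N + col] = acc;
}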
- Reduce with small input (1 block)
Code: reduce_small.cu
- Reduce with large input (many blocks)
Code: reduce_large.cu
Use a temporary array to store the reduced result from each block, then reduce that array in a second pass.
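A sketch of the per-block step, in the spirit of reduce_large.cu (not the exact code):
// Tree reduction in shared memory: each block folds blockDim.x elements into
// one partial sum. blockDim.x must be a power of 2 for the halving loop.
// Launch: reduce_sum<<<blocks, threads, threads * sizeof(int)>>>(in, blockSums, N);
// then reduce blockSums with a second launch (or on the CPU).
__global__ void reduce_sum(const int *in, int *blockSums, int N) {
    extern __shared__ int sdata[];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (idx < N) ? in[idx] : 0;     // pad the tail with the identity
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];     // fold the upper half onto the lower
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];     // one partial sum per block
}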
Interestingly, the scan code I studied from the Nvidia website (http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html) has a bug and does not work correctly. How could you write temp[pout*n+thid] += temp[pin*n+thid - offset];? It should be temp[pout*n+thid] = temp[pin*n+thid] + temp[pin*n+thid - offset];.
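For reference, here is the GPU Gems double-buffered (Hillis-Steele) exclusive scan with that fix applied; a sketch assuming one block of n threads and 2*n floats of dynamic shared memory (launch as scan_naive<<<1, n, 2*n*sizeof(float)>>>):
__global__ void scan_naive(float *g_odata, const float *g_idata, int n) {
    extern __shared__ float temp[];           // two buffers of n floats each
    int thid = threadIdx.x;
    int pout = 0, pin = 1;
    temp[pout * n + thid] = (thid > 0) ? g_idata[thid - 1] : 0;  // shift right: exclusive scan
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;                      // swap the double buffers
        pin = 1 - pout;
        if (thid >= offset)                   // the corrected line:
            temp[pout * n + thid] = temp[pin * n + thid] + temp[pin * n + thid - offset];
        else
            temp[pout * n + thid] = temp[pin * n + thid];
        __syncthreads();
    }
    g_odata[thid] = temp[pout * n + thid];
}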
- Scan with small input (1 block)
Code: scan_small.cu
Scan with one block, since the number of elements to scan is small.
- Scan with large input (many blocks)
Code: scan_large.cu
Use a temporary array to store the sum from each block, exclusive-scan that array, then add each block's offset back to its elements.
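The recombination step can be as simple as this sketch (blockOffsets is my name for the exclusive scan of the per-block sums; launch with the same grid and block shape as the first scan pass):
// Add each block's offset back to every element the block scanned.
__global__ void add_block_offsets(int *data, const int *blockOffsets, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        data[idx] += blockOffsets[blockIdx.x];
}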
Code: atomic_histogram.cu
atomicAdd() serializes the parallel updates, significantly slowing the code. Here is an algorithm to make it fast (https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/).
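A sketch of the shared-atomics idea from that post, assuming 256 bins over byte-valued input; the grid-stride loop lets any grid size cover the data:
#define NBINS 256
__global__ void histogram_shared(const unsigned char *in, unsigned int *bins, int N) {
    __shared__ unsigned int local[NBINS];              // private copy per block
    for (int i = threadIdx.x; i < NBINS; i += blockDim.x)
        local[i] = 0;
    __syncthreads();
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += gridDim.x * blockDim.x)
        atomicAdd(&local[in[i]], 1u);                  // contention stays within the block
    __syncthreads();
    for (int i = threadIdx.x; i < NBINS; i += blockDim.x)
        atomicAdd(&bins[i], local[i]);                 // one global merge per block per bin
}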
Code: unified_memory.cu
This is a pain-killer that significantly simplifies the memory allocation-and-copy process!
It is instructive to compare zero-copy and unified memory. With zero-copy, the memory is allocated in page-locked (pinned) fashion on the host, and a device thread has to reach across the bus to get the data. No guarantee of coherence is provided: for instance, the host could change the content of the pinned memory while the device is reading it.
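A minimal unified-memory sketch (assumes a GPU that supports managed memory); note the single pointer and the synchronize before the host reads again:
#include <cstdio>
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
int main() {
    const int N = 1 << 20;
    float *x;
    cudaMallocManaged(&x, N * sizeof(float));  // one allocation, visible to host and device
    for (int i = 0; i < N; ++i) x[i] = 1.0f;   // host writes directly, no cudaMemcpy
    scale<<<(N + 255) / 256, 256>>>(x, N);
    cudaDeviceSynchronize();                   // required before the host touches x again
    printf("x[0] = %f\n", x[0]);               // expect 2.0
    cudaFree(x);
}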
Code: Stencil1d_sum.cu
Do a 1D stencil sum. Use shared memory for performance.
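A sketch of such a kernel, assuming radius 3 and 256-thread blocks (not necessarily identical to Stencil1d_sum.cu): each block stages its elements plus a halo of RADIUS elements on each side into shared memory, then every thread sums its 2*RADIUS+1 neighbors.
#define RADIUS 3
#define BLOCK 256
// Launch with exactly BLOCK threads per block.
__global__ void stencil1d(const int *in, int *out, int N) {
    __shared__ int tile[BLOCK + 2 * RADIUS];
    int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
    int l = threadIdx.x + RADIUS;                    // local index, past the left halo
    tile[l] = (g < N) ? in[g] : 0;
    if (threadIdx.x < RADIUS) {                      // first RADIUS threads load both halos
        int left = g - RADIUS, right = g + BLOCK;
        tile[l - RADIUS] = (left >= 0) ? in[left] : 0;
        tile[l + BLOCK] = (right < N) ? in[right] : 0;
    }
    __syncthreads();
    if (g < N) {
        int sum = 0;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[l + k];                      // sum the 2*RADIUS+1 window
        out[g] = sum;
    }
}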
Code: radix_sort.cu
Not easy to do compared with the CPU version... Use histogram, compact, scan, then move. Check here for an explanation of radix sort, but my method is slightly different.
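For flavor, here is the single-block bit-split step that the histogram/compact/scan/move pipeline builds on. This is a sketch of the standard technique, not my exact code: launch with exactly n threads and n*sizeof(unsigned int) shared bytes; keys whose bit is 0 end up first, in stable order.
__global__ void split_bit(const unsigned int *in, unsigned int *out, int n, int bit) {
    extern __shared__ unsigned int scan[];    // inclusive scan of the bit flags
    int t = threadIdx.x;
    unsigned int key = in[t];
    unsigned int flag = (key >> bit) & 1;     // 1 if this key's bit is set
    scan[t] = flag;
    __syncthreads();
    for (int off = 1; off < n; off *= 2) {    // Hillis-Steele inclusive scan
        unsigned int v = (t >= off) ? scan[t - off] : 0;
        __syncthreads();
        scan[t] += v;
        __syncthreads();
    }
    unsigned int onesBefore = scan[t] - flag; // ones strictly before position t
    unsigned int totalZeros = n - scan[n - 1];
    unsigned int dst = flag ? totalZeros + onesBefore  // ones go after all zeros
                            : t - onesBefore;          // zeros keep stable order
    out[dst] = key;                           // scatter to the split position
}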