README

This is a CUDA C parallel implementation of the merge operation.
The parallelization is based on the co-rank function described by Siebert et al. in their paper "Efficient MPI Implementation of a Parallel, Stable Merge Algorithm".
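To illustrate the idea, here is a minimal host-side sketch in plain C. It is not taken from this repository's sources. It assumes the textbook form of the co-rank binary search. The names `co_rank`, `merge_sequential` and `merge_partitioned` are hypothetical. For an output position `k`, `co_rank` returns how many of the first `k` merged elements come from `A`; the remaining `k - i` come from `B`. With that, each worker can merge its own slice of the output independently. The loop over `t` stands in for the threads that a CUDA kernel would run concurrently.

```c
#include <assert.h>

/* Co-rank: number of elements of A among the first k elements of the
   stable merge of sorted arrays A (length m) and B (length n). */
int co_rank(int k, const int *A, int m, const int *B, int n) {
    assert(k >= 0 && k <= m + n);
    int i = k < m ? k : m;              /* initial guess: i = min(k, m) */
    int j = k - i;
    int i_low = (k - n) > 0 ? k - n : 0; /* smallest feasible i */
    int j_low = (k - m) > 0 ? k - m : 0; /* smallest feasible j */
    for (;;) {
        if (i > 0 && j < n && A[i - 1] > B[j]) {
            /* i is too large: move half the remaining range from A to B */
            int delta = (i - i_low + 1) >> 1;
            j_low = j;
            i -= delta;
            j += delta;
        } else if (j > 0 && i < m && B[j - 1] >= A[i]) {
            /* j is too large: move half the remaining range from B to A */
            int delta = (j - j_low + 1) >> 1;
            i_low = i;
            i += delta;
            j -= delta;
        } else {
            return i;                   /* both boundary conditions hold */
        }
    }
}

/* Plain stable sequential merge; ties are taken from A first. */
void merge_sequential(const int *A, int m, const int *B, int n, int *C) {
    int i = 0, j = 0, k = 0;
    while (i < m && j < n) C[k++] = (A[i] <= B[j]) ? A[i++] : B[j++];
    while (i < m) C[k++] = A[i++];
    while (j < n) C[k++] = B[j++];
}

/* Split the output into `parts` contiguous slices and merge each one
   independently. Each iteration of the loop touches a disjoint slice of C,
   so the iterations could run in parallel (one per thread). */
void merge_partitioned(const int *A, int m, const int *B, int n,
                       int *C, int parts) {
    assert(parts > 0);
    int total = m + n;
    int chunk = (total + parts - 1) / parts;   /* ceil(total / parts) */
    for (int t = 0; t < parts; ++t) {
        int k_curr = t * chunk < total ? t * chunk : total;
        int k_next = (t + 1) * chunk < total ? (t + 1) * chunk : total;
        int i_curr = co_rank(k_curr, A, m, B, n);
        int i_next = co_rank(k_next, A, m, B, n);
        int j_curr = k_curr - i_curr;
        int j_next = k_next - i_next;
        merge_sequential(A + i_curr, i_next - i_curr,
                         B + j_curr, j_next - j_curr, C + k_curr);
    }
}
```

For example, with `A = {1, 7, 8, 9, 10}` and `B = {7, 10, 10, 12}`, `co_rank(3, A, 5, B, 4)` returns 2: the first three merged elements are two from `A` (1 and 7) and one from `B` (7). The `>` in the first test and the `>=` in the second are what place equal elements from `A` before those from `B`, which keeps the merge stable.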

Project Structure

The Python scripts used to plot the speedups are provided here.

The CUDA C and C source code can be found here.

Tests were run on an Intel(R) Xeon(R) CPU E5440 and an NVIDIA GeForce GTX 980 GPU.
Integer arrays with $2^{24}$ elements were used.

Serial Merge takes 73.16 ms to complete.
Basic Parallel Merge takes 16.77 ms, about 4.4x faster than Serial Merge.
Tiled Parallel Merge takes 1.11 ms, about 66x faster than Serial Merge.
Circular Parallel Merge takes 2.14 ms, about 34x faster than Serial Merge.
