The Sparkler miniapp computes a specialized dense matrix-matrix product C = A^T A, where the elements of the matrix A are small integers. This operation mimics the matrix product used to compute the Custom Correlation Coefficient (CCC) in the CoMet computational genomics code.
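As a rough illustration of the operation itself (not of how Sparkler implements it), the following pure-Python sketch forms C = A^T A for a tiny integer matrix. The function name `gram_matrix` and the example values are made up for this illustration; the real code runs the product as GEMM calls on GPUs or through a BLAS library.

```python
# Illustrative only: a tiny pure-Python version of C = A^T A.
# A is stored as a list of rows; the example element values are assumptions.

def gram_matrix(a):
    """Return C = A^T A for a matrix given as a list of equal-length rows."""
    num_rows = len(a)
    num_cols = len(a[0]) if num_rows else 0
    c = [[0] * num_cols for _ in range(num_cols)]
    for i in range(num_cols):
        for j in range(i, num_cols):
            # Entry (i, j) is the dot product of columns i and j of A.
            dot = sum(a[k][i] * a[k][j] for k in range(num_rows))
            c[i][j] = dot
            c[j][i] = dot  # C is symmetric, so mirror the upper triangle.
    return c


A = [
    [1, 0, 2],
    [0, 1, 1],
    [2, 2, 0],
    [1, 0, 1],
]
C = gram_matrix(A)
for row in C:
    print(row)
# [6, 4, 3]
# [4, 5, 1]
# [3, 1, 6]
```

Because C is symmetric, only the upper triangle needs to be computed; the sketch mirrors it into the lower triangle for readability.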
The build requires MPI and make. The default build requires CUDA 9.2 or higher for NVIDIA GPUs. An alternative build path for CPU-only execution requires an installed BLAS library, preferably multithreaded if the runs use more than one core per MPI rank.
To build for a cluster, modify the Makefile to reflect your MPI and CUDA installs and then type "make" (GPU case) or "env USE_GPU=NO make" (CPU-only case).
Running the GPU executable requires one or more NVIDIA GPUs. Volta V100 or later (compute capability 7.0 or higher) GPUs are preferred; older GPUs will run much slower due to lack of tensor core hardware.
A run is composed of a series of iterations, each representing a global dense matrix-matrix product. A single iteration is composed of steps, each corresponding to a single GEMM executed on each GPU.
Command-line options:
--num_vector - number of vectors (half the number of columns of matrix A)
--num_field - number of fields (the number of rows of A)
--num_iterations - number of (global) matrix products done
Example:
mpirun -n 2 ./exec.cpu --num_vector 1000 --num_field 2000 --num_iterations 2
Reported values are:
TF - total number of GEMM floating point operations, in units of 10^12 operations
GEMM sec - total time spent in GPU GEMM operations
GEMM TF/sec - GEMM teraflop rate, ratio of TF to GEMM sec
total sec - total runtime
hash - a hash of the results computed, for evaluating correctness
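The TF value can be cross-checked by hand. The sketch below assumes that A has num_field rows and 2 * num_vector columns (as the option descriptions state) and that each multiply-add counts as 2 operations. That operation count is inferred from the sample output in this README, where it reproduces the reported TF 4608.000, and is not taken from the Sparkler source. The function names and the GEMM time used at the end are hypothetical.

```python
def gemm_teraflops(num_vector, num_field, num_iterations):
    """Estimate the total GEMM operation count in units of 1e12.

    Assumption: one iteration is a (2*num_vector) x (2*num_vector) product
    with inner dimension num_field, at 2 operations per multiply-add.
    """
    num_cols = 2 * num_vector
    ops_per_iteration = 2 * num_cols * num_cols * num_field
    return ops_per_iteration * num_iterations / 1e12


def gemm_rate(tf, gemm_sec):
    """GEMM TF/sec is the ratio of TF to GEMM sec."""
    return tf / gemm_sec


tf = gemm_teraflops(num_vector=4000, num_field=90000, num_iterations=400)
print(f"TF {tf:.3f}")  # TF 4608.000

# Hypothetical GEMM time, only to show how the rate is formed.
print(f"GEMM TF/sec {gemm_rate(tf, gemm_sec=100.0):.3f}")
```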
There are 9 test cases, with scores totaling 30.
Score | num_vector | num_field | num_iterations |
---|---|---|---|
2 | 6000 | 1250 | 512 |
2 | 2000 | 7500 | 768 |
3 | 3200 | 5500 | 1024 |
3 | 2000 | 57600 | 200 |
2 | 1000 | 192000 | 300 |
1 | 1000 | 32000 | 600 |
1 | 800 | 50000 | 100 |
6 | 1000 | 12800 | 2560 |
10 | 8000 | 21600 | 100 |
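To run the whole table, the command lines can be generated rather than typed by hand. This is a minimal sketch: the last table column is taken to be the --num_iterations value, and the executable name and rank count are borrowed from the example command earlier in this README and are assumptions to adjust for your build and cluster. The helper name `command_for` is hypothetical.

```python
# (score, num_vector, num_field, num_iterations), copied from the table above.
TEST_CASES = [
    (2, 6000, 1250, 512),
    (2, 2000, 7500, 768),
    (3, 3200, 5500, 1024),
    (3, 2000, 57600, 200),
    (2, 1000, 192000, 300),
    (1, 1000, 32000, 600),
    (1, 800, 50000, 100),
    (6, 1000, 12800, 2560),
    (10, 8000, 21600, 100),
]


def command_for(case, executable="./exec.cpu", num_ranks=2):
    """Build the mpirun command line for one test case.

    The defaults for executable and num_ranks are assumptions, not
    requirements of the test suite.
    """
    _score, num_vector, num_field, num_iterations = case
    return (
        f"mpirun -n {num_ranks} {executable} "
        f"--num_vector {num_vector} --num_field {num_field} "
        f"--num_iterations {num_iterations}"
    )


total_score = sum(case[0] for case in TEST_CASES)
print(f"{len(TEST_CASES)} cases, total score {total_score}")
for case in TEST_CASES:
    print(command_for(case))
```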
Sample output:
summit-batch4$ mpirun -np 4 ./exec.cpu --num_vector 4000 --num_field 90000 --num_iterations 400
num_vector 4000 num_field 90000 num_iterations 400 num_proc 1
Iteration 1 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 2 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 4 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 8 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 16 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 32 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 64 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 128 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 256 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
Iteration 400 of 400, step 1 of 1, elapsed sec XXXXXX: setup... GEMM... check...
TF 4608.000 GEMM sec XXXXXX GEMM TF/sec XXXXXX total sec XXXXXX hash 435999930709XXXXXX
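When collecting results from several runs, the final summary line can be split into its fields with a small parser. The sketch below assumes the line always has the layout shown in the sample output above; the regular expression and the function name `parse_summary` are illustrative and not part of Sparkler. Values are kept as strings so that the hash can be compared exactly.

```python
import re

# Field labels follow the sample summary line; some labels contain spaces.
SUMMARY_RE = re.compile(
    r"TF\s+(?P<tf>\S+)\s+"
    r"GEMM sec\s+(?P<gemm_sec>\S+)\s+"
    r"GEMM TF/sec\s+(?P<gemm_rate>\S+)\s+"
    r"total sec\s+(?P<total_sec>\S+)\s+"
    r"hash\s+(?P<hash>\S+)"
)


def parse_summary(line):
    """Return the summary fields as a dict of strings, or None if no match."""
    match = SUMMARY_RE.search(line)
    return match.groupdict() if match else None


sample = (
    "TF 4608.000 GEMM sec XXXXXX GEMM TF/sec XXXXXX "
    "total sec XXXXXX hash 435999930709XXXXXX"
)
fields = parse_summary(sample)
print(fields["tf"])    # 4608.000
print(fields["hash"])  # 435999930709XXXXXX
```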