hsmajlovic / smith-waterman-optimization

Performance optimizations for the linear-gap Smith-Waterman algorithm

Smith-Waterman performance optimizations

Created as part of the CSC586C (Data Management on Modern Computer Architectures) course at the University of Victoria, under the supervision of Sean Chester.

Contributors: Ze Shi Li, Rohith Pudari, Haris Smajlovic.

This repository contains performance optimizations for the linear-gap Smith-Waterman algorithm.

Optimizations

Note: Check whether your CPU supports SSE2/SSE4, AVX2, and/or AVX-512 first. Without that support, the SIMD benchmark will still run, but it will not produce valid results.

So far we have baseline, bithacked, bithacked-striped, windowed, multicore-windowed, simd-alpern, and multicore-alpern versions of the same algorithm for the CPU, and cuda-alpern, cuda-antidiagonal, cuda-hypothetical, and cuda-windowed versions for the GPU:

  • Baseline: A straightforward baseline version of the SW algorithm.
  • Bithacked: The baseline version with heavy branching replaced by bithacks.
  • Bithacked-striped: The bithacked version with a more L1-cache-friendly access pattern.
  • Windowed: For scenarios in which only the best match score is needed, not the traceback.
  • Multicore windowed: The windowed technique above spread across multiple CPU cores.
  • SIMDed (Alpern technique): A SIMDed baseline using the widest registers your CPU supports and the inter-alignment technique from Alpern et al.
  • Multicore (Alpern technique): The SIMDed technique above spread across multiple CPU cores.
  • CUDA (Alpern technique): A SIMTed baseline using an inter-alignment technique akin to the SIMD Alpern technique above.
  • CUDA windowed: A SIMT implementation of the windowed version above.
  • CUDA antidiagonal: A two-dimensional parallelisation exposing both inter- and intra-alignment parallelism.
  • CUDA hypothetical: A three-dimensional parallelisation in which all data dependencies are ignored and parallelism is exploited to the fullest extent.
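To make the baseline and bithacked variants concrete, here is a minimal score-only sketch of the linear-gap recurrence, alongside a branchless max of the kind the bithacked versions rely on. The scoring constants and function names are illustrative only, not the repository's actual code or parameters:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative scoring constants (not the repo's actual parameters).
constexpr int MATCH = 2, MISMATCH = -1, GAP = -1;

// Branchless max, as used by the bithacked variants: when d = x - y is
// negative, (d >> 31) is all ones, so x - d == y; otherwise the mask is 0.
// Assumes x - y does not overflow a 32-bit int.
int branchless_max(int x, int y) {
    int d = x - y;
    return x - (d & (d >> 31));
}

// Score-only linear-gap Smith-Waterman. Without traceback, two rolling
// rows of the DP matrix suffice (this is what makes windowing possible).
int sw_best_score(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1, 0), curr(b.size() + 1, 0);
    int best = 0;
    for (size_t i = 1; i <= a.size(); ++i) {
        for (size_t j = 1; j <= b.size(); ++j) {
            int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? MATCH : MISMATCH);
            int up   = prev[j] + GAP;
            int left = curr[j - 1] + GAP;
            // Local alignment clamps every cell at zero.
            curr[j] = std::max({0, diag, up, left});
            best = std::max(best, curr[j]);
        }
        std::swap(prev, curr);
    }
    return best;
}
```

The Alpern-style variants run many such independent DPs in lock-step, one alignment per SIMD lane (or CUDA thread), rather than vectorising within a single matrix.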

Testing

To benchmark the CPU solutions, use perf (Linux only for now -- sorry, non-Linux users): compile benchmark.cpp and then run perf on the executable.

To test the GPU solutions, just compile and run benchmark.cu.

Don't forget to provide the version string as a CLI argument.

CPU Examples

  • For baseline set version=base in your bash
  • For bithacked set version=bithacked in your bash
  • For bithacked-striped set version=bithacked-striped in your bash
  • For windowed set version=windowed in your bash
  • For multicore-windowed set version=multicore-windowed in your bash
  • For SIMDed (Alpern technique) set version=simd-alpern in your bash
  • For multicore (Alpern technique) set version=multicore-alpern in your bash

and then do

exe_path=benchmark_${version}.out && \
g++ -D THRD_CNT=2 -march=native -fopenmp -Wall -Og -std=c++17 -o $exe_path benchmark.cpp && \
perf stat -e cycles:u,instructions:u ./$exe_path $version

GPU Examples

  • For CUDA (Alpern technique) set version=cuda-alpern in your bash
  • For CUDA windowed set version=cuda-windowed in your bash
  • For CUDA antidiagonal set version=cuda-antidiagonal in your bash
  • For CUDA hypothetical set version=cuda-hypothetical in your bash

and then do

exe_path=benchmark_${version}.out && \
nvcc -O3 \
    -D QUANTITY_SCALE=13 \
    -D SIZE_SCALE=10 \
    -D XBLOCK_SIZE_SCALE=5 \
    -D YBLOCK_SIZE_SCALE=3 \
    -D ZBLOCK_SIZE_SCALE=2 \
    -D WINDOW_SIZE_SCALE=7 \
    -o $exe_path benchmark.cu && \
./$exe_path $version

Results

Version                              insn per cycle   Seconds
Bithacked-striped (optimised-base)   2.77             41.50
SIMD-alpern                          2.95             3.76
multicore-alpern                     2.26             2.45

About

License: Apache License 2.0


Languages

C++ 69.7%, CUDA 30.3%