hsmajlovic / smith-waterman-optimization

Performance optimizations for the linear-gap Smith-Waterman algorithm

Smith-Waterman performance optimizations

Created as part of the CSC586C (Data Management on Modern Computer Architectures) course at the University of Victoria, under the supervision of Sean Chester.

Contributors: Ze Shi Li, Rohith Pudari, Haris Smajlovic.

This repository contains performance optimizations for the linear-gap Smith-Waterman algorithm.

Optimizations

Note: Check whether your CPU supports SSE2/SSE4, AVX2, and/or AVX-512 first. Without that support, the SIMD benchmark will still run, but it will not produce valid results.

So far we have baseline, bithacked, bithacked-striped, windowed, multicore-windowed, simd-alpern, and multicore-alpern versions of the same algorithm for the CPU, and cuda-alpern, cuda-antidiagonal, cuda-hypothetical, and cuda-windowed versions for the GPU:

  • Baseline: A straightforward baseline version of the SW algorithm.
  • Bithacked: The baseline version with heavy branching replaced by bithacks.
  • Bithacked-striped: The bithacked version with a more L1-cache-friendly access pattern.
  • Windowed: For scenarios in which only the best match score is needed, not the traceback.
  • Multicore windowed: The windowed technique above spread across multiple CPU cores.
  • SIMDed (Alpern technique): A SIMDed baseline using the widest registers your CPU supports and the inter-alignment technique from Alpern et al.
  • Multicore (Alpern technique): The SIMDed technique above spread across multiple CPU cores.
  • CUDA (Alpern technique): A SIMTed baseline using an inter-alignment technique akin to the SIMD Alpern technique above.
  • CUDA windowed: A SIMT implementation of the windowed version above.
  • CUDA antidiagonal: A two-dimensional parallelisation exposing both inter- and intra-alignment parallelism.
  • CUDA hypothetical: A three-dimensional parallelisation in which all data dependencies are ignored and parallelism is exploited to the fullest extent.
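To make the baseline and bithacked variants concrete, here is a minimal score-only sketch of the linear-gap recurrence, alongside a branchless max of the kind the bithacked versions rely on. The scoring constants and function names are illustrative only, not the repository's actual code or parameters:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative scoring constants (not the repo's actual parameters).
constexpr int MATCH = 2, MISMATCH = -1, GAP = -1;

// Branchless max, as used by the bithacked variants: when d = x - y is
// negative, (d >> 31) is all ones, so x - d == y; otherwise the mask is 0.
// Assumes x - y does not overflow a 32-bit int.
int branchless_max(int x, int y) {
    int d = x - y;
    return x - (d & (d >> 31));
}

// Score-only linear-gap Smith-Waterman. Without traceback, two rolling
// rows of the DP matrix suffice (this is what makes windowing possible).
int sw_best_score(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1, 0), curr(b.size() + 1, 0);
    int best = 0;
    for (size_t i = 1; i <= a.size(); ++i) {
        for (size_t j = 1; j <= b.size(); ++j) {
            int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? MATCH : MISMATCH);
            int up   = prev[j] + GAP;
            int left = curr[j - 1] + GAP;
            // Local alignment clamps every cell at zero.
            curr[j] = std::max({0, diag, up, left});
            best = std::max(best, curr[j]);
        }
        std::swap(prev, curr);
    }
    return best;
}
```

The Alpern-style variants run many such independent DPs in lock-step, one alignment per SIMD lane (or CUDA thread), rather than vectorising within a single matrix.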

Testing

To benchmark the CPU solutions, use perf (Linux only for now -- sorry, non-Linux users): compile benchmark.cpp and then run perf on the executable.

To test the GPU solutions, just compile and run benchmark.cu.

Don't forget to provide the version string as a CLI argument.

CPU Examples

  • For baseline set version=base in your bash
  • For bithacked set version=bithacked in your bash
  • For bithacked-striped set version=bithacked-striped in your bash
  • For windowed set version=windowed in your bash
  • For multicore-windowed set version=multicore-windowed in your bash
  • For SIMDed (Alpern technique) set version=simd-alpern in your bash
  • For multicore (Alpern technique) set version=multicore-alpern in your bash

and then do

exe_path=benchmark_${version}.out && \
g++ -D THRD_CNT=2 -march=native -fopenmp -Wall -Og -std=c++17 -o $exe_path benchmark.cpp && \
perf stat -e cycles:u,instructions:u ./$exe_path $version

GPU Examples

  • For CUDA (Alpern technique) set version=cuda-alpern in your bash
  • For CUDA windowed set version=cuda-windowed in your bash
  • For CUDA antidiagonal set version=cuda-antidiagonal in your bash
  • For CUDA hypothetical set version=cuda-hypothetical in your bash

and then do

exe_path=benchmark_${version}.out && \
nvcc -O3 \
    -D QUANTITY_SCALE=13 \
    -D SIZE_SCALE=10 \
    -D XBLOCK_SIZE_SCALE=5 \
    -D YBLOCK_SIZE_SCALE=3 \
    -D ZBLOCK_SIZE_SCALE=2 \
    -D WINDOW_SIZE_SCALE=7 \
    -o $exe_path benchmark.cu && \
./$exe_path $version

Results

Version                              insn per cycle   Seconds
Bithacked-striped (optimised-base)   2.77             41.50
SIMD-alpern                          2.95             3.76
multicore-alpern                     2.26             2.45

About

License: Apache License 2.0


Languages

C++ 69.7%, CUDA 30.3%