Read Mapping Project

Description

This project implements a fast and accurate read mapping algorithm for DNA sequences using C++.

The following algorithms are used:

Suffix array construction using the SAIS algorithm.
Smith-Waterman algorithm for string alignment.

Environment

I developed the project on an M1 Macbook Pro using Apple's clang g++ compiler (version 14.0.3). The project should be able to be compiled on any system with a C++ compiler that supports C++17.

Compiling

To compile the project, run the following command in the project directory:

mkdir build
cd build
cmake ..
make
cd ..

Two binaries will be generated:

debug - a debug build of the project, with debug symbols and no optimizations.
release - a release build of the project, with -O3 optimizations. Please use this binary to process large inputs.

Running

The program takes two arguments: the path to the reference genome, and the path to the reads file. Under the project directory, run the following command:

mkdir output
./build/release <path to reference genome> <path to reads file>

The output will be written to output/output.txt.

To reformat the output to the submission .zip format, run the following bash script:

./format_output.sh

References and Acknowledgements

When developing the project, I acknowledge using the following sources in learning relevant algorithms and adapting codes from them.

SAIS algorithm for suffix array construction.
SAIS algorithm C++ implementation.
C++17 parallelized std algorithms for Apple clang compiler.
Smith-Waterman Algorithm for string alignment.

ykozxy / ReadMapping