maickrau / PartSortBWT

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

PartSortBWT

Low memory BWT computation. Usage:

std::string input;
input.push_back(1); // N
input.push_back(2); // A
input.push_back(3); // C
input.push_back(4); // G
input.push_back(5); // T
input.push_back(0); // $
std::string output;
PartSortBWT(input, output); // computes BWT, stores in output
PartSortBWT(input, input); // computes BWT, stores in the same string and overwrites

The input and output can be the same string. For a real example see src/main.cpp. The input string must end with 0, and all other characters must be between 1-5 representing the alphabet $NACGT.

Benchmark

Measured with /usr/bin/time -v bin/main input_file > /dev/null using commit 489c112 on a laptop.

Dataset Size Time (h:mm:ss) Memory
Human assembly 6.0 Gbp 1:15:51 9.0 Gb
GRCh38+alts 3.3 Gbp 0:40:01 4.9 Gb
Assembly graph hpc 3.8 Gbp 0:31:11 6.5 Gb
CHM13 3.1 Gbp 0:23:45 4.6 Gb
CHM13 hpc 2.1 Gbp 0:12:02 3.6 Gb
CHM13 chr1 248 Mbp 0:01:40 375 Mb
CHM13 chr1 hpc 175 Mbp 0:00:49 291 Mb
ACACACAC... 1 Mbp 0:05:25 11 Mbp
ACACACAC... 100 kbp 0:00:02 3 Mbp

Method

Classify suffixes according to their first 4 characters. First collect and sort all suffixes starting with AAAA to get the first x characters of the BWT, then suffixes with AAAC to get the next y characters, etc. Repeat for all 4-character prefixes. Special case for homopolymer runs of >=4 chars: sort the suffixes starting at the ends of the homopolymer runs, then do induced sorting to sort the rest of the homopolymer runs.

Expected memory use for a random DNA string is input + output + ~0.875n bytes. If input == output then total memory use is ~1.875n bytes. Less random strings have higher memory use and worst case is dinucleotide repeating (eg ACACACAC...) using ~8n bytes. Expected runtime for a random string is O(n (log n)^2) and worst case is O(n^2 log n).

About

License:MIT License


Languages

Language:C++ 98.4%Language:Makefile 1.6%