felipelouza / egsa

Generalized enhanced suffix array construction in external memory [CPM'13, AMB 2017]

Home Page:https://doi.org/10.1186/s13015-017-0117-9

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

egsa tool

This software is the implementation of eGSA [1, 2], an external memory algorithm to construct generalized enhanced suffix arrays.

Given a collection of K strings, eGSA outputs the:

  • Generalized suffix array
  • LCP-array
  • Burrows-Wheeler transform (optional)

install

git clone https://github.com/felipelouza/egsa.git
cd egsa
make

run

Given a collection of K strings concatenated in a single file FILE.

./egsa [options] FILE K

Available options:

-h    this help message
-b    output BWT (ext: .bwt)
-c    check SA and LCP
-v    verbose output

Notes

  • K=0 gives all strings as input.
  • Supported extensions are: .txt, .fasta and .fastq.
  • Strings are separated per '\0' (new line) in .txt.

output

The generalized enhanced suffix array is stored in a binary file in the same directory of FILE:

FILE.K.gesa

The generalized BWT (option -b) is stored in a textual file in the same directory of FILE:

FILE.K.bwt

Notes

  • The output binary FILE.K.gesa can be read as a struct type according to the typedef in: lib/utils.h

  • Temporary files are stored in subfolders:

partition/
tmp/

options

Output the BWT inside GESA structure, that is GESA[i] = (text, suffix, lcp, bwt):

make clean
make compile BWT=1

One can inform the maximum available internal memory to be used (in MB):

make clean
make compile MEMLIMIT=10

debug

To see a more detailed execution output use:

make clean
make compile DEBUG=1

external resources

We have included the source codes of the following algorithms:

  • gSACA-K: generalized suffix array construction algorithm [3]
  • lcp-PHI: LCP-array construction algorithm given the suffix array [4]

citation

Please, if you use egsa tool in an academic setting cite the following paper [1]:

@article{Louza2017d,
	author = {Louza, Felipe A. and Telles, Guilherme P. and Hoffmann, Steve and Ciferri, Cristina D. A.},
	title={Generalized enhanced suffix array construction in external memory},
	journal={Algorithms for Molecular Biology},
	year = {2017},
	pages = {26:1--26:16},
	doi = {10.1186/s13015-017-0117-9}
}

references

[1] Louza, F. A., Telles, G. P., Hoffmann, S., Ciferri, C. D. A. (2017). Generalized enhanced suffix array construction in external memory. Algorithms for Molecular Biology 12(1): 26:1-26:16. link

[2] Louza, F. A., Telles, G. P., Ciferri, C. D. A. (2013). External Memory Generalized Suffix and LCP Arrays Construction. In Proc. CPM (pp. 201-210). link

[3] Louza, F. A., Gog, S., Telles, G. P. (2016). Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci. 678: 22-39, github

[4] Kärkkäinen, J., Manzini, G., & Puglisi, S. J. (2009). Permuted Longest-Common-Prefix Array. In G. Kucherov & E. Ukkonen (Eds.), Proc. CPM (Vol. 5577, pp. 181–192).

thanks

Thanks to Fabio Garofalo, Giovanna Rosone, Giovanni Manzini, Nicola Prezza and Guilherme P. Telles by helpful suggestions and debugging.

About

Generalized enhanced suffix array construction in external memory [CPM'13, AMB 2017]

https://doi.org/10.1186/s13015-017-0117-9

License:GNU General Public License v3.0


Languages

Language:C 95.9%Language:Shell 2.3%Language:Makefile 1.8%