BurntSushi / suffix

Fast suffix arrays for Rust (with Unicode support).

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Performances

delehef opened this issue · comments

Hi,

I needed a high performance suffix array lib some time ago for a bioinformatics project, and I ended up working with a crude FFI with divsufsort.

Just so you know, its performances while creating the array is blowing suffix out of the water. So feel free to take a glance if it interests you or to close this issue otherwise ;)

Just so you know, its performances while creating the array is blowing suffix out of the water.

Can you please show your benchmarks? Otherwise, I'm not really sure what to do with this issue.

My issue was a bit off the mill and I apologize if I went off as confrontational.

Here is a toy repo with a minimal example; you just have to gunzip the msy.fasta.gz file before either cargo bench- or cargo run-ing. I advise you to run it normally first, the benchmark mode is awfully long in this case.

With my computer, I get these results:

18 matches
PT3.080668480S seconds for divsufsort.
18 matches
PT9.860067663S seconds for suffix.

If you would like to try an example with larger files, you can download this one ftp://ftp.ensemblgenomes.org/pub/plants/release-40/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz for instance, and just adapt the filename in main.rs, line 11.

Does divsufsort support the same stuff as this crate? Does it handle Unicode correctly?

Neither; it is much more crude and it only works when looking for a bytes needle in a bytes array. I just thought you might be interesting by their algorithm.

Thanks! I'll take a look at this when I get a chance.

You're most welcome! On an aside, thanks for everything you're doing in the Rust community.

I'm going to close this out as stale. If I ever get motivated enough to dive back into this space, then I'll investigate performance more closely, but I'm otherwise happy with where this crate is currently at.