ekimb / rust-seq2kminmers

Iterate k-min-mers from a DNA sequence in Rust

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

rust-seq2kminmers

Rust crate to convert any DNA sequence to an iterator of its k-min-mers. It is useful for rust-mdbg.

Intro

This library performs a variant of Run Length Encoding (called Homopolymer Compression) followed by fast rolling hash of DNA sequences. Its purpose is to very quickly convert any DNA sequence into an ordered list of hashes. This then allows to sketch the DNA sequence and do various computations faster, on the sketch rather the initial nucleotides.

For a primer on k-min-mers and minimizer-space representations, see: https://www.cell.com/cell-systems/fulltext/S2405-4712(21)00332-X?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS240547122100332X%3Fshowall%3Dtrue

Features

  • Uses NtHash1 behind the scenes.

  • Allows fast iteration in HPC-space but reports positions in original sequence space.

  • SIMD implementation for Homopolymer Compression

  • SIMD implementation for 32-bit NtHash1 and some 31-bit variant of NtHash2 (modified from bcgsc/ntHash#9)

Performance

Very good, thanks for asking 😉 Per thread, scalar ntHash is around 100-200 MB/sec, SIMD ntHash around 1 GB/sec, HPC alone around 4 GB/sec

Testing

cargo run # to run a small test in main.rs

cargo run -- [filename.fasta] [nb_threads] # to run a bigger test using an input file (also main.rs)

cargo test # run some tests 

cargo bench # run some throughput benchmarks

About

Iterate k-min-mers from a DNA sequence in Rust


Languages

Language:Rust 100.0%