Feature requests: R bindings and early stopping

traversc opened this issue a year ago

Nice work!

Would it be possible to get R bindings for this? I put together a minimal example here: https://github.com/traversc/WavefrontAlignR (feel free to do whatever with it)

I'd also like to request an "early stopping" feature: if the best possible alignment distance exceeds a user-defined threshold, stop the alignment and return a flag value (like INT_MAX). Assuming this doesn't add too much overhead, it would be useful because I'm mostly interested in finding only highly similar sequences between two sets.
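To make the requested behaviour concrete, here is a minimal sketch of the semantics using a plain dynamic-programming Levenshtein distance. It is not the wavefront algorithm and not part of WFA2; the function name bounded_edit_distance and its max_dist parameter are hypothetical, chosen only for illustration.

```cpp
#include <algorithm>
#include <climits>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper (not a WFA2 function): Levenshtein distance with early stopping.
// Returns INT_MAX as soon as the distance is guaranteed to exceed max_dist.
int bounded_edit_distance(const std::string& a, const std::string& b, int max_dist) {
    const int n = static_cast<int>(a.size());
    const int m = static_cast<int>(b.size());

    // The length difference is a lower bound on the edit distance.
    if (std::abs(n - m) > max_dist) return INT_MAX;

    // Two-row DP: prev holds row i-1, curr holds row i.
    std::vector<int> prev(m + 1), curr(m + 1);
    for (int j = 0; j <= m; ++j) prev[j] = j;

    for (int i = 1; i <= n; ++i) {
        curr[0] = i;
        int row_min = curr[0];
        for (int j = 1; j <= m; ++j) {
            const int sub = prev[j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0);
            curr[j] = std::min({sub, prev[j] + 1, curr[j - 1] + 1});
            row_min = std::min(row_min, curr[j]);
        }
        // Row minima never decrease from one row to the next, so once the
        // whole row is above the threshold the final distance must be too.
        if (row_min > max_dist) return INT_MAX;
        std::swap(prev, curr);
    }
    return prev[m] <= max_dist ? prev[m] : INT_MAX;
}
```

With this sketch, bounded_edit_distance("kitten", "sitting", 3) returns 3, while the same pair with a threshold of 2 returns INT_MAX. For an all-vs-all comparison where most pairs are dissimilar, most calls would exit after a few rows.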

Last, I ran a quick benchmark comparing WFA2 against an existing R package. Is this a fair comparison? The code used to run WFA2 is here: https://github.com/traversc/WavefrontAlignR/blob/main/src/WFA_bindings.cpp

# Benchmark for a 10,000 x 10,000 alignment
# "seqs" is a vector of DNA sequences on average 43 bp long
library(WavefrontAlignR)
library(stringdist)
library(tictoc)

# WFA2 levenshtein
tic()
y1 <- WavefrontAlignR::edit_dist_matrix(seqs, seqs)
toc()
# 191.452 sec elapsed, 522324 alignments / sec

# stringdist levenshtein
tic()
y2 <- stringdist::stringdistmatrix(seqs, seqs, method = "lv", nthread=1)
toc()
# 677.356 sec elapsed, 147633 alignments / sec

Santiago Marco-Sola · Answer 1 · Tue Sep 26 2023 00:37:59 GMT+0800 (China Standard Time)

Sorry for the late reply (I was about to send this message, and then it slipped my mind...).

(1) R bindings

Yes, sure, that would be awesome. At the moment, I don't have the bandwidth to implement this feature, but it is definitely something I would like to have. Thanks for the example and the request.

If you feel like it, you could wrap your example under bindings/r (linked to the current version) and make a pull request. I would be very happy if you take over and take the credit for it. Only if you want to.

(2) Early stop

There is actually one here: the function wavefront_aligner_set_max_alignment_steps lets you set the maximum number of steps (i.e., the max alignment score) to reach before quitting. Have a look and let me know if that is what you are looking for.

Let me know,
Thanks.

(3) (NxN) benchmark

In principle, it seems fair to me (edit, score only, ...).