smarco / WFA-paper

Wavefront alignment algorithm (WFA): Fast and exact gap-affine pairwise alignment


Allow open cost `o=0`

RagnarGrootKoerkamp opened this issue · comments

I don't immediately see a reason why allowing o=0 (in combination with e>0) would break the algorithm. Allowing this would make the WFA algorithm work for computing normal edit distance as well.

Currently, I get this assertion failure:

Mismatch/Gap scores must be strictly positive (X=1,O=0,E=1)
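For context, a plain Gotoh-style DP (my own sketch, not WFA itself) illustrates why the penalties x=1, o=0, e=1 should reproduce unit-cost edit distance: with a zero opening cost, every 1-base gap extension costs exactly 1, the same as a mismatch.

```python
def edit_distance(a, b):
    # Classic unit-cost Levenshtein DP, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]

def gotoh(a, b, x, o, e):
    # Gap-affine alignment cost (Gotoh's three-matrix recurrence),
    # minimizing: mismatch x, gap open o, gap extend e
    INF = float("inf")
    n, m = len(a), len(b)
    M = [[INF] * (m + 1) for _ in range(n + 1)]  # best overall
    I = [[INF] * (m + 1) for _ in range(n + 1)]  # gap in a (insertion)
    D = [[INF] * (m + 1) for _ in range(n + 1)]  # gap in b (deletion)
    M[0][0] = 0
    for i in range(1, n + 1):
        D[i][0] = M[i][0] = o + e * i
    for j in range(1, m + 1):
        I[0][j] = M[0][j] = o + e * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            I[i][j] = min(M[i][j - 1] + o + e, I[i][j - 1] + e)
            D[i][j] = min(M[i - 1][j] + o + e, D[i - 1][j] + e)
            sub = M[i - 1][j - 1] + (x if a[i - 1] != b[j - 1] else 0)
            M[i][j] = min(sub, I[i][j], D[i][j])
    return M[n][m]

# With o=0, the gap-affine cost collapses to plain edit distance
for a, b in [("GATTACA", "GCATGCU"), ("AAAA", "A"), ("ACGT", "ACGT")]:
    assert gotoh(a, b, x=1, o=0, e=1) == edit_distance(a, b)
```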

Very nice observation!

Indeed, you can. There is nothing (but the check I put here) that prevents doing that. Now, it seemed to me a little bit overkill to use an affine-gap setup to solve the edit-distance problem. The WFA can be easily adapted to edit distance and be much more efficient using specialized code. But you can do it, definitely ;-)

Note that for the next release, we will release a version with support for edit-distance, dual-cost affine gap, etc. So, we keep that use-case in mind.

Best,

That's good to hear!

Since you're comparing to edlib in your paper and are 10-20x faster than them in some cases, I think it makes sense to expose the edit-distance capabilities of WFA, even if your implementation may not be optimal for this case. (Unless there are other implementations that would be much faster still for edit distance, but I'm not aware of any currently.)

Btw, I did some testing last week and, since the runtime scales as O(ns), changing the gap opening score to 0 significantly speeds up the algorithm -- IIRC I've seen up to a 3x speedup in some cases.

Well, that is interesting, although it honestly sounds like too much.
What workload are you using for the testing (i.e., number of sequences and length)?

I just used generate_dataset to generate some sets of total size 10M and edit distance 5%, like in the paper.
Then I get the following:

n=100000, o=1: 180ms/call

../wfa/bin/align_benchmark -i data/input/x100-n100000-e0.05.seq -a gap-affine-wfa --affine-penalties="0,1,1,1"
  => Time.Alignment      18.40 s  ( 98.07 %) (  100  calls, 184.02 ms/call {min171.62ms,Max356.17ms})

n=100000, o=0: 64ms/call

../wfa/bin/align_benchmark -i data/input/x100-n100000-e0.05.seq -a gap-affine-wfa --affine-penalties="0,1,0,1"
  => Time.Alignment       6.43 s  ( 95.94 %) (  100  calls,  64.28 ms/call {min60.03ms,Max200.00ms})

For length n=1000, I get 1.71ms/call vs 0.65ms/call.
For length n=100 the difference is much smaller: 2.5us/call vs 2us/call.
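A rough back-of-envelope for the n=100000 case (my own assumptions: errors split roughly 1/3 mismatch, 2/3 indel, mostly 1-base gaps, as the CIGAR breakdown below suggests). Since the runtime is O(ns), the expected speedup should track the ratio of alignment scores under the two penalty sets:

```python
# Hypothetical estimate, not a measurement: predicted score ratio
# between penalties "0,1,1,1" (o=1) and "0,1,0,1" (o=0).
n, err = 100_000, 0.05
edits = n * err
mism, indel = edits / 3, 2 * edits / 3  # assumed error mix

def expected_score(x, o, e):
    # each mismatch costs x; each 1-base indel costs o + e
    return mism * x + indel * (o + e)

s_open1 = expected_score(x=1, o=1, e=1)  # penalties "0,1,1,1"
s_open0 = expected_score(x=1, o=0, e=1)  # penalties "0,1,0,1"
ratio = s_open0 / s_open1                # ~0.6
```

So O(ns) alone predicts only a ~1.7x speedup, while I measured ~2.8x; the remainder may come from smaller wavefronts fitting caches better, but that is just a guess.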

It would be nice to know the average computed distance, to see if the correlation is linear. I'd add it myself, but it doesn't look so simple to propagate it everywhere.

That is impressive. Although it makes sense, it has a more significant effect than I would have thought.

About the feature request, you could try to use the option --check to get that functional information:

$> ./bin/align_benchmark -a gap-affine-wfa -i ../data/sim.l100.n100K.e5.seq --check
...processed 10000 reads (benchmark=21311.789 reads/s;alignment=248357.859 reads/s)
...processed 20000 reads (benchmark=22019.213 reads/s;alignment=255832.703 reads/s)
...processed 30000 reads (benchmark=22175.760 reads/s;alignment=258166.031 reads/s)
...processed 40000 reads (benchmark=22350.201 reads/s;alignment=260282.422 reads/s)
...processed 50000 reads (benchmark=22478.818 reads/s;alignment=261597.484 reads/s)
...processed 60000 reads (benchmark=22526.113 reads/s;alignment=262233.875 reads/s)
...processed 70000 reads (benchmark=22550.518 reads/s;alignment=262340.125 reads/s)
...processed 80000 reads (benchmark=22585.389 reads/s;alignment=262475.250 reads/s)
...processed 90000 reads (benchmark=22607.168 reads/s;alignment=262652.219 reads/s)
...processed 100000 reads (benchmark=22613.654 reads/s;alignment=262761.000 reads/s)
[Benchmark]
=> Total.reads            100000
=> Time.Benchmark         4.42 s  (    1   call,   4.42  s/call {min4.42s,Max4.42s})
  => Time.Alignment     380.53 ms (  8.61 %) (  100 Kcalls,   3.81 us/call {min497ns,Max47.02us})
[Accuracy]
 => Alignments.Correct      100.00 Kalg        (100.00 %) (samples=100K{mean1.00,min1.00,Max1.00,Var0.00,StdDev0.00)}
 => Score.Correct           100.00 Kalg        (100.00 %) (samples=100K{mean1.00,min1.00,Max1.00,Var0.00,StdDev0.00)}
   => Score.Total             2.94 Mscore uds.            (samples=100K{mean29.41,min0.00,Max40.00,Var37.00,StdDev6.00)}
     => Score.Diff            0.00 score uds.  (  0.00 %) (samples=0,--n/a--)}
 => CIGAR.Correct             0.00 alg         (  0.00 %) (samples=0,--n/a--)}
   => CIGAR.Matches           9.69 Mbases      ( 96.95 %) (samples=9M{mean1.00,min1.00,Max1.00,Var0.00,StdDev0.00)}
   => CIGAR.Mismatches      155.38 Kbases      (  1.55 %) (samples=155K{mean1.00,min1.00,Max1.00,Var0.00,StdDev0.00)}
   => CIGAR.Insertions      149.28 Kbases      (  1.49 %) (samples=149K{mean1.00,min1.00,Max1.00,Var0.00,StdDev0.00)}
   => CIGAR.Deletions       149.65 Kbases      (  1.50 %) (samples=149K{mean1.00,min1.00,Max1.00,Var0.00,StdDev0.00)}

You are looking for "Score.Total", which is the sum of all the alignment scores computed (over all the pair alignments).

Cheers,