smarco / WFA2-lib

WFA-lib: Wavefront alignment algorithm library v2

Which CIGAR is returned when multiple alignments have equal scores?

brendanf opened this issue · comments

It is generally the case that there may be multiple alignments with the same score, and of course all of these are equally correct. However, my main purpose of alignment is to calculate the pairwise identity score as used in, e.g., BLAST, which is equivalent to $1 - \frac{\text{edit distance}}{\text{alignment length}}$. I am using end2end, i.e. fully global, alignment.
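As a concrete illustration of the identity definition above, here is a minimal Python sketch. It assumes the alignment is available as an expanded string of operations using the letters `M` (match), `X` (mismatch), `I` (insertion) and `D` (deletion); the function name `pairwise_identity` is hypothetical and not part of WFA2 or edlib.

```python
from collections import Counter


def pairwise_identity(ops: str) -> float:
    """Return 1 - edit_distance / alignment_length for an expanded op string.

    `ops` is assumed to hold one character per alignment column,
    drawn from M (match), X (mismatch), I (insertion), D (deletion).
    """
    counts = Counter(ops)
    unknown = set(counts) - set("MXID")
    if unknown:
        raise ValueError(f"unexpected operations: {sorted(unknown)}")
    if not ops:
        raise ValueError("empty alignment has no defined identity")
    # Every non-match column contributes one unit of edit distance.
    edits = counts["X"] + counts["I"] + counts["D"]
    return 1.0 - edits / len(ops)
```

For example, `pairwise_identity("XX")` gives 0.0 while `pairwise_identity("DMI")` gives about 0.33, even though both strings describe alignments with edit distance 2.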

Actually maximizing this pairwise ID score would involve a variable gap penalty depending on the total number of gaps, which is not what I am interested in. However, it would be nice to, after calculating the edit distance, be able to calculate the pairwise identity using either the minimum or maximum alignment length corresponding to that edit distance. I have noticed in my testing that WFA2 usually, but not always, gives shorter alignments than edlib; this suggests that both are using traceback algorithms which simply try to find one alignment, and do not worry about any limits on the length.

Hi,

Let me answer the question in a different order:

(1) WFA2 usually, but not always, gives shorter alignments than edlib; this suggests that both are using traceback algorithms which simply try to find one alignment, and do not worry about any limits on the length.

Usually, the traceback is implemented to retrieve only one of the optimal alignments (as there can be many). Each implementation makes arbitrary decisions about how to break ties. Thus, each implementation "prefers" a different optimal alignment. In the case of WFA2, the "priorities" are hardcoded here. As you can see, WFA2 prefers deletions over insertions.

(2) maximizing this pairwise ID score [...]

It seems that you want to go beyond minimizing the edit distance. From all the edit-optimal alignments (let's say with distance $e$), you would like to minimize $id = \frac{e}{\text{alignment length}} = \frac{e}{X+I+M}$. Because $M = |Q| - D - X$ and $I = e - X - D$, we get $id = \frac{e}{e - 2D + |Q| - X}$ (where $|Q|$ is the query length). Hence, you want to minimize the number of deletions $D$ and mismatches $X$ (in the case of ties for the optimal alignment of distance $e$). This would imply tracing back all optimal alignments and selecting the one with min{ $2D+X$ } :-)

No, I haven't implemented that here, sorry.

Chances are that, by swapping the query and text arguments (or by redefining the priorities here), the CIGARs returned by WFA2 will be closer to what you are expecting.

Thanks for the reply! That pointer to the code definitely helps me to understand.

To clarify a little, for me "query" and "text" are interchangeable; I have a large number of sequences, and I want to find pairwise ID between all of them where the distance is less than a threshold. Thus the "alignment length" I am concerned with is defined symmetrically for the two sequences, $L = X + I + D + M$. For fixed $e$, the shortest (alt. longest) length should be achieved by minimizing (alt. maximizing) $I + D$. For global alignment $I - D$ is a constant depending on the lengths of the two sequences, so minimizing either one minimizes the other.
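The length argument above can be spelled out in a few lines. This is only a sketch under an assumed convention: the two sequences are called $A$ and $B$, a deletion $D$ consumes one character of $A$ only, and an insertion $I$ consumes one character of $B$ only (the names $A$ and $B$ are introduced here for illustration).

$$
M + X + D = |A|, \qquad M + X + I = |B| \quad\Longrightarrow\quad I - D = |B| - |A|
$$

$$
L = M + X + I + D = |A| + I = |B| + D
$$

So for a fixed pair of sequences the alignment length depends only on the number of insertions (equivalently, deletions), and for a fixed edit distance $e = X + I + D$ each extra insertion/deletion pair replaces two mismatches and makes the alignment one column longer.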

If I understand correctly, the priorities you linked to are such that a higher number means that WFA2 prefers to select that operation during traceback? So when a mismatch is a valid option, it chooses that; if no mismatch is a valid option it chooses a deletion, and then if no deletion is possible it chooses an insertion?

If I am reading the corresponding code in edlib correctly, this is exactly opposite to the prioritization of edlib, and that makes sense given the difference in the results I get between them. WFA2 prioritizes mismatches over indels, so at each step it is greedily "trying" to make the alignment as short as possible; edlib prioritizes indels over mismatches, so at each step it is greedily "trying" to make the alignment as long as possible. Neither priority is guaranteed to lead to the true minimum/maximum alignment length, since sometimes choosing an indel instead of a mismatch now will lead to an opportunity later in the traceback to take two mismatches instead of two indels. Actually maximizing or minimizing the length would require visiting all cells that can be on any score-optimal alignment path, and WFA2 does not do that (nor does edlib) because it is unnecessary for most uses.

I can envision a modified algorithm that tracks the minimum and maximum length for each cell during the initial calculation, and could then report those along with the optimum score without needing to do a traceback. I may try to implement that at some point, but for now I think you've answered my question. Thanks!
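One way such a modification could look is sketched below in Python. This is not WFA2 or edlib code: it is a plain quadratic dynamic-programming edit distance (not a wavefront algorithm) in which every cell also carries the shortest and longest alignment length over all score-optimal paths reaching it. The function name `edit_distance_with_length_bounds` is hypothetical.

```python
def edit_distance_with_length_bounds(a: str, b: str) -> tuple[int, int, int]:
    """Return (edit distance, min alignment length, max alignment length).

    Each DP cell stores (distance, min_len, max_len), where the lengths
    range over all optimal paths from the origin to that cell.
    """
    m = len(b)
    # First row: aligning the empty prefix of `a` needs j gap columns.
    prev = [(j, j, j) for j in range(m + 1)]
    for i in range(1, len(a) + 1):
        # First column: aligning the empty prefix of `b` needs i gap columns.
        cur = [(i, i, i)]
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            candidates = [
                (prev[j - 1][0] + sub, prev[j - 1]),  # match or mismatch
                (prev[j][0] + 1, prev[j]),            # consume a[i-1] only
                (cur[j - 1][0] + 1, cur[j - 1]),      # consume b[j-1] only
            ]
            best = min(dist for dist, _ in candidates)
            # Only predecessors that achieve the optimal score may contribute.
            optimal = [src for dist, src in candidates if dist == best]
            shortest = min(src[1] for src in optimal) + 1
            longest = max(src[2] for src in optimal) + 1
            cur.append((best, shortest, longest))
        prev = cur
    return prev[m]
```

For the pair `"AB"` and `"BA"` this returns `(2, 2, 3)`: the edit distance is 2, reached either by two mismatches (length 2) or by a gap, a match and a gap (length 3), so the pairwise identity lies between 0 and 1/3 depending on which length is used.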

If I understand correctly, the priorities you linked to are such that a higher number means that WFA2 prefers to select that operation during traceback?

Yes.

So when a mismatch is a valid option, it chooses that; if no mismatch is a valid option it chooses a deletion, and then if no deletion is possible it chooses an insertion?

In case of ties (i.e., the traceback, at a given position, can choose between multiple operations; I, D, X), the WFA backtrace chooses X first, then D, then I.
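To see how the tie-breaking order changes the reported CIGAR, here is a small Python sketch. It uses a classic full-matrix edit-distance traceback, not the actual WFA2 wavefront backtrace, and the functions `edit_matrix` and `traceback` are hypothetical names. It assumes `D` consumes a character of the first sequence only and `I` a character of the second only; the `priority` string must contain all four letters `M`, `X`, `D`, `I`.

```python
def edit_matrix(a: str, b: str) -> list[list[int]]:
    """Fill the standard unit-cost edit distance matrix."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = int(a[i - 1] != b[j - 1])
            dp[i][j] = min(dp[i - 1][j - 1] + sub,
                           dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1)
    return dp


def traceback(a: str, b: str, priority: str) -> str:
    """Return one optimal alignment, breaking ties in `priority` order."""
    dp = edit_matrix(a, b)
    i, j = len(a), len(b)
    ops = []
    while i > 0 or j > 0:
        # Collect every operation that stays on a score-optimal path.
        valid = {}
        if i > 0 and j > 0:
            if a[i - 1] == b[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
                valid["M"] = (i - 1, j - 1)
            elif a[i - 1] != b[j - 1] and dp[i][j] == dp[i - 1][j - 1] + 1:
                valid["X"] = (i - 1, j - 1)
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            valid["D"] = (i - 1, j)
        if j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            valid["I"] = (i, j - 1)
        # Take the first valid operation in the caller's preference order.
        op = next(o for o in priority if o in valid)
        ops.append(op)
        i, j = valid[op]
    return "".join(reversed(ops))
```

With the X, D, I order described above, `traceback("AB", "BA", "MXDI")` returns `"XX"`, while an indel-first order, `traceback("AB", "BA", "MIDX")`, returns `"DMI"`. Both have edit distance 2 but different alignment lengths.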

If I am reading the corresponding code in edlib correctly, this is exactly opposite to the prioritization of edlib, and that makes sense given the difference in the results I get between them.

I don't know more than you do. Perhaps @Martinsos had a reason to program those priorities.

Neither priority is guaranteed to lead to the true minimum/maximum alignment length, [...] Actually maximizing or minimizing the length would require visiting all cells that can be on any score-optimal alignment path, and WFA2 does not do that (nor edlib) because it is unnecessary for most uses.

Both correct.