smarco / WFA2-lib

WFA-lib: Wavefront alignment algorithm library v2

Do D and I have inverted roles in CIGAR strings?

marcelm opened this issue

Running wfademo.cpp, I noticed that D and I in the CIGAR output seem to be swapped relative to their usual meaning. Here’s an example taken from the README:

    PATTERN    AGCTA-GTGTCAATGGCTACT---TTTCAGGTCCT
               | ||| |||||  ||||||||   | |||||||||
    TEXT       AACTAAGTGTCGGTGGCTACTATATATCAGGTCCT
    ALIGNMENT  1M1X3M1I5M2X8M3I1M1X9M

The README states that text is equivalent to reference and pattern equivalent to query (which makes sense). If I take the above pattern to be a sequencing read and the text to be a genome reference, then the two gaps would be considered to be deletions, but they are encoded as 1I and 3I, respectively. Or should I think about this differently?

Hi,

This I can answer right away. WFA2-lib follows the convention in which the CIGAR describes how to transform the pattern/query into the text/database/reference (as in classic pattern-matching papers). The SAM CIGAR standard works the other way around (since the reference is the important sequence there). Leaving aside which convention is better (I think both are fine), if you want SAM-style CIGARs, just swap the pattern and text sequences when calling the WFA align function, and all the Ds become Is (and vice versa).

Let me know if that helps.
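For anyone who already has a CIGAR in the WFA2-lib convention and cannot easily re-run the alignment, the same I/D swap can presumably be applied to the string itself. The sketch below is only an illustration: `swap_indels` is a hypothetical helper, not part of WFA2-lib, and it assumes the run-length format shown in the README example above.

```cpp
#include <string>

// Hypothetical helper (not part of WFA2-lib): swap the roles of insertions
// and deletions in a run-length CIGAR string. Counts and all other
// operations (M, X, ...) are left unchanged.
std::string swap_indels(const std::string& cigar) {
    std::string out = cigar;
    for (char& c : out) {
        if (c == 'I') {
            c = 'D';  // insertion (pattern -> text) becomes deletion (reference -> read)
        } else if (c == 'D') {
            c = 'I';  // and vice versa
        }
    }
    return out;
}
```

Applied to the README alignment, `swap_indels("1M1X3M1I5M2X8M3I1M1X9M")` gives `1M1X3M1D5M2X8M3D1M1X9M`, so the two gaps in the pattern show up as deletions relative to the text.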

Thanks! I see. Would you consider adding a comment to the README to make this clear for others as well?

Swapping pattern and text is of course the simplest fix for this, and it is what I’m using at the moment.

Sure (sorry for the delay). Please have a look at the development branch and let me know if that reads more clearly.

Thanks,

Thanks, that is clear enough!