jeffdaily / parasail

Pairwise Sequence Alignment Library

Comments/questions about fatal flaw in gap penalties for my use cases

rcedgar opened this issue · comments

Hi Jeff -- From my initial exploration, Parasail looks close to being a really well-designed and comprehensive library that could be more widely adopted, and I was hoping to use it in my own work, but from my perspective there are two major flaws in the gap penalty implementation.

In the best and most popular bioinformatics software tools (BLAST, HMMER, minimap2, bowtie2, CLUSTALW, MAFFT..., plus my own MUSCLE and USEARCH), the highest accuracy in search and alignment is universally achieved by affine gap penalties with open and extension penalties, i.e. the cost of a gap is Open + (L - 1)*Ext where L is the gap length, Open is the gap-open penalty and Ext is the gap-extension penalty. If you try to fudge this with Ext=0, accuracy will suffer. IIUC your library does not support Ext!=0, hence my first concern.

A second, more specialized use case for fast alignment is a pre-filtering step before doing slower "full" d.p. with affine gaps in the subset which pass the filter. For my case, I believe the most effective filter will be Smith-Waterman with no gaps allowed. I see that Parasail has fast variants where traceback is not supported, but IIUC it always allows gaps, which probably means that faster implementations are possible which do not allow gaps. I'm only just starting to get my head around vectorized algorithms for accelerating sequence alignment, but surely "striping" methods can be optimized if stripes never interact with each other.
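As an illustration of the no-gap case described above, here is a minimal scalar sketch of Smith-Waterman with gaps disallowed. It is not Parasail code and is not vectorized; the function name `ungapped_local_score` and the `match`/`mismatch` scores are assumptions chosen for the example. Without gaps, each cell depends only on its diagonal predecessor, so every diagonal of the d.p. matrix can be scored independently.

```python
def ungapped_local_score(query, target, match=2, mismatch=-3):
    """Best local alignment score when no gaps are allowed.

    Recurrence: H[i][j] = max(0, H[i-1][j-1] + s(query[i], target[j])).
    The match/mismatch values are illustrative, not Parasail defaults.
    """
    best = 0
    # prev holds row i-1 of H; index 0 is the zero boundary column.
    prev = [0] * (len(target) + 1)
    for q in query:
        curr = [0] * (len(target) + 1)
        for j, t in enumerate(target, start=1):
            # Only the diagonal neighbour contributes: no gap states exist.
            score = prev[j - 1] + (match if q == t else mismatch)
            if score > 0:
                curr[j] = score
                if score > best:
                    best = score
        prev = curr
    return best
```

A pair would pass such a pre-filter when `ungapped_local_score(a, b)` reaches a chosen threshold; the threshold itself is a tuning decision not covered here.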

Questions: Did I understand correctly? Any chance you would be up for addressing these issues? If not, can you offer some pointers about how I could do this myself?

Thanks! Robert.

For your first concern, about gap penalties: are you referring to this line in the README?

Note: When any of the algorithms open a gap, only the gap open penalty alone is applied.

parasail supports affine gap penalties. Perhaps that line in the README is misleading. I meant for it to clarify what happens when the gap open penalty is applied. Other software might choose to apply the gap extend penalty right away when the gap open is applied -- I think this refers to your equation above but modified as Open + L*Ext. In the case of parasail, gap open is applied by itself initially, followed by the extension penalty for subsequent characters. I think that is the same as your equation Open + (L-1)*Ext.
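The two conventions mentioned above differ only in whether the first gap character is also charged the extension penalty. A small sketch makes the arithmetic concrete; the function names are hypothetical and are not part of the parasail API.

```python
def gap_cost_open_only(length, gap_open, gap_extend):
    """Open + (L - 1) * Ext: the first gap character costs gap_open alone."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + (length - 1) * gap_extend


def gap_cost_open_plus_extend(length, gap_open, gap_extend):
    """Open + L * Ext: the first gap character costs gap_open + gap_extend."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + length * gap_extend


def to_open_only_params(gap_open, gap_extend):
    """Convert (open, extend) from the Open + L*Ext convention into the
    pair that yields identical costs under Open + (L - 1)*Ext."""
    return gap_open + gap_extend, gap_extend
```

With Open=10 and Ext=1, a gap of length 3 costs 12 under the first convention and 13 under the second. Parameters taken from a tool that uses Open + L*Ext can be carried over by adding Ext to Open, as `to_open_only_params` does.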

For your second concern, about filtering:

parasail was designed for simplicity and performance. Early versions of parasail only had APIs that returned a score and some alignment statistics because that's all my professor needed. The more any single alignment within parasail does, the slower it gets. Tracebacks were added much later based on community requests.

Admittedly, parasail is probably too complicated today, with too many functions that nobody ever uses. It started as a research project studying the various vectorized implementations of sequence alignment and the effect of ever-increasing CPU vector widths from SSE to AVX2 to AVX512. I doubt anyone uses the "diag" implementations because they're slow. My main contribution to the research community here was the prefix "scan" implementation, which was faster than the "striped" implementation in specific cases.

Back to your filtering question. When I started this project there were so many great applications, but there wasn't a simple C API for performing a single pairwise alignment (with the exception of SSW, which I drew inspiration from). A C library for pairwise local/global/semi-global alignments seemed like it was missing from the community. But for the purpose of getting research papers published, and as a sample application, the parasail_aligner tool exists. It does have an efficient filter implementation based on suffix arrays, so you can filter out any pair of sequences that do not have an exact-match substring of a given minimum length.
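To show what the exact-match criterion tests, here is a minimal sketch that uses a set of fixed-length substrings instead of suffix arrays. It is a simpler stand-in for illustration only, not how parasail_aligner is implemented, and the function name `shares_exact_substring` is hypothetical. Two sequences share an exact match of at least `min_length` characters if and only if they share some substring of exactly `min_length` characters, so checking that one length is enough.

```python
def shares_exact_substring(a, b, min_length):
    """Return True if a and b share an exact-match substring of at least
    min_length characters.

    Uses a hash set of length-min_length substrings of `a`; a suffix array
    would scale better across many sequences, which this sketch ignores.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # All substrings of `a` with exactly min_length characters.
    kmers = {a[i:i + min_length] for i in range(len(a) - min_length + 1)}
    # Any hit in `b` proves a shared exact match of the required length.
    return any(
        b[j:j + min_length] in kmers
        for j in range(len(b) - min_length + 1)
    )
```

Pairs for which this returns False would be skipped before running the full alignment.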

parasail only has affine penalties. I barely have time to respond to issues these days, so I will not be implementing other features.

For the record, after further review I can see that Parasail is amazing -- it was obviously a huge amount of work and it does a good job of covering a wide range of use-cases. I'm only just getting my head around it but I can already see I will be much better off starting from your code than trying to figure out SIMD alignment algorithms from scratch. My bad.