lh3 / miniprot

Align proteins to genomes with splicing and frameshift

Home Page:https://lh3.github.io/miniprot/

duplicate and close variants of the same alignment in the output

azat-badretdin opened this issue

When I use these parameters:

 ./miniprot -G 100 -O 10 -J 34 -F 30 --gff -ut32 nucleotide.fasta proteins.fasta

I get very close variants of the same alignment:

gpipedev21:issue-34$ grep WP_004242317 miniprot.gff  | grep PAF
##PAF   gi|490362554|ref|WP_004242317.1|        343     149     343     +       gi|545778205|gb|U00096.3|       4641652 3221864 3222446 402     582     0       AS:i:680        ms:i:680      np:i:159 da:i:-1 do:i:0  cg:Z:194M       cs:Z::2*accC*gacS*aatA*atcV:2*atcV:2*cacS*gaaD*cccR*ggcQ:1*ggtD:9*cgcY:1*agtA*aaaQ*gaaS*atcV*atcT:2*tatF:1*aacA:2*gttY*aatD:7*gaaQ:1*gagS:1*ggcA*aagA:8*gcgT:3*cgaS:1*aaaR*caaG:3*gaaG:3*tggY:2*ggtD:3*tcgA:3*gaaA:7*cggG:1*gacS:19*attL:2*cgaQ*ggcH*ctgI*aacA:2*cagE:2*tcgA:10*cgaK:2*tttI:1*ccgS:9*atgV:8*gtgL*tatF:1*aaaR*gccL:2*ggtE:1*gcgQ*ctgE:2*ttaQ*gtcI:1*gttA*cccA:1*aaaR:1*aaaI:5*cgtK
##PAF   gi|490362554|ref|WP_004242317.1|        343     154     343     +       gi|545778205|gb|U00096.3|       4641652 3221879 3222446 396     567     0       AS:i:675        ms:i:675      np:i:157 da:i:-1 do:i:0  cg:Z:189M       cs:Z:*atcV:2*atcV:2*cacS*gaaD*cccR*ggcQ:1*ggtD:9*cgcY:1*agtA*aaaQ*gaaS*atcV*atcT:2*tatF:1*aacA:2*gttY*aatD:7*gaaQ:1*gagS:1*ggcA*aagA:8*gcgT:3*cgaS:1*aaaR*caaG:3*gaaG:3*tggY:2*ggtD:3*tcgA:3*gaaA:7*cggG:1*gacS:19*attL:2*cgaQ*ggcH*ctgI*aacA:2*cagE:2*tcgA:10*cgaK:2*tttI:1*ccgS:9*atgV:8*gtgL*tatF:1*aaaR*gccL:2*ggtE:1*gcgQ*ctgE:2*ttaQ*gtcI:1*gttA*cccA:1*aaaR:1*aaaI:5*cgtK

The same behavior may also be what causes duplicated records in some of the alignment output. For example:

gi|545778205|gb|U00096.3|       miniprot        CDS     729583  733323  6547    +       0       Parent=MP001848;Rank=18;Identity=0.9719;Target=gi|15829983|ref|NP_308756.1| 1 1247
gi|545778205|gb|U00096.3|       miniprot        mRNA    729583  733323  6547    +       .       ID=MP001849;Rank=19;Identity=0.9719;Positive=0.9783;Target=gi|15829983|ref|NP_308756.1| 1 1247
gi|545778205|gb|U00096.3|       miniprot        CDS     729583  733323  6547    +       0       Parent=MP001849;Rank=19;Identity=0.9719;Target=gi|15829983|ref|NP_308756.1| 1 1247
gi|545778205|gb|U00096.3|       miniprot        mRNA    729583  733323  6547    +       .       ID=MP001850;Rank=20;Identity=0.9719;Positive=0.9783;Target=gi|15829983|ref|NP_308756.1| 1 1247
gi|545778205|gb|U00096.3|       miniprot        CDS     729583  733323  6547    +       0       Parent=MP001850;Rank=20;Identity=0.9719;Target=gi|15829983|ref|NP_308756.1| 1 1247
gi|545778205|gb|U00096.3|       miniprot        mRNA    729583  733323  6547    +       .       ID=MP001851;Rank=21;Identity=0.9719;Positive=0.9783;Target=gi|15829983|ref|NP_308756.1| 1 1247
gi|545778205|gb|U00096.3|       miniprot        CDS     729583  733323  6547    +       0       Parent=MP001851;Rank=21;Identity=0.9719;Target=gi|15829983|ref|NP_308756.1| 1 1247

The alignments are the same, but the Rank=x value is different in each case.

These are two different hits. For now, you have to filter them out yourself.
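One way to do that filtering is sketched below. This is only a rough post-processing script, not part of miniprot: the function names `parse_attrs` and `filter_gff` and the `min_overlap` threshold are made up for illustration. It assumes tab-separated GFF as shown above, with the hit score in column 6 and the protein name as the first token of `Target=`. For each protein, contig and strand it keeps the best-scoring mRNA record and drops any other record that covers at least `min_overlap` of the shorter of the two intervals, which should catch both the exact duplicates and the close variants. Comment lines, including `##PAF`, are dropped from the result.

```python
from collections import defaultdict


def parse_attrs(field):
    """Turn 'ID=MP1;Rank=1;Target=prot 1 100' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        key, sep, value = item.partition("=")
        if sep:
            attrs[key] = value
    return attrs


def filter_gff(lines, min_overlap=0.9):
    """Return GFF feature lines with redundant hits removed (hypothetical helper)."""
    rows = []  # (columns, attributes) for every 9-column feature line
    hits = []  # one tuple per mRNA record
    for line in lines:
        if line.startswith("#"):
            continue  # skip comments such as ##PAF
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attrs(cols[8])
        rows.append((cols, attrs))
        target = attrs.get("Target", "").split()
        if cols[2] == "mRNA" and "ID" in attrs and target:
            score = float(cols[5]) if cols[5] != "." else 0.0
            hits.append((score, cols[0], cols[6], target[0],
                         int(cols[3]), int(cols[4]), attrs["ID"]))

    # Best score first; the sort is stable, so ties keep file order.
    hits.sort(key=lambda h: -h[0])

    kept_spans = defaultdict(list)  # (contig, strand, protein) -> [(start, end)]
    kept_ids = set()
    for score, seqid, strand, protein, start, end, hit_id in hits:
        spans = kept_spans[(seqid, strand, protein)]
        redundant = False
        for s, e in spans:
            overlap = min(end, e) - max(start, s) + 1
            shorter = min(end - start, e - s) + 1
            if overlap >= min_overlap * shorter:
                redundant = True
                break
        if not redundant:
            spans.append((start, end))
            kept_ids.add(hit_id)

    # Emit kept mRNA records and their child features in the original order.
    out = []
    for cols, attrs in rows:
        owner = attrs.get("ID") if cols[2] == "mRNA" else attrs.get("Parent")
        if owner in kept_ids:
            out.append("\t".join(cols))
    return out
```

It could be run as `filter_gff(open("miniprot.gff"))`. Lowering `min_overlap` makes the filter more aggressive; whether 0.9 is a sensible default depends on the data.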

Thanks. Which example are you talking about? Or both?

Both

For now

Does this mean there is hope that, in the future, hits will be reported one per region?