alignments commands
colindaven opened this issue · comments
Hi,
this is an interesting tool.
I'm not quite sure how to generate the alignments properly for either wfmash or minimap2.
Would it be something like this?
# align target vs each genome separately
minimap2 -eqx ....
# cat the minimap pafs
cat *.paf > combined.paf
# run impg on the whole paf
impg ... combined.paf
Or am I misunderstanding something here?
Thanks,
Colin
It's designed to work with default wfmash output for an all-to-all alignment, e.g. with wfmash seqs.fa.gz > aln.paf. It will probably work with minimap2 as well, but that has not been tested.
I've made a few changes (#4), and impg can now also parse minimap2's CIGAR strings:
minimap2 scerevisiae7.fa.gz scerevisiae7.fa.gz -X -c -t 48 > mm2.paf
impg -p mm2.paf -r UWOPS034614#1#chrI:1000-2000 | head -n 5 | column -t
UWOPS034614#1#chrI 1000 2000
S288C#1#chrVIII 570385 569601
DBVPG6765#1#chrVI 5195 5981
DBVPG6765#1#chrI 210581 209798
UWOPS034614#1#chrI 213064 212060
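Two details of this output are easy to miss. The -r argument has the form name:start-end, where the name itself contains '#' characters, and some result rows list the larger coordinate first (e.g. 570385 569601), which corresponds to the '-' strand shown for the same intervals in the -P output. The following is a minimal Python sketch for handling both when post-processing the output; it is not part of impg, and the function names are hypothetical.

```python
def parse_region(region):
    """Split a 'name:start-end' string into (name, start, end).

    The name is taken up to the LAST ':' so that names containing
    '#' (or even ':') are kept intact.
    """
    name, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    if not name or not start or not end:
        raise ValueError(f"expected name:start-end, got {region!r}")
    return name, int(start), int(end)


def normalize_interval(name, first, second):
    """Return (name, start, end, strand) with start <= end.

    Assumption: a row whose first coordinate is larger than the second
    is a reverse-strand hit, as the '-' column in the -P output suggests.
    """
    first, second = int(first), int(second)
    if first <= second:
        return name, first, second, "+"
    return name, second, first, "-"


print(parse_region("UWOPS034614#1#chrI:1000-2000"))
print(normalize_interval("S288C#1#chrVIII", 570385, 569601))
```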
impg -p mm2.paf -r UWOPS034614#1#chrI:1000-2000 -P | head -n 5 | column -t
UWOPS034614#1#chrI 214332 1000 2000 + UWOPS034614#1#chrI 214332 1000 2000 1000 1000 255 cg:Z:1000=
S288C#1#chrVIII 581049 570385 569601 - UWOPS034614#1#chrI 214332 1000 2000 775 1009 255 cg:Z:10M1I29M2I8M1D11M1D78M18D6M1D5M2D1M3D23M1D6M1D19M1D26M1D27M1D39M5D1M2D19M1D94M1I72M1I21M1I10M1D21M3I22M151D205M1D22M33D
DBVPG6765#1#chrVI 257436 5195 5981 + UWOPS034614#1#chrI 214332 1000 2000 777 1009 255 cg:Z:10M1I29M2I8M1D11M1D78M18D6M1D5M2D1M1D25M1D6M1D19M1D26M1D27M1D39M5D1M2D19M1D94M1I72M1I21M1I10M1D21M3I22M151D205M1D22M33D
DBVPG6765#1#chrI 215496 210581 209798 - UWOPS034614#1#chrI 214332 1000 2000 774 1009 255 cg:Z:10M1I29M2I8M1D11M1D78M18D6M1D5M6D23M1D6M1D19M1D26M1D27M1D39M5D1M2D19M1D94M1I72M1I21M1I10M1D21M3I22M151D205M1D22M33D
UWOPS034614#1#chrI 214332 213064 212060 - UWOPS034614#1#chrI 214332 1000 2000 1000 1004 255 cg:Z:171M4I829M
# --eqx to write =/X CIGAR operators
minimap2 scerevisiae7.fa.gz scerevisiae7.fa.gz -X -c -t 48 --eqx > mm2.eqx.paf
impg -p mm2.eqx.paf -r UWOPS034614#1#chrI:1000-2000 | head -n 5 | column -t
UWOPS034614#1#chrI 1000 2000
S288C#1#chrVIII 570385 569601
DBVPG6765#1#chrVI 5195 5981
DBVPG6765#1#chrI 210581 209798
UWOPS034614#1#chrI 213064 212060
impg -p mm2.eqx.paf -r UWOPS034614#1#chrI:1000-2000 -P | head -n 5 | column -t
UWOPS034614#1#chrI 214332 1000 2000 + UWOPS034614#1#chrI 214332 1000 2000 1000 1000 255 cg:Z:1000=
S288C#1#chrVIII 581049 570385 569601 - UWOPS034614#1#chrI 214332 1000 2000 665 1009 255 cg:Z:10=1I24=1X4=2I2=1X5=1D3=1X1=1X5=1D3=2X10=1X29=1X4=1X1=2X9=2X13=18D6=1D5=2D1=3D19=2X2=1D6=1D1X16=1X1=1D10=1X12=2X1=1D1=1X2=2X3=1X17=1D15=1X18=1X1=1X2=5D1=2D4=1X1=1X3=1X1=1X1=1X4=1D1X5=1X1=1X7=1X5=1X2=1X6=1X4=1X15=1X4=3X7=1X1=1X1=2X10=1X1=1X1=1X1=1X3=1I13=1X5=1X2=3X3=1X1=1X1=1X2=1X4=1X1=1X2=1X12=1X7=1X5=1I1X1=1X3=1X4=2X2=1X3=1X1=1I10=1D4=1X4=1X2=1X8=3I16=1X5=151D17=2X7=1X23=1X5=1X14=1X20=1X2=1X2=2X4=1X5=1X12=1X4=1X5=2X4=1X3=1X1=1X1=2X2=1X1=1X3=1X2=1X11=1X2=1X8=1X3=1X4=1X4=2X4=1D17=1X4=33D
DBVPG6765#1#chrVI 257436 5195 5981 + UWOPS034614#1#chrI 214332 1000 2000 667 1009 255 cg:Z:10=1I24=1X4=2I2=1X5=1D3=1X1=1X5=1D3=2X10=1X29=1X4=1X1=2X9=2X13=18D6=1D5=2D1=1D21=2X2=1D6=1D1X16=1X1=1D10=1X12=2X1=1D1=1X2=2X3=1X17=1D15=1X18=1X1=1X2=5D1=2D4=1X1=1X3=1X1=1X1=1X4=1D1X5=1X1=1X7=1X5=1X2=1X6=1X4=1X15=1X4=3X7=1X1=1X1=2X10=1X1=1X1=1X1=1X3=1I13=1X5=1X2=3X3=1X1=1X1=1X2=1X4=1X1=1X2=1X12=1X7=1X5=1I1X1=1X3=1X4=2X2=1X3=1X1=1I10=1D4=1X4=1X2=1X8=3I16=1X5=151D17=2X7=1X23=1X5=1X14=1X20=1X2=1X2=2X4=1X5=1X12=1X4=1X5=2X4=1X3=1X1=1X1=2X2=1X1=1X3=1X2=1X11=1X2=1X8=1X3=1X4=1X4=2X4=1D17=1X4=33D
DBVPG6765#1#chrI 215496 210581 209798 - UWOPS034614#1#chrI 214332 1000 2000 664 1009 255 cg:Z:10=1I24=1X4=2I2=1X5=1D3=1X1=1X5=1D3=2X10=1X29=1X4=1X1=2X9=2X13=18D6=1D5=6D19=2X2=1D6=1D1X16=1X1=1D10=1X12=2X1=1D1=1X2=2X3=1X17=1D15=1X18=1X1=1X2=5D1=2D4=1X1=1X3=1X1=1X1=1X4=1D1X5=1X1=1X7=1X5=1X2=1X6=1X4=1X15=1X4=3X7=1X1=1X1=2X10=1X1=1X1=1X1=1X3=1I13=1X5=1X2=3X3=1X1=1X1=1X2=1X4=1X1=1X2=1X12=1X7=1X5=1I1X1=1X3=1X4=2X2=1X3=1X1=1I10=1D4=1X4=1X2=1X8=3I16=1X5=151D17=2X7=1X23=1X5=1X14=1X20=1X2=1X2=2X4=1X5=1X12=1X4=1X5=2X4=1X3=1X1=1X1=2X2=1X1=1X3=1X2=1X11=1X2=1X8=1X3=1X4=1X4=2X4=1D17=1X4=33D
UWOPS034614#1#chrI 214332 213064 212060 - UWOPS034614#1#chrI 214332 1000 2000 1000 1004 255 cg:Z:171=4I829=
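To see how the cg:Z: strings relate to the numeric columns, the sketch below sums the CIGAR operations. For the last row, 171=4I829= gives 1000 exact matches and a block length of 1004, which agrees with columns 10 and 11. With plain M operators, matches and mismatches are not distinguished, which is presumably why column 10 drops (e.g. from 775 to 665 for S288C#1#chrVIII) once --eqx is used. This is an illustrative Python sketch, not impg's implementation; cigar_summary is a hypothetical helper.

```python
import re

# One CIGAR operation: a length followed by an operator.
CIGAR_OP = re.compile(r"(\d+)([MIDX=])")


def cigar_summary(cigar):
    """Summarize a PAF CIGAR string (with or without the 'cg:Z:' prefix)."""
    if cigar.startswith("cg:Z:"):
        cigar = cigar[len("cg:Z:"):]
    if not re.fullmatch(r"(\d+[MIDX=])+", cigar):
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")

    counts = dict.fromkeys("M=XID", 0)
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] += int(length)

    aligned = counts["M"] + counts["="] + counts["X"]
    return {
        # I consumes only the query, D only the target.
        "query_span": aligned + counts["I"],
        "target_span": aligned + counts["D"],
        "block_length": aligned + counts["I"] + counts["D"],
        # Exact matches are only knowable when no ambiguous M is present.
        "exact_matches": counts["="] if counts["M"] == 0 else None,
    }


print(cigar_summary("cg:Z:171=4I829="))
print(cigar_summary("cg:Z:171M4I829M"))
```

For the M-style string the sketch returns None for exact_matches on purpose, since the CIGAR alone cannot say how many of the M columns are identical bases.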
Ok, thanks so much for the details, it is clear now.
I'll concat the fastas before alignment and test out the new mm2 code on some excessively large plant genomes.
I'll try wfmash as well.
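For the concatenation step, one thing to watch is that contig names such as chrI are not unique across genomes. A minimal Python sketch follows, assuming the sample#haplotype#contig naming seen in the output above is wanted; concat_fastas and its argument layout are hypothetical, and compressing or indexing the result is left to whatever the aligner requires.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def concat_fastas(inputs, out_path):
    """Concatenate FASTA files, prefixing each header with sample#haplotype#.

    inputs: iterable of (path, sample, haplotype) tuples.
    Returns the number of sequences written.
    """
    n_seqs = 0
    with open(out_path, "w") as out:
        for path, sample, hap in inputs:
            with open_text(path) as fh:
                for line in fh:
                    line = line.rstrip("\r\n")
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        # Keep only the first word of the original header.
                        fields = line[1:].split()
                        contig = fields[0] if fields else ""
                        out.write(f">{sample}#{hap}#{contig}\n")
                        n_seqs += 1
                    else:
                        out.write(line + "\n")
    return n_seqs


# Example call (file names are placeholders):
# concat_fastas([("S288C.fa.gz", "S288C", 1), ("DBVPG6765.fa", "DBVPG6765", 1)], "seqs.fa")
```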
aligners: We have developed wfmash to work well on plant genomes. A lot of testing has focused on comparing wfmash to other methods in highly divergent regions. It's similar in performance to AnchorWave but does not depend on gene annotations. Let us know what works and what doesn't. We are developing the publication now after several years of refinement.
impg: Note also that you can use bgzip indexing with the PAF. I'll update the readme to make this clearer.