A best practices document for assembly to assembly comparisons?

Question

A best practices document for assembly to assembly comparisons?

lfaller opened this issue 6 years ago · comments

I am looking for a tool to do assembly to assembly mapping for bacteria and I am very excited to have found minimap2!

Do you have a best practices document for this use case?

Or is minimap2 -ax asm5 ref.fasta query.fasta > alignment.sam the way to go?

Edited to add that I am specifically curious about how to annotate differences between the two assemblies (i.e. SNPs, indels, etc).

Thanks!
~Lina

Heng Li · Answer 1 · Tue Feb 13 2018 07:35:04 GMT+0800 (China Standard Time)

Use asm5 for assemblies from the same species. Use asm10 if they are slightly different. Use -a if the sequence divergence is high.

Variant calling is a bit complicated. You need to:

git clone https://github.com/lh3/minimap2   # you need the latest version; not from conda
cd minimap2 && make

curl -L https://github.com/attractivechaos/k8/releases/download/v0.2.4/k8-0.2.4.tar.bz2 | tar -jxf -
cp k8-0.2.4/k8-`uname -s` k8   # or copy it to a directory on you $PATH

./minimap2 -cx asm5 --cs ref.fa query.fa | sort -k6,6 -k8,8n | ./k8 misc/paftools.js call - var.txt

You can find a brief explanation of the output here. The documentation there will be changed in the near future.

lfaller · Answer 2 · Tue Feb 13 2018 23:09:00 GMT+0800 (China Standard Time)

This is great, thanks for the example code!

When I ran this on my data, I actually got no SNPs/indels reported:

[M::mm_idx_gen::0.376*0.99] collected minimizers
[M::mm_idx_gen::0.442*1.28] sorted minimizers
[M::main::0.442*1.28] loaded/built the index for 6 target sequence(s)
[M::mm_mapopt_update::0.477*1.27] mid_occ = 7
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 6
[M::mm_idx_stat::0.502*1.25] distinct minimizers: 899995 (98.78% are singletons); average occurrences: 1.017; average spacing: 10.002
[M::worker_pipeline::2.792*1.37] mapped 4 sequences
[M::main] Version: 2.8-r672
[M::main] CMD: minimap2 -ax asm5 --cs wt_minion.fasta wt_pacbio.fasta
[M::main] Real time: 2.857 sec; CPU: 3.892 sec
0 reference bases covered by exactly one contig
0 substitutions; ts/tv = NaN
0 1bp deletions
0 1bp insertions
0 2bp deletions
0 2bp insertions
0 [3,50) deletions
0 [3,50) insertions
0 >=50 deletions
0 >=50 insertions

I used the asm5 option because the data I have is from two microbial strains of the same species.

Do you think my strains are too similar? Is there a way to tune the algorithm further?

Thanks for any insight!
~Lina

Heng Li · Answer 3 · Tue Feb 13 2018 23:14:46 GMT+0800 (China Standard Time)

It is more likely that the two strains are too divergent. You may first try:

./minimap2 -c --cs ref.fa query.fa | sort -k6,6 -k8,8n | ./k8 misc/paftools.js call -L20000 - > var.txt

and see what is happening.

Heng Li · Answer 4 · Tue Feb 13 2018 23:15:49 GMT+0800 (China Standard Time)

Sorry, a typo. You should use miniamp2 -c, not minimap2 -a. My mistake.

lfaller · Answer 5 · Wed Feb 14 2018 01:38:10 GMT+0800 (China Standard Time)

This did the trick:

[M::mm_idx_gen::0.424*1.00] collected minimizers
[M::mm_idx_gen::0.559*1.47] sorted minimizers
[M::main::0.559*1.47] loaded/built the index for 6 target sequence(s)
[M::mm_mapopt_update::0.608*1.43] mid_occ = 14
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 6
[M::mm_idx_stat::0.636*1.41] distinct minimizers: 1443966 (88.05% are singletons); average occurrences: 1.182; average spacing: 5.364
[M::worker_pipeline::4.004*1.38] mapped 4 sequences
[M::main] Version: 2.8-r672
[M::main] CMD: minimap2 -c --cs wt_minion.fasta wt_pacbio.fasta
[M::main] Real time: 4.069 sec; CPU: 5.572 sec

9131478 reference bases covered by exactly one contig
16087 substitutions; ts/tv = 0.910
19791 1bp deletions
25024 1bp insertions
8819 2bp deletions
3097 2bp insertions
4964 [3,50) deletions
481 [3,50) insertions
2 >=50 deletions
0 >=50 insertions

I noticed the output format is not VCF. Do you know of a tool that converts this to VCF? I would like to overlay this in Geneious to visualize the mutation landscape.

Thanks!

Heng Li · Answer 6 · Wed Feb 14 2018 01:56:38 GMT+0800 (China Standard Time)

Unfortunately, it will take some non-trivial effort to implement VCF in paftools.js. I will consider that, but probably not soon.

By the way, I suspect that most substitutions and gaps are consensus errors, not true variants. For this dataset, you can consider to add -xasm5 or -xasm10 to the command line, though the end results will probably similar anyway.

lfaller · Answer 7 · Wed Feb 14 2018 02:32:06 GMT+0800 (China Standard Time)

It changed the output a little bit:

[M::mm_idx_gen::0.385*1.00] collected minimizers
[M::mm_idx_gen::0.447*1.25] sorted minimizers
[M::main::0.447*1.25] loaded/built the index for 6 target sequence(s)
[M::mm_mapopt_update::0.477*1.23] mid_occ = 7
[M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 6
[M::mm_idx_stat::0.499*1.23] distinct minimizers: 899995 (98.78% are singletons); average occurrences: 1.017; average spacing: 10.002
[M::worker_pipeline::2.735*1.40] mapped 4 sequences
[M::main] Version: 2.8-r672
[M::main] CMD: minimap2 -xasm5 -c --cs wt_minion.fasta wt_pacbio.fasta
[M::main] Real time: 2.765 sec; CPU: 3.864 sec

8877059 reference bases covered by exactly one contig
18700 substitutions; ts/tv = 0.783
17573 1bp deletions
22209 1bp insertions
8037 2bp deletions
3006 2bp insertions
4913 [3,50) deletions
603 [3,50) insertions
1 >=50 deletions
0 >=50 insertions

By the way, I am looking at a minion and a pacbio assembly of the same strain. The minion assembly was error-corrected with illumina but I believe the Pacbio wasn't. I will try error correcting the pacbio data to see if some of these SNPs disappear.

If you do end up implementing VCF, that would be great!

Heng Li · Answer 8 · Fri Feb 16 2018 21:55:12 GMT+0800 (China Standard Time)

You can generate VCF with:

git clone https://github.com/lh3/htsbox
(cd htsbox && make)
minimap2 -axasm5 wt_minion.fasta wt_pacbio.fasta | samtool sort - > sorted.bam
htsbox/htsbox pileup -q5 -S10000 -vcf wt_minion.fasta sorted.bam > diff.vcf

lfaller · Answer 9 · Sat Feb 17 2018 03:47:18 GMT+0800 (China Standard Time)

I error-corrected the pacbio data but it didn't seem to affect the SNP calls too much.

In your own work, how do you "consume" the data that minimap2 produces? Is there another visualization method you can recommend? How do you cross-reference SNP calls with annotation?

Thanks for any insight you might have!

jblamyatifremer · Answer 10 · Thu Mar 15 2018 05:36:29 GMT+0800 (China Standard Time)

Dear Lh3,

I want to reduce a diploid assembling to haploid assembly (from canu) using minimap2. My species seems to have 90% of repeated regions, 7% of heterozygotie and probably a lot of SV (but diffuclt to infer without a good assembly). This species will be a good candidat for assemblathon !

I will copy the canu_assembly.fa to read canu_read.fa

option 1

./minimap2 -cx asm10 ./XX/canu_assembly.fa ./XX/canu_read.fa > aln.paf

option 1 bis

I also plan to modify the alignment option (-k19 -w19 -A1 -B9 -O16,41 -E2,1 -s200 -z200) to stick to the biological features of our species.

option 2

./minimap2 -d ./XX/canu_assembly.mmi ./XX/canu_assembly.fa
./minimap2 -a ./XX/canu_assembly.mmi ./XX/canu_read.fa > ./XX/test.sam

What option seems to you the smartest ? Or a new one ?
Cheers,

JB

Heng Li · Answer 11 · Fri Mar 16 2018 00:31:33 GMT+0800 (China Standard Time)

@lfaller probably it is too late, but just let you know that paftools.js call now supports VCF output. To use it, add -f ref.fa to the command line I showed above:

./minimap2 -c --cs ref.fa query.fa \
  | sort -k6,6 -k8,8n \
  | ./k8 misc/paftools.js call -f ref.fa -L20000 - > var.vcf

@jblamyatifremer the latest github master has a new asm20 preset. It is probably better to use that:

./minimap2 -cx asm20 ./XX/canu_assembly.fa ./XX/canu_read.fa > aln.paf

jblamyatifremer · Answer 12 · Fri Mar 16 2018 00:56:29 GMT+0800 (China Standard Time)

I tried your preset "asm20", it appears that minimap2 (version 2.9) does not have the preset "asm20".

Before closing this post, Could you provide the full combination of value for alignment parameters (the preset). You could close this thread... at least from my point of view.
JB

Heng Li · Answer 13 · Fri Mar 23 2018 02:18:19 GMT+0800 (China Standard Time)

@jblamyatifremer you need to use the github master branch.

Closing...

Dario Copetti · Answer 14 · Wed Oct 10 2018 18:56:34 GMT+0800 (China Standard Time)

Use asm5 for assemblies from the same species. Use asm10 if they are slightly different. Use -a if the sequence divergence is high.

Variant calling is a bit complicated. You need to:
git clone https://github.com/lh3/minimap2   # you need the latest version; not from conda
cd minimap2 && make

curl -L https://github.com/attractivechaos/k8/releases/download/v0.2.4/k8-0.2.4.tar.bz2 | tar -jxf -
cp k8-0.2.4/k8-`uname -s` k8   # or copy it to a directory on you $PATH

./minimap2 -cx asm5 --cs ref.fa query.fa | sort -k6,6 -k8,8n | ./k8 misc/paftools.js call - var.txt
You can find a brief explanation of the output here. The documentation there will be changed in the near future.

Hello,
I was able to install k8 but I don't know where to find the paftools.js file: can you point me to it?
Thanks

Heng Li · Answer 15 · Wed Oct 10 2018 19:42:16 GMT+0800 (China Standard Time)

"misc/paftools.js" also in binary release.

Dario Copetti · Answer 16 · Wed Oct 10 2018 20:30:56 GMT+0800 (China Standard Time)

I downloaded the paftool.js from here, then when I run it I get (k8 is in the path, both are in the same folder):

k8 paftools.js
paftools.js:7: SyntaxError: Unexpected token <
<!DOCTYPE html>
^
SyntaxError: Unexpected token <

I see now: I downloaded k8 in a different folder than minimap2 (downloaded with Bioconda), and I don't have the misc/fodler there. This is how my folder (added to the the path) looks now:

$ tree .
.
├── k8
├── k8.cc
├── k8-Darwin
├── k8.js
├── k8-Linux
├── khash.h
├── paftools.js
└── README.md

Heng Li · Answer 17 · Wed Oct 10 2018 21:17:16 GMT+0800 (China Standard Time)

It is minimap2 release:

https://github.com/lh3/minimap2/releases

Heng Li · Answer 18 · Wed Oct 10 2018 21:18:15 GMT+0800 (China Standard Time)

Your downloaded paftools.js as HTML. You need to download "raw" or clone it.

Dario Copetti · Answer 19 · Wed Oct 10 2018 21:33:58 GMT+0800 (China Standard Time)

OK, it works now, I installed minimap and k8 outisde of conda.
Thanks!

Nazia Tasnim · Answer 20 · Wed Sep 09 2020 23:39:35 GMT+0800 (China Standard Time)

It is more likely that the two strains are too divergent. You may first try:
./minimap2 -c --cs ref.fa query.fa | sort -k6,6 -k8,8n | ./k8 misc/paftools.js call -L20000 - > var.txt
and see what is happening.

is this an alternate for the asm paramters? I tried using asm20 with -cx , --cs which produced an empty variant call.