Long run time with large assembly and many contigs
ohickl opened this issue · comments
Hi,
I am trying to run mosdepth like this:
export MOSDEPTH_PRECISION=6
mosdepth --fast-mode \
--by <assembly.fa.bed3> \
--threads 4 \
--thresholds 1,10,30 \
<assembly.fa> \
<mapping.bam>
on metagenomes, but the runtime exceeds two days, so I can't complete the jobs on my regular cluster nodes.
They do have many (short) contigs, e.g.:
> seqkit stats <assembly>.fa
file format type num_seqs sum_len min_len avg_len max_len
<assembly>.fa FASTA DNA 2,041,321 1,499,035,773 200 734.3 372,868
> wc -l <assembly>.fa.bed3
2041321 <assembly>.fa.bed3
WGS and RNA-seq mapping stats:
> samtools flagstat .../mg.reads.sorted.bam
165453111 + 0 in total (QC-passed reads + QC-failed reads)
...
149618292 + 0 mapped (90.43% : N/A)
...
> samtools flagstat .../mt.reads.sorted.bam
147055988 + 0 in total (QC-passed reads + QC-failed reads)
...
147055988 + 0 mapped (100.00% : N/A)
...
Is this expected? Could it be related to #56 or #71? I don't have any memory issues so far, though.
The same also happens when supplying a BED file with genes for --by:
> cut -f 4 genes.bed | wc -l
3009287
> head genes.bed
contig_001 156 318 gene_0079 0 + 156 318 0 1 162, 0,
contig_002 2 167 gene_0080 0 - 2 167 0 1 165, 0,
contig_003 1 304 gene_0081 0 + 1 304 0 1 303, 0,
contig_004 145 331 gene_0082 0 - 145 331 0 1 186, 0,
contig_006 1 298 gene_0083 0 - 1 298 0 1 297, 0,
contig_009 2 167 gene_0084 0 - 2 167 0 1 165, 0,
contig_010 2 299 gene_0085 0 + 2 299 0 1 297, 0,
contig_012 1 358 gene_0086 0 - 1 358 0 1 357, 0,
contig_013 0 258 gene_0087 0 + 0 258 0 1 258, 0,
contig_013 274 352 gene_0088 0 + 274 352 0 1 78, 0,
Best
Oskar
Hi, is it possible to share your bam?
This is a bad case for mosdepth with 2 million very short contigs, but I don't think it should take this long.
I would try without --thresholds (and perhaps --by) to see if that is the cause of the slowness. If it is, I'll see if I can make that part of the code more efficient.
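For anyone wanting to run that comparison systematically, here is a minimal sketch that builds the three command lines (full, without --thresholds, without --by) and times each one. It uses only the flags already shown in this thread; the helper names `build_variants` and `time_command`, the output prefixes, and the file paths are hypothetical placeholders, not part of mosdepth.

```python
import subprocess
import time


def build_variants(prefix, bam, bed, thresholds="1,10,30", threads=4):
    """Return mosdepth command lines that differ only in --by / --thresholds.

    Each variant gets its own output prefix so the runs do not overwrite
    each other's files. All paths are placeholders supplied by the caller.
    """
    base = ["mosdepth", "--fast-mode", "--threads", str(threads)]
    options = {
        "full": ["--by", bed, "--thresholds", thresholds],
        "no_thresholds": ["--by", bed],
        "no_by": [],  # dropping --by also drops --thresholds here
    }
    return {
        name: base + opts + [f"{prefix}.{name}", bam]
        for name, opts in options.items()
    }


def time_command(argv):
    """Run one command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start
```

Calling `time_command(argv)` for each entry of `build_variants("out", "mapping.bam", "assembly.fa.bed3")` would give one wall-clock figure per variant. If the run without --thresholds is much faster than the full one, that option is the likely culprit. On a multi-day job it is probably worth testing on a subset of the BAM first.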
Thanks for the swift response!
I'll try that and also send you a link to example bam files per mail. Is the address in your profile fine?
yes. thank you.
hi @ohickl ,
thanks very much for sending the test-case.
I found the cause for the slowness and I'm working on a solution.
You have 2 million contigs. Each time mosdepth started on a new contig, it created an array of structs of that length. For most BAMs, with a few hundred contigs at most, that's not a problem.
In my quick test, time ../../mosdepth -x t mg.reads.sorted.bam --by assembly.bed3 --thresholds 1,10,30
should now take 25 minutes, whereas before it would probably have taken weeks to finish.
I'll make a new release when I have this done and tested.
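To illustrate why this pattern hurts with so many contigs, here is a toy sketch. It is not mosdepth's code (mosdepth is written in Nim), and the function names are made up for illustration. It only counts how many array cells get initialized when an array with one slot per contig is created once per contig, compared with creating it a single time. The single-allocation version is just one possible alternative, not necessarily the fix used.

```python
def cells_with_per_contig_alloc(n_contigs):
    """Create a fresh n_contigs-long array for every contig.

    Total cells initialized grows as n_contigs squared.
    """
    cells = 0
    for _ in range(n_contigs):
        stats = [0] * n_contigs  # hypothetical array with one slot per contig
        cells += len(stats)
    return cells


def cells_with_single_alloc(n_contigs):
    """Create the array once and reuse it for every contig.

    Total cells initialized grows linearly with n_contigs.
    """
    stats = [0] * n_contigs
    for i in range(n_contigs):
        stats[i] = 0  # reset or update one slot instead of reallocating
    return len(stats)
```

With a few hundred contigs the squared term is negligible, which matches the observation that typical BAMs are unaffected. With 2,041,321 contigs it is on the order of 4 × 10^12 cell initializations, before any real work is done.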
Great to hear! Thanks for your efforts!
Hi @ohickl would you give this binary a try? I removed a couple more things that would be slow for many chroms.
you can gunzip, chmod +x and then run as ./mosdepth_dev [args]
Looks good! Barely 10 min per run for the handful of runs I tried. I also appreciate the progress output on stdout. Thanks a lot!
Is there anything I should know regarding the output or arguments, or should it be exactly the same as before?
any options should be fine, I think. Let me know if you see any huge changes with different arguments.