brentp / mosdepth

fast BAM/CRAM depth calculation for WGS, exome, or targeted sequencing

Long run time with large assembly and many contigs

ohickl opened this issue

Hi,
I am trying to run mosdepth like this:

export MOSDEPTH_PRECISION=6
mosdepth --fast-mode \
         --by <assembly.fa.bed3> \
         --threads 4 \
         --thresholds 1,10,30 \
         <assembly.fa> \
         <mapping.bam>

on metagenomes, but the runtime exceeds two days, so I can't complete the jobs on my regular cluster nodes.
The assemblies do have many (short) contigs, e.g.:

> seqkit stats <assembly>.fa
file           format  type   num_seqs        sum_len  min_len  avg_len  max_len
<assembly>.fa  FASTA   DNA   2,041,321  1,499,035,773      200    734.3  372,868
> wc -l <assembly>.fa.bed3                             
2041321 <assembly>.fa.bed3

WGS and RNA-seq mapping stats:

> samtools flagstat .../mg.reads.sorted.bam
165453111 + 0 in total (QC-passed reads + QC-failed reads)
...
149618292 + 0 mapped (90.43% : N/A)
...
> samtools flagstat .../mt.reads.sorted.bam
147055988 + 0 in total (QC-passed reads + QC-failed reads)
...
147055988 + 0 mapped (100.00% : N/A)
...

Is this expected? Could it be related to #56 or #71? I don't have any memory issues so far, though.

The same also happens when supplying a bed with genes for --by:

> cut -f 4 genes.bed | wc -l
3009287
> head genes.bed
contig_001   156     318     gene_0079        0       +       156     318     0       1       162,    0,
contig_002   2       167     gene_0080        0       -       2       167     0       1       165,    0,
contig_003   1       304     gene_0081        0       +       1       304     0       1       303,    0,
contig_004   145     331     gene_0082        0       -       145     331     0       1       186,    0,
contig_006   1       298     gene_0083        0       -       1       298     0       1       297,    0,
contig_009   2       167     gene_0084        0       -       2       167     0       1       165,    0,
contig_010   2       299     gene_0085        0       +       2       299     0       1       297,    0,
contig_012   1       358     gene_0086        0       -       1       358     0       1       357,    0,
contig_013   0       258     gene_0087        0       +       0       258     0       1       258,    0,
contig_013   274     352     gene_0088        0       +       274     352     0       1       78,     0,

Best

Oskar

Hi, is it possible to share your bam?

This is a bad case for mosdepth with 2 million very short contigs, but I don't think it should take this long.

I would try without --thresholds (and perhaps --by) to see if that is the cause of the slowness. If it is, I'll see if I can make that part of the code more efficient.
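
A minimal dry-run sketch of that comparison, using only the options from the command above. The file names and the `test_*` output prefixes are placeholders (assumptions), and the prefix is placed before the BAM, as mosdepth's usage line expects. It only prints the commands, so nothing runs until a line is copied into the shell.

```shell
#!/usr/bin/env bash
# Dry run: print one mosdepth command per variant so the slow option can be isolated.
# BED, BAM and the test_* prefixes are placeholders, not taken from the thread.
BED="assembly.fa.bed3"
BAM="mapping.bam"

# build_cmd VARIANT: echo the mosdepth command line for that variant.
build_cmd() {
  local variant="$1"
  local opts="--fast-mode --threads 4"
  case "$variant" in
    full)          opts="$opts --by $BED --thresholds 1,10,30" ;;
    no_thresholds) opts="$opts --by $BED" ;;
    no_by)         : ;;  # --thresholds is reported per --by region, so both are dropped
    *) echo "unknown variant: $variant" >&2; return 1 ;;
  esac
  echo "mosdepth $opts test_$variant $BAM"
}

# Print the three commands, each prefixed with `time` for a wall-clock comparison.
for v in full no_thresholds no_by; do
  echo "time $(build_cmd "$v")"
done
```

If `no_thresholds` is much faster than `full`, the thresholds code is the likely culprit. If only `no_by` is fast, the region handling is.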

Thanks for the swift response!
I'll try that and also send you a link to example BAM files by email. Is the address in your profile fine?

yes. thank you.

hi @ohickl ,
thanks very much for sending the test-case.
I found the cause for the slowness and I'm working on a solution.
You have 2 million contigs. Each time mosdepth processed a new contig, it was creating an array of structs of that length. For most BAMs, with a few hundred contigs at most, that's not a problem.
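
A back-of-the-envelope sketch of why this hurts, assuming "that length" means one struct per contig, so that an N-element array is rebuilt for each of N contigs. The function name `total_inits` is hypothetical and this is not mosdepth's code. The contig count is `num_seqs` from the seqkit output above.

```shell
#!/usr/bin/env bash
# total_inits N: elements initialised if an N-element array is rebuilt for each of N contigs.
total_inits() {
  local n="$1"
  echo $(( n * n ))
}

n_contigs=2041321   # num_seqs from the seqkit stats output in this thread

echo "array rebuilt per contig: $(total_inits "$n_contigs") element initialisations"
echo "array built once:         $n_contigs element initialisations"
```

Under that assumption the per-contig rebuild costs roughly 4 × 10^12 initialisations for this assembly, against about 2 × 10^6 if the array is built once. With a few hundred contigs the same quadratic term is negligible.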

In my quick test, time ../../mosdepth -x t mg.reads.sorted.bam --by assembly.bed3 --thresholds 1,10,30 should now take 25 minutes, whereas before it would probably have taken weeks to finish.

I'll make a new release when I have this done and tested.

Great to hear! Thanks for your efforts!

Hi @ohickl, would you give this binary a try? I removed a couple more things that would be slow for many chroms.

you can gunzip, chmod +x and then run as ./mosdepth_dev [args]

mosdepth_dev.gz

Looks good! Barely 10 min per run for the handful of runs I tried. I also appreciate the progress output on stdout. Thanks a lot!
Is there anything I should know regarding the output or arguments, or should it behave exactly as before?

any options should be fine, I think. Let me know if you see any huge changes with different arguments.