Suggestions for speeding up Nanoplot

Question

SHuang-Broad opened this issue 2 years ago

Hi,

We are aiming to reduce the time spent in NanoPlot, which typically takes a few hours to finish with the following parameters:

NanoPlot \
    -t 16 \
    -c orangered \
    --N50 \
    --tsv_stats \
    --bam "${bam}"

I have collected an example resource usage profile, attached below.

Can you offer any insight into where we could reduce the runtime, other than by allocating more threads? For example, do you expect the process to be IO bound? Or would keeping more reads in memory (via a newly exposed parameter) help?
Thanks!

Steve

Wouter De Coster · Answer 1 · Sun Jan 02 2022 20:36:01 GMT+0800 (China Standard Time)

Hi Steve,

That is an interesting question, and thank you for the profiling. Do you have the log file of this run? That contains timestamps and would help me to match e.g. the CPU usage with the steps in the code.

NanoPlot starts with extracting all data from the input file, in your case the bam file. It can use multiple threads to extract from multiple chromosomes simultaneously. I agree it would be great to make things faster...

Wouter

Steve Huang · Answer 2 · Sun Jan 02 2022 23:59:57 GMT+0800 (China Standard Time)

Hi Wouter,

Thanks for the quick reply!
The log file is available, but it doesn't seem to offer much insight. This is all we got:

+ NanoPlot -t 16 -c orangered --N50 --tsv_stats --bam /cromwell_root/broad-gp-pacbio-outgoing/results/PBFlowcell/m64297e_211125_022925/reads/ccs/aligned/m64297e_211125_022925.bam
Unable to find block device for filesystem /dev/disk/by-id/google-local-disk.
Guessing present but unused sdb is the correct block device.
[W::hts_idx_load3] The index file is older than the data file: /cromwell_root/broad-gp-pacbio-outgoing/results/PBFlowcell/m64297e_211125_022925/reads/ccs/aligned/m64297e_211125_022925.bam.bai
[W::hts_idx_load3] The index file is older than the data file: /cromwell_root/broad-gp-pacbio-outgoing/results/PBFlowcell/m64297e_211125_022925/reads/ccs/aligned/m64297e_211125_022925.bam.bai
[W::hts_idx_load3] The index file is older than the data file: /cromwell_root/broad-gp-pacbio-outgoing/results/PBFlowcell/m64297e_211125_022925/reads/ccs/aligned/m64297e_211125_022925.bam.bai

In the meantime, I ran another experiment on an SSD, which didn't offer much improvement, suggesting the run may not be IO bound.
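One rough way to cross-check the IO-bound question is to compare the CPU time a command consumes with its wall-clock time. The sketch below is only an illustration: `cpu_to_wall_ratio` is a hypothetical helper, not part of NanoPlot, and it relies on the Unix-only `resource` module. A ratio near the thread count would point to CPU-bound work, a ratio near 1 to a single busy thread, and a ratio near 0 to time spent waiting (for example on IO).

```python
import resource
import subprocess
import sys
import time


def cpu_to_wall_ratio(cmd):
    """Run cmd and return (user + system CPU seconds) / wall-clock seconds.

    Hypothetical helper for illustration; Unix only.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # RUSAGE_CHILDREN accumulates over all waited-for children, so take the delta.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return cpu / wall


if __name__ == "__main__":
    # Stand-in command that mostly waits; replace with the real command line.
    demo = [sys.executable, "-c", "import time; time.sleep(0.5)"]
    print(f"CPU/wall ratio: {cpu_to_wall_ratio(demo):.2f}")
```

Passing the NanoPlot command line as a list in place of `demo` would give one number for the whole run; it does not say which step is slow.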

Oh, I forgot to mention: we are using the image
quay.io/biocontainers/nanoplot:1.35.5--pyhdfd78af_0. Should I turn on a debug flag (and if so, how)?

Thanks!
Steve

Steve Huang · Answer 3 · Wed Jan 05 2022 00:58:26 GMT+0800 (China Standard Time)

Hi Wouter,

I've dug a bit deeper, and it looks like, for my use case (i.e. a WGS BAM), this line

https://github.com/wdecoster/nanoget/blob/e130bb016f7af7844e7d4145f05f62360ebcd6dd/nanoget/extraction_functions.py#L166

is where information is extracted per chromosome? However, this is a human BAM, and the profiling suggests that not all threads are being used.
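To illustrate why per-chromosome parallelism could leave threads idle, here is a small simulation. It is a sketch under stated assumptions, not nanoget's code: extraction time is assumed proportional to chromosome length, the lengths are approximate GRCh38 values rounded to Mbp, and each task is assumed to go to the next free worker in input order.

```python
import heapq

# Approximate GRCh38 chromosome lengths in Mbp (rounded; illustrative only).
CHROM_MBP = {
    "chr1": 248, "chr2": 242, "chr3": 198, "chr4": 190, "chr5": 182,
    "chr6": 171, "chr7": 159, "chr8": 145, "chr9": 138, "chr10": 134,
    "chr11": 135, "chr12": 133, "chr13": 114, "chr14": 107, "chr15": 102,
    "chr16": 90, "chr17": 83, "chr18": 80, "chr19": 59, "chr20": 64,
    "chr21": 47, "chr22": 51, "chrX": 156, "chrY": 57,
}


def simulate(costs, workers):
    """Return (makespan, utilisation) when each task goes to the next free worker.

    Assumption: one task per chromosome, with cost proportional to its length.
    """
    free_at = [0.0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + cost)
    makespan = max(free_at)
    utilisation = sum(costs) / (workers * makespan)
    return makespan, utilisation


if __name__ == "__main__":
    costs = list(CHROM_MBP.values())
    for workers in (1, 4, 8, 16, 24):
        makespan, utilisation = simulate(costs, workers)
        print(f"{workers:2d} workers: makespan {makespan:6.0f}, "
              f"utilisation {utilisation:.0%}")
```

Under these assumptions the runtime can never drop below the cost of the largest chromosome, and with 16 workers the simulated utilisation stays well below 100%. That would be consistent with a profile in which not all threads are busy, though it does not prove that this is what happens in the real run.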