wdecoster / NanoPlot

Plotting scripts for long read sequencing data

Home Page:http://nanoplot.bioinf.be

Suggestions for speeding up Nanoplot

SHuang-Broad opened this issue

Hi,

We are trying to reduce NanoPlot's runtime. With the following parameters, a run typically takes a few hours to finish.

NanoPlot \
    -t 16 \
    -c orangered \
    --N50 \
    --tsv_stats \
    --bam "${bam}"

I have collected an example of the resource usage, attached below.

[Image: nanoplot resources]

Can you offer any insight into where we could lower the runtime, other than by allocating more threads? For example, do you expect the process to be IO bound? Or would keeping more reads in memory (via a newly exposed parameter) help?
Thanks!

Steve

Hi Steve,

That is an interesting question, and thank you for the profiling. Do you have the log file of this run? It contains timestamps, which would help me match, for example, the CPU usage to the steps in the code.

NanoPlot starts with extracting all data from the input file, in your case the bam file. It can use multiple threads to extract from multiple chromosomes simultaneously. I agree it would be great to make things faster...

Wouter
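
The per-chromosome extraction Wouter describes could look roughly like the sketch below. It is not NanoPlot's or nanoget's actual code. The function names are hypothetical, it collects only read lengths, and it assumes the third-party pysam library and an indexed BAM.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def read_lengths_for_contig(bam_path, contig):
    """Return the query length of each primary alignment on one contig."""
    import pysam  # third-party; imported lazily so this module loads without it

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return [
            read.query_length
            for read in bam.fetch(contig)  # fetch() requires a .bai index
            if not (read.is_unmapped or read.is_secondary or read.is_supplementary)
        ]


def extract_per_contig(contigs, worker, threads, executor_cls=ProcessPoolExecutor):
    """Run worker(contig) for every contig on a pool and flatten the results."""
    with executor_cls(max_workers=threads) as pool:
        per_contig = pool.map(worker, contigs)  # one task per contig, order preserved
        return [value for chunk in per_contig for value in chunk]


def read_lengths_from_bam(bam_path, threads):
    """Hypothetical entry point: one extraction task per reference sequence."""
    import pysam

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        contigs = list(bam.references)
    return extract_per_contig(contigs, partial(read_lengths_for_contig, bam_path), threads)
```

Because each contig is a single task, the number of workers that can be busy at once is limited by the number of contigs that still have work left, regardless of the value passed to -t.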

Hi Wouter,

Thanks for the quick reply!
The log file is available, but it doesn't seem to offer much insight. This is all we got:

+ NanoPlot -t 16 -c orangered --N50 --tsv_stats --bam /cromwell_root/broad-gp-pacbio-outgoing/results/PBFlowcell/m64297e_211125_022925/reads/ccs/aligned/m64297e_211125_022925.bam
Unable to find block device for filesystem /dev/disk/by-id/google-local-disk.
Guessing present but unused sdb is the correct block device.
[W::hts_idx_load3] The index file is older than the data file: /cromwell_root/broad-gp-pacbio-outgoing/results/PBFlowcell/m64297e_211125_022925/reads/ccs/aligned/m64297e_211125_022925.bam.bai
[W::hts_idx_load3] The index file is older than the data file: /cromwell_root/broad-gp-pacbio-outgoing/results/PBFlowcell/m64297e_211125_022925/reads/ccs/aligned/m64297e_211125_022925.bam.bai
[W::hts_idx_load3] The index file is older than the data file: /cromwell_root/broad-gp-pacbio-outgoing/results/PBFlowcell/m64297e_211125_022925/reads/ccs/aligned/m64297e_211125_022925.bam.bai

In the meantime, I ran another experiment on an SSD, which didn't offer much improvement, suggesting that the run may not be IO bound.
[Image: nanoplot resources local]

Oh, I forgot to mention: we are using the image quay.io/biocontainers/nanoplot:1.35.5--pyhdfd78af_0. Should I turn on a debug flag, and if so, how?

Thanks!
Steve
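
One rough way to check the IO-bound question, independent of the disk type, is to compare the CPU time a command consumes with its wall-clock time. The sketch below is a generic, Unix-only helper and not part of NanoPlot. The function name `cpu_vs_wall` and the interpretation thresholds are assumptions.

```python
import resource
import subprocess
import time


def cpu_vs_wall(cmd):
    """Run cmd and report wall time, child CPU time, and average busy cores."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - t0
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # User + system CPU seconds consumed by the child and its waited-for descendants.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return {"wall_s": wall, "cpu_s": cpu, "avg_cores_busy": cpu / wall}


# Example (path is a placeholder):
# stats = cpu_vs_wall(["NanoPlot", "-t", "16", "--bam", "reads.bam"])
```

As a loose reading: an average close to the thread count points to CPU-bound work, an average near 1 points to a mostly serial phase, and an average well below 1 suggests the process spends much of its time waiting, for example on IO.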

Hi Wouter,

I've dug a bit deeper, and it looks like for my use case (i.e. a WGS BAM), this line

https://github.com/wdecoster/nanoget/blob/e130bb016f7af7844e7d4145f05f62360ebcd6dd/nanoget/extraction_functions.py#L166

is where information is extracted per chromosome? However, this is a human BAM, and the profiling suggests that not all threads are being used.
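
One possible explanation for idle threads is load imbalance: if each chromosome is one task, the large chromosomes dominate while workers that drew small ones sit idle. The simulation below is only a sketch of that effect. It assumes that extraction cost is proportional to chromosome length, uses rounded GRCh38 lengths in Mb, and hands tasks out in file order to whichever worker frees up first. None of this is measured from NanoPlot.

```python
import heapq

# Approximate GRCh38 chromosome lengths in Mb (rounded, illustrative only).
CHROM_MB = {
    "chr1": 249, "chr2": 242, "chr3": 198, "chr4": 190, "chr5": 182,
    "chr6": 171, "chr7": 159, "chr8": 145, "chr9": 138, "chr10": 134,
    "chr11": 135, "chr12": 133, "chr13": 114, "chr14": 107, "chr15": 102,
    "chr16": 90, "chr17": 83, "chr18": 80, "chr19": 59, "chr20": 64,
    "chr21": 47, "chr22": 51, "chrX": 156, "chrY": 57,
}


def simulate_pool(task_costs, workers):
    """Give each task, in order, to the worker that becomes free first."""
    free_at = [0.0] * workers  # min-heap of times at which each worker is free
    heapq.heapify(free_at)
    for cost in task_costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    makespan = max(free_at)  # wall time until the last task finishes
    utilization = sum(task_costs) / (workers * makespan)
    return makespan, utilization


if __name__ == "__main__":
    costs = list(CHROM_MB.values())
    for n in (1, 4, 8, 16, 32):
        makespan, util = simulate_pool(costs, n)
        print(f"{n:>2} workers: makespan={makespan:.0f} utilization={util:.0%}")
```

Under these assumptions the wall time can never fall below the cost of the largest single chromosome, so adding threads beyond a certain point only lowers average utilization. Splitting chromosomes into smaller regions would be one way to even out the load, though whether that applies here depends on where the time is actually spent.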