grid options, meryl-count and disk quota
brunocontrerasmoreira opened this issue · comments
Hi, I am testing Canu on a Slurm Linux cluster for the first time with 2.5 TB of compressed HiFi reads. This is the bash script I submitted with sbatch:
#SBATCH --mem=4G
#SBATCH --time=6-24:00:00
module load java
$HOME/soft/canu-2.2/build/bin/canu -p Tt -d Tt genomeSize=17g useGrid=true gridOptions='--mem-per-cpu=24G' -pacbio-hifi $HOME/fastq/*
The stderr of this job contains:
...
-- Slurm support detected. Resources available:
-- 126 hosts with 52 cores and 182 GB memory.
-- 24 hosts with 128 cores and 1854 GB memory.
-- 4 hosts with 128 cores and 878 GB memory.
-- 155 hosts with 128 cores and 438 GB memory.
-- 34 hosts with 256 cores and 683 GB memory.
--
-- (tag)Threads
-- (tag)Memory |
-- (tag) | | algorithm
-- ------- ---------- -------- -----------------------------
-- Grid: meryl 24.000 GB 8 CPUs (k-mer counting)
-- Grid: hap 16.000 GB 16 CPUs (read-to-haplotype assignment)
-- Grid: cormhap 42.000 GB 16 CPUs (overlap detection with mhap)
-- Grid: obtovl 24.000 GB 16 CPUs (overlap detection)
-- Grid: utgovl 24.000 GB 16 CPUs (overlap detection)
-- Grid: cor -.--- GB 4 CPUs (read correction)
-- Grid: ovb 4.000 GB 1 CPU (overlap store bucketizer)
-- Grid: ovs 32.000 GB 1 CPU (overlap store sorting)
-- Grid: red 32.000 GB 10 CPUs (read error detection)
-- Grid: oea 8.000 GB 1 CPU (overlap error adjustment)
-- Grid: bat 1024.000 GB 64 CPUs (contig construction with bogart)
-- Grid: cns -.--- GB 8 CPUs (consensus)
--
-- Found PacBio HiFi reads in 'Apin.seqStore':
-- Libraries:
-- PacBio HiFi: 20
-- Reads:
-- Corrected: 3400000015033
-- Corrected and Trimmed: 3400000015033
...
-- BEGIN ASSEMBLY
--
-- Running jobs. First attempt out of 2.
--
-- 'meryl-count.jobSubmit-01.sh' -> job 7050352 tasks 1-96.
However, the meryl-count jobs fail; here is the last line of meryl-count.7051213_65.out:
Failed to open './Apin.65.meryl.WORKING/0x001110[066].merylData' for writing: Disk quota exceeded
When I checked the folder where this job was running, I saw a large number of files:
ls Apin.65.meryl.WORKING | wc -l
8334
How can I change the Slurm settings to:
- reduce the number of temporary files created by meryl-count
- reduce the number of meryl-count jobs running at a time
Thanks for your help
Good morning,
I am a member of the supercomputing center where Bruno is running this program, and I would like to add one more question to this thread: is there a way to have these temporary files generated on the local scratch disks of the compute nodes?
Regards,
David
I found out that meryl was being invoked with at most 21 GB of RAM, despite my allowing more RAM via gridOptions:
/path/to/canu-2.2/build/bin/meryl k=22 threads=8 memory=21 \
count \
segment=$jobid/96 ../../Apin.seqStore \
output ./Apin.$jobid.meryl.WORKING \
When I edited this script and increased the limit to memory=128G, the number of temporary files per array job dropped below 100.
gridOptions is just passed through; it shouldn't be used to request resources, as Canu does that automatically on a per-job basis. See https://canu.readthedocs.io/en/latest/parameter-reference.html# for more details. You can specify meryl memory and threads instead, which would update the above script and also request approximately that memory from the grid. You can also limit concurrent jobs by modifying the grid array parameters (gridEngineArrayOption="-a ARRAY_JOBS%4" on Slurm would limit Canu to at most 4 concurrent jobs).
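Putting those suggestions together, a relaunch could look roughly like this. This is a sketch, not a verified command line: the merylMemory/merylThreads values and the %4 throttle are illustrative, and should be tuned to your cluster's quotas.

```shell
# Sketch: rerun Canu with explicit meryl resources instead of gridOptions,
# and throttle the Slurm array to 4 concurrent tasks.
# Values (128 GB, 8 threads, %4) are examples, not recommendations.
$HOME/soft/canu-2.2/build/bin/canu \
  -p Tt -d Tt genomeSize=17g useGrid=true \
  merylMemory=128 merylThreads=8 \
  'gridEngineArrayOption=-a ARRAY_JOBS%4' \
  -pacbio-hifi $HOME/fastq/*
```

Canu replaces the ARRAY_JOBS placeholder with the actual task range when it submits each array, so the %4 suffix is what enforces the concurrency limit.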
As for local disk, there is an option for staging (https://canu.readthedocs.io/en/latest/parameter-reference.html#file-staging), but it isn't used for this step, as meryl counting is usually not an I/O bottleneck compared to later steps. If you're running out of space already here, you'll likely need significantly more disk space. A human genome with 40x HiFi coverage requires about 200 GB to compute; given that your genome is much larger and likely more repetitive, I'd plan on at least 2 TB of space being available to run.
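For the later I/O-heavy steps, the staging option mentioned above may answer David's question about node-local scratch. A hedged sketch, assuming the stageDirectory parameter from the Canu docs and a hypothetical /local/scratch mount (substitute whatever your site provides on each compute node):

```shell
# Sketch: point Canu's file staging at node-local scratch.
# /local/scratch is a placeholder path; use your cluster's local-disk mount.
$HOME/soft/canu-2.2/build/bin/canu \
  -p Tt -d Tt genomeSize=17g useGrid=true \
  stageDirectory=/local/scratch \
  -pacbio-hifi $HOME/fastq/*
```

Per the docs, staging copies a job's working data to that directory on the execution node and copies results back, so the shared filesystem only sees the final files.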
Hi @skoren, we managed to get the meryl-count jobs done by increasing the disk quota and giving them more RAM. The resulting folder 0-mercounts/ takes 3.4 TB of disk; can this help estimate how much disk space we need for the remaining jobs?
Thanks