Unable to customize resources for each step using slurm
tirohia opened this issue
When submitting a job to a cluster using slurm, the options at the top of bpipe.config are found and translated properly, but any options within the command blocks are ignored.
For example, a bpipe.config with the following results in a job submitted with a 4-hour time limit and 8 processors, each with 16 GB of memory:
```
executor="slurm"
queue="bigmem"
walltime="04:00:00"
account="uoa02461"
procs=8
memory=16
```
Whereas the following results in a job being submitted with the default parameters: a 1-hour time limit and 2 processors.
```
executor="slurm"
queue="bigmem"
commands {
    trim {
        procs="8"
        walltime="02:00:00"
        memory="16"
        modules="cutadapt"
    }
}
```
It appears to pick up and load the modules from inside the command block though.
I'm assuming the issue is somewhere within https://github.com/ssadedin/bpipe/blob/master/bin/bpipe-slurm.sh? I haven't been able to figure it out yet though.
Looks like slurm needs to be converted to the template-based executor model:
https://github.com/ssadedin/bpipe/tree/master/src/main/templates/bpipe/executor
Yes, it does need migrating, but this definitely shouldn't be broken all the same. I will look into it. Unfortunately I have lost access to my slurm test cluster so I may be a bit reliant on some help testing it.
I think the problem is equally likely to lie in the Slurm groovy code though:
Thanks for reporting the problem!
I have access to a slurm cluster for testing
Ditto. I'm not familiar with the codebase for bpipe (yet) but if I can help with testing or anything, I'm happy to help.
I had a quick test of this and it seemed to "do the right thing" as far as I could verify, though there's one aspect I'm curious about. When Bpipe goes to run the job, it should write a file at .bpipe/commandtmp/&lt;id&gt;/job.slurm. From that you will be able to see the exact parameters that Bpipe tried to send to SLURM. So one of two scenarios is happening: either Bpipe is not resolving the right parameters at all, or Bpipe is resolving the parameters but the way it sends them doesn't work. Could you take a look, see if you can find the job file, and figure out which scenario is true? Thanks!
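A quick way to compare runs is to dump the `#SBATCH` headers of every generated job script. This is a shell sketch, assuming the default `.bpipe/commandtmp/<id>/job.slurm` layout described above; `show_slurm_jobs` is a hypothetical helper, not part of Bpipe:

```shell
# Print the #SBATCH headers of every generated SLURM job script
# found under a Bpipe run directory (hypothetical helper).
show_slurm_jobs() {
    find "$1" -name 'job.slurm' 2>/dev/null | sort | while read -r f; do
        echo "== $f =="
        grep '^#SBATCH' "$f"
    done
}

# Run from the pipeline's working directory:
show_slurm_jobs .bpipe/commandtmp
```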
Apologies for the delay.
So my bpipe.config is currently:
```
executor="slurm"
queue="bigmem"
//walltime="04:00:00"
//custom_submit_options="your option here"
account="uoa02461"
//procs=8
//memory=15
TMP="tmp"
modules="BEDTools FastQC cutadapt SAMtools picard BWA"
commands {
    trim {
        procs="16"
        walltime="02:00:00"
        memory="15"
    }
    otherthings {
    }
}
```
When I look down in the commandtmp folder, the generated slurm job is:
```
#!/bin/bash
#SBATCH --job-name=trim
#SBATCH --account uoa02461
#SBATCH --mem=4096
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH -p bigmem
set -o errexit
module load BEDTools
module load FastQC
module load cutadapt
module load SAMtools
module load picard
module load BWA
(cutadapt --minimum-length 50 -a AGATCGGAAGAGC -q 30 -o sample_R1.fq.gz.trim -p sample_R1.fq.gz.2.trim raw/sample_R1.fq.gz raw/sample_R2.fq.gz) > .bpipe/commandtmp/81/81.out 2> .bpipe/commandtmp/81/81.err
```
Which, given that the trim command specifies a walltime of 2 hours, suggests to me that it's not picking the settings up from the commands section when it creates the slurm file.
Been doing some digging. Not sure how useful it will be.
bpipe-slurm.sh is, I think, only receiving some values into the OPTIONAL_ENV_VARS. I can specify the variables in the named section of bpipe.config, and the only values that come through in OPTIONAL_ENV_VARS are PROCS, QUEUE and JOBDIR.
- PROCS remains 1, whatever number I specify in bpipe.config, so it looks like it's picking the value up from the default. (I'm printing out the contents of OPTIONAL_ENV_VARS before the variables are tested to see whether they're null or not.)
- QUEUE has the correct queue in it, so it's picking that up from the top level of bpipe.config; I've not attempted to specify that anywhere else.
- JOBDIR, I'm assuming, is automatically generated.
I put a loop in bpipe-slurm.sh to show all the OPTIONAL_ENV_VARS, much the same as is done for the ESSENTIAL_ENV_VARS:
```
for v in $OPTIONAL_ENV_VARS; do
    echo "$v" >> /nesi/project/uoa00571/src/debug
    eval "k=\$$v"
    echo "$k" >> /nesi/project/uoa00571/src/debug
done
```
Which returns:
```
WALLTIME
PROCS
1
QUEUE
bigmem
JOBDIR
/scale_wlg_persistent/filesets/project/uoa00571/src/.bpipe/commandtmp/614
JOBTYPE
MEMORY
CUSTOM_SUBMIT_OPTS
```
Long story short, I think either OPTIONAL_ENV_VARS isn't getting loaded properly or it isn't getting passed properly to bpipe-slurm.sh. I can't see where it's loaded or where bpipe-slurm.sh is called from, though, so I'll have to continue looking.
Just had another look at the bpipe.config and the generated slurm job - there's definitely a disconnect there. It makes me realise something I should have checked much earlier: is Bpipe matching your commands to the configs correctly?
The important thing is: you have, for example, a configuration for trim which matches to the cutadapt command. For those two to get matched, you would need to explicitly name the configuration trim in your exec statement. For example:
```
trim = {
    exec """
        cutadapt ....
    ""","trim" // <==== this is needed!
}
```
Can you confirm you have that last part?
That's definitely part of it. I definitely didn't have that. I've added that in, so in my pipeline file I have:
```
trim = {
    output.dir="data/intermediateFiles"
    multi "cutadapt -a AGATCGGAAGAGC -q 30 --minimum-length=50 -o $output1 -p $output2 $input1 $input2",
          "cutadapt -a AGATCGGAAGAGC -q 30 --minimum-length=50 -o $output3 -p $output4 $input3 $input4","trim"
}
```
And you're right, that does make it pick up the correct section from the bpipe.config. Sort of. It generates 3 slurm job files, and two of those are still picking up the defaults, i.e.
```
#!/bin/bash
#SBATCH --job-name=trim
#SBATCH --account uoa02461
#SBATCH --mem=8096
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH -p bigmem
set -o errexit
module load cutadapt
(cutadapt -a AGATCGGAAGAGC -q 30 --minimum-length=50 -o data/intermediateFiles/sampled_R1.fq.gz.trim -p data/intermediateFiles/sampled_R1.fq.gz.2.trim ../data/sampled_R1.fq.gz ../data/sampled_R2.fq.gz) > .bpipe/commandtmp/619/619.out 2> .bpipe/commandtmp/619/619.err
```
These two files, if you take them out of the pipeline and submit them directly to slurm, work.
The third file picks up the settings from the bpipe.config, but it's not a slurm file that will work:
```
#!/bin/bash
#SBATCH --job-name=trim
#SBATCH --account uoa02461
#SBATCH --mem=4gb
#SBATCH --time=0:10:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH -p bigmem
set -o errexit
module load cutadapt
(trim) > .bpipe/commandtmp/621/621.out 2> .bpipe/commandtmp/621/621.err
```
So adding that "trim" tag to the end of the command in the pipeline file appears to generate an additional slurm file with the correct settings, in addition to the files with the incorrect settings. Or rather, that's what I think is happening; I'll have a closer look tomorrow.
This, of course, assumes I've added the tag to the command in the pipeline file correctly. Also, I can't see any reference to it in the documentation - is it not there, or have I just missed it?
That sounds promising. Are the old / incorrect ones just left over from before? Bpipe wouldn't delete them unless you do. Thanks for following up!
Nope, the new incorrect files are very definitely newly generated. Almost as if it's treating the extra tag in the pipeline file as a third instruction for which a slurm file/job needs to be generated, now that I think about it.
More than happy to follow stuff up, I've got several hundred samples to put through a pipeline in the next couple of months, I'm very much hoping to get this working :)
Looking closer, yeah, it's definitely generating a 3rd slurm job. I've been looking through the jobs with sacct after running the pipeline, and I get something like this:
```
1730112         trim    00:00:02  00:01.174  2        COMPLETED
1730112.batch   batch   00:00:02  00:01.174  2  528K  COMPLETED
1730112.extern  extern  00:00:02  00:00:00   2    4K  COMPLETED
1730113         trim    00:00:01  00:00.445  2        FAILED
1730113.batch   batch   00:00:01  00:00.444  2  529K  FAILED
1730113.extern  extern  00:00:01  00:00:00   2   24K  COMPLETED
1730114         trim    00:00:02  00:01.161  2        COMPLETED
1730114.batch   batch   00:00:02  00:01.160  2  527K  COMPLETED
1730114.extern  extern  00:00:02  00:00:00   2   24K  COMPLETED
```
If I look in the commandtmp logs for the malformed slurm job, in its error log I have this:
```
/var/spool/slurm/job1730113/slurm_script: line 15: trim: command not found
```
So without the extra trim tag, it won't pick up the settings for slurm. With it, it produces an extra slurm job, which fails, causes the entire step to fail, and halts the pipeline.
So. Here's a thing.
It might have something to do with multi. I split my multi command into two exec commands and added the "trim" tag to the end of each line. It looks like it generated the correct slurm files and ran them. I suspect the innards of multi vs exec may be beyond me.
The downside of this is that if the stage has two jobs, it appears to wait for the first slurm job to complete before it creates and submits the 2nd job. Which isn't terrible, though it's not ideal.
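For reference, a sketch of the split-exec workaround described above, with the command text taken from the earlier multi version. This is how I read the working setup, not a confirmed fix; the "trim" config name must match the commands block in bpipe.config:

```groovy
trim = {
    output.dir = "data/intermediateFiles"
    // Each exec names the "trim" config explicitly, so the per-command
    // slurm settings (procs, walltime, memory) are applied to each job.
    // Note: the two execs run one after the other, not in parallel.
    exec "cutadapt -a AGATCGGAAGAGC -q 30 --minimum-length=50 -o $output1 -p $output2 $input1 $input2", "trim"
    exec "cutadapt -a AGATCGGAAGAGC -q 30 --minimum-length=50 -o $output3 -p $output4 $input3 $input4", "trim"
}
```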