Unable to customize resources for each step using slurm
tirohia opened this issue
When submitting a job to a cluster using slurm, the options at the top of bpipe.config are found and translated properly, but any options within the command blocks are ignored.
For example, a bpipe.config with the following results in a job submitted with a 4-hour time limit and 8 processors, each with 16 GB of memory:
```
executor="slurm"
queue="bigmem"
walltime="04:00:00"
account="uoa02461"
procs=8
memory=16
```
Whereas the following results in a job being submitted with the default parameters: a 1-hour time limit and 2 processors.
```
executor="slurm"
queue="bigmem"
commands {
    trim {
        procs="8"
        walltime="02:00:00"
        memory="16"
        modules="cutadapt"
    }
}
```
It appears to pick up and load the modules from inside the command block though.
I'm assuming the issue is somewhere within https://github.com/ssadedin/bpipe/blob/master/bin/bpipe-slurm.sh? I haven't been able to figure it out yet though.
Looks like slurm needs to be converted to the template-based executor model:
https://github.com/ssadedin/bpipe/tree/master/src/main/templates/bpipe/executor
Yes, it does need migrating, but this definitely shouldn't be broken all the same. I will look into it. Unfortunately I have lost access to my slurm test cluster so I may be a bit reliant on some help testing it.
I think the problem is equally likely to lie in the Slurm groovy code though:
Thanks for reporting the problem!
I have access to a slurm cluster for testing
Ditto. I'm not familiar with the codebase for bpipe (yet) but if I can help with testing or anything, I'm happy to help.
I had a quick test of this and it seemed to "do the right thing" as far as I could verify, though there's one aspect I'm curious about. When Bpipe goes to run the job, it should write a file at .bpipe/commandtmp/&lt;id&gt;/job.slurm. From that you will be able to see the exact parameters that Bpipe tried to send to SLURM. So one of two scenarios is happening: either Bpipe is not resolving the right parameters at all, or Bpipe is resolving the parameters but the way it sends them doesn't work. Could you take a look, see if you can find the job file, and figure out which scenario is true? Thanks!
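A quick way to compare runs is to dump the `#SBATCH` headers of every generated job script. This is a shell sketch, assuming the default `.bpipe/commandtmp/<id>/job.slurm` layout described above; `show_slurm_jobs` is a hypothetical helper, not part of Bpipe:

```shell
# Print the #SBATCH headers of every generated SLURM job script
# found under a Bpipe run directory (hypothetical helper).
show_slurm_jobs() {
    find "$1" -name 'job.slurm' 2>/dev/null | sort | while read -r f; do
        echo "== $f =="
        grep '^#SBATCH' "$f"
    done
}

# Run from the pipeline's working directory:
show_slurm_jobs .bpipe/commandtmp
```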
Apologies for the delay.
So my bpipe.config is currently:
```
executor="slurm"
queue="bigmem"
//walltime="04:00:00"
//custom_submit_options="your option here"
account="uoa02461"
//procs=8
//memory=15
TMP="tmp"
modules="BEDTools FastQC cutadapt SAMtools picard BWA"
commands {
    trim {
        procs="16"
        walltime="02:00:00"
        memory="15"
    }
    otherthings {
    }
}
```
When I look down in the commandtmp folder, the generated slurm job is:
```
#!/bin/bash
#SBATCH --job-name=trim
#SBATCH --account uoa02461
#SBATCH --mem=4096
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH -p bigmem
set -o errexit
module load BEDTools
module load FastQC
module load cutadapt
module load SAMtools
module load picard
module load BWA
(cutadapt --minimum-length 50 -a AGATCGGAAGAGC -q 30 -o sample_R1.fq.gz.trim -p sample_R1.fq.gz.2.trim raw/sample_R1.fq.gz raw/sample_R2.fq.gz) > .bpipe/commandtmp/81/81.out 2> .bpipe/commandtmp/81/81.err
```
Which, given that the trim command specifies a walltime of 2 hours, suggests to me that it's not picking the settings up from the commands section when it creates the slurm file.
Been doing some digging. Not sure how useful it will be.
bpipe-slurm.sh is, I think, only receiving some values into the OPTIONAL_ENV_VARS. I can specify the variables in the named section of bpipe.config, and the only values that come through in OPTIONAL_ENV_VARS are PROCS, QUEUE and JOBDIR.
- PROCS remains 1, whatever number I specify in bpipe.config, so it looks like it's picking the value up from the default. (I'm printing out the contents of OPTIONAL_ENV_VARS before the variables are tested to see whether they're null or not.)
- QUEUE has the correct queue in it, so it's picking that up from the top level of bpipe.config; I've not attempted to specify that anywhere else.
- JOBDIR, I'm assuming, is automatically generated.
I put a loop in bpipe-slurm.sh to show all the OPTIONAL_ENV_VARS, much the same as is done for the ESSENTIAL_ENV_VARS:
```
for v in $OPTIONAL_ENV_VARS; do
    echo "$v" >> /nesi/project/uoa00571/src/debug
    eval "k=\$$v"
    echo "$k" >> /nesi/project/uoa00571/src/debug
done
```
Which returns:
```
WALLTIME
PROCS
1
QUEUE
bigmem
JOBDIR
/scale_wlg_persistent/filesets/project/uoa00571/src/.bpipe/commandtmp/614
JOBTYPE
MEMORY
CUSTOM_SUBMIT_OPTS
```
Long story short, I think either OPTIONAL_ENV_VARS isn't getting loaded properly or it isn't getting passed properly to bpipe-slurm.sh. I can't see where it's loaded or where bpipe-slurm.sh is called from, though, so I'll have to continue looking.
Just had another look at the bpipe.config and the generated slurm job - there's definitely a disconnect there. It makes me realise something I should have checked much earlier: is Bpipe matching your commands to the configs correctly?
The important thing is: you have, for example, a configuration for trim which matches to the cutadapt command. For those two to get matched, you would need to explicitly name the configuration trim in your exec statement. For example:
```
trim = {
    exec """
        cutadapt ....
    ""","trim" // <==== this is needed!
}
```
Can you confirm you have that last part?
That's definitely part of it. I definitely didn't have that. I've added that in, so in my pipeline file I have:
```
trim = {
    output.dir="data/intermediateFiles"
    multi "cutadapt -a AGATCGGAAGAGC -q 30 --minimum-length=50 -o $output1 -p $output2 $input1 $input2",
          "cutadapt -a AGATCGGAAGAGC -q 30 --minimum-length=50 -o $output3 -p $output4 $input3 $input4","trim"
}
```
And you're right, that does make it pick up the correct section from the bpipe.config. Sort of. It generates 3 slurm job files, and two of those are still picking up the defaults, i.e.
```
#!/bin/bash
#SBATCH --job-name=trim
#SBATCH --account uoa02461
#SBATCH --mem=8096
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH -p bigmem
set -o errexit
module load cutadapt
(cutadapt -a AGATCGGAAGAGC -q 30 --minimum-length=50 -o data/intermediateFiles/sampled_R1.fq.gz.trim -p data/intermediateFiles/sampled_R1.fq.gz.2.trim ../data/sampled_R1.fq.gz ../data/sampled_R2.fq.gz) > .bpipe/commandtmp/619/619.out 2> .bpipe/commandtmp/619/619.err
```
These two files, if you take them out of the pipeline and submit them directly to slurm, work.
The third file picks up the settings from the bpipe.config, but it's not a slurm file that will work:
```
#!/bin/bash
#SBATCH --job-name=trim
#SBATCH --account uoa02461
#SBATCH --mem=4gb
#SBATCH --time=0:10:00
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH -p bigmem
set -o errexit
module load cutadapt
(trim) > .bpipe/commandtmp/621/621.out 2> .bpipe/commandtmp/621/621.err
```
So adding that "trim" tag to the end of the command in the pipeline file appears to generate an additional slurm file with the correct settings, in addition to the files with the incorrect settings. Or rather, that's what I think is happening; I'll have a closer look tomorrow.
This, of course, assumes I've added the tag to the command in the pipeline file correctly. Also, I can't see any reference to it in the documentation - is it not there, or have I just missed it?
That sounds promising. Are the old / incorrect ones just left over from before? Bpipe wouldn't delete them unless you do. Thanks for following up!
Nope, the new incorrect files are very definitely newly generated. Almost as if it's treating the extra tag in the pipeline file as a third instruction for which a slurm file/job needs to be generated, now that I think about it.
More than happy to follow stuff up, I've got several hundred samples to put through a pipeline in the next couple of months, I'm very much hoping to get this working :)
Looking closer, yeah, it's definitely generating a 3rd slurm job. I've been looking through the jobs with sacct after running the pipeline, and I get something like this:
```
1730112         trim    00:00:02  00:01.174  2        COMPLETED
1730112.batch   batch   00:00:02  00:01.174  2  528K  COMPLETED
1730112.extern  extern  00:00:02  00:00:00   2    4K  COMPLETED
1730113         trim    00:00:01  00:00.445  2        FAILED
1730113.batch   batch   00:00:01  00:00.444  2  529K  FAILED
1730113.extern  extern  00:00:01  00:00:00   2   24K  COMPLETED
1730114         trim    00:00:02  00:01.161  2        COMPLETED
1730114.batch   batch   00:00:02  00:01.160  2  527K  COMPLETED
1730114.extern  extern  00:00:02  00:00:00   2   24K  COMPLETED
```
If I look in the commandtmp logs for the malformed slurm job, in its error log I have this:
```
/var/spool/slurm/job1730113/slurm_script: line 15: trim: command not found
```
So without the extra trim tag, it won't pick up the settings for slurm. With it, it produces an extra slurm job, which fails, causes the entire step to fail, and halts the pipeline.
So. Here's a thing.
It might have something to do with multi. I split my multi command into two exec commands and added the "trim" tag to the end of each line. It looks like it generated the correct slurm files and ran them. I suspect the innards of multi vs exec may be beyond me.
The downside of this is that if the stage has two jobs, it appears to wait for the first slurm job to complete before it creates and submits the 2nd job. Which isn't terrible, though it's not ideal.
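For reference, a sketch of the split-exec workaround described above, with the command text taken from the earlier multi version. This is how I read the working setup, not a confirmed fix; the "trim" config name must match the commands block in bpipe.config:

```groovy
trim = {
    output.dir = "data/intermediateFiles"
    // Each exec names the "trim" config explicitly, so the per-command
    // slurm settings (procs, walltime, memory) are applied to each job.
    // Note: the two execs run one after the other, not in parallel.
    exec "cutadapt -a AGATCGGAAGAGC -q 30 --minimum-length=50 -o $output1 -p $output2 $input1 $input2", "trim"
    exec "cutadapt -a AGATCGGAAGAGC -q 30 --minimum-length=50 -o $output3 -p $output4 $input3 $input4", "trim"
}
```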