possible number of "qos=short" jobs passed to SLURM is capped
danisven opened this issue
UPPMAX seems to cap the number of qos=short jobs that you can have in the queue; I believe the limit is set to 10. I receive the following error when trying to submit a CCC design with three factors (14 experiments):
```
$ python gatk_snp_execute.py
Traceback (most recent call last):
  File "gatk_snp_execute.py", line 18, in <module>
    results = executor.run_pipeline_collection(pipeline)
  File "/media/data/db/data/anaconda/envs/doe_pipeline/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/base.py", line 199, in run_pipeline_collection
    self.run_jobs(job_steps, experiment_index, env_variables, **kwargs)
  File "/media/data/db/data/anaconda/envs/doe_pipeline/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/mixins.py", line 334, in run_jobs
    _, stdout, _ = self.execute_command(command, job_name=exp_name)
  File "/media/data/db/data/anaconda/envs/doe_pipeline/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/remote.py", line 159, in execute_command
    raise CommandError('\n'.join(err))
doepipeline.executor.base.CommandError: sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
```
This should be possible to fix with a try/except clause around job submission: if a submission is rejected, the job is kept in an internal queue and submitted again later. If this is added, I think it should be optional.
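A minimal sketch of the idea, not doepipeline's actual API: rejected submissions are held in an internal queue and re-attempted later. `submit_fn` is a hypothetical stand-in for whatever call runs `sbatch` and raises on rejection.

```python
import collections


class RetryingSubmitter:
    """Hold back jobs the scheduler rejects and retry them later.

    `submit_fn` is a hypothetical callable standing in for the sbatch
    submission: it returns a job id on success and raises RuntimeError
    when the scheduler rejects the submission (e.g. a QOS job limit).
    """

    def __init__(self, submit_fn):
        self.submit_fn = submit_fn
        self.pending = collections.deque()

    def submit(self, job):
        try:
            return self.submit_fn(job)
        except RuntimeError:
            # Submission limit hit: keep the job for a later attempt
            # instead of failing the whole pipeline run.
            self.pending.append(job)
            return None

    def retry_pending(self):
        """Re-attempt every held-back job; keep the ones still rejected."""
        submitted = []
        for _ in range(len(self.pending)):
            job = self.pending.popleft()
            result = self.submit(job)
            if result is not None:
                submitted.append(result)
        return submitted
```

In practice `retry_pending` would be called on a timer or whenever a running job finishes, so the queue drains as submission slots free up.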
In the example case, passing too many --qos=short jobs is, in my view, a SLURM user error, and that is something I would want surfaced to me as an error.
This is quite infrastructure-specific (i.e. an UPPMAX rule), so I also think throwing an error is the better option. That way it is up to the user to know the rules of the infrastructure in question.
SLURM-specific functionality removed; closing this issue.