clicumu / doepipeline

A Python package for optimizing processing pipelines using statistical design of experiments (DoE).

possible number of "qos=short" jobs passed to SLURM is capped

danisven opened this issue

UPPMAX seems to cap the number of qos=short jobs that a user can have in the queue at once. I think the limit is set to 10. I receive the following error when trying to submit a CCC design with three factors (14 experiments, which is more than that apparent limit):

$ python gatk_snp_execute.py
Traceback (most recent call last):
  File "gatk_snp_execute.py", line 18, in <module>
    results = executor.run_pipeline_collection(pipeline)
  File "/media/data/db/data/anaconda/envs/doe_pipeline/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/base.py", line 199, in run_pipeline_collection
    self.run_jobs(job_steps, experiment_index, env_variables, **kwargs)
  File "/media/data/db/data/anaconda/envs/doe_pipeline/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/mixins.py", line 334, in run_jobs
    _, stdout, _ = self.execute_command(command, job_name=exp_name)
  File "/media/data/db/data/anaconda/envs/doe_pipeline/lib/python3.5/site-packages/doepipeline-0.1-py3.5.egg/doepipeline/executor/remote.py", line 159, in execute_command
    raise CommandError('\n'.join(err))
doepipeline.executor.base.CommandError: sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

This should be possible to fix with a try/except clause around job submission: if a submission is rejected, the job should remain in an internal queue and be submitted again later.

If this is added, I think it should be optional.
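
For illustration, the optional retry idea discussed above could look roughly like the sketch below. It is not doepipeline code: `submit_all`, its parameters, and the `submit` callable are hypothetical names, and `CommandError` is redefined locally only so the example is self-contained. It also assumes that any `CommandError` means a rejected submission, which a real implementation would want to check more carefully.

```python
import time
from collections import deque


class CommandError(Exception):
    """Local stand-in for the CommandError seen in the traceback (assumption)."""


def submit_all(commands, submit, retry_rejected=False,
               wait_seconds=60, max_attempts=10, sleep=time.sleep):
    """Submit each command in order; optionally keep rejected jobs queued.

    `submit` is a hypothetical callable that raises CommandError when the
    scheduler rejects a job. With retry_rejected=False (the default) the
    error is raised to the caller immediately.
    """
    pending = deque(commands)   # internal queue of jobs not yet accepted
    submitted = []
    attempts = 0                # consecutive rejections of the current job
    while pending:
        command = pending[0]
        try:
            submit(command)
        except CommandError:
            if not retry_rejected:
                raise           # default: surface the scheduler error
            attempts += 1
            if attempts >= max_attempts:
                raise           # give up instead of waiting forever
            sleep(wait_seconds) # wait for running jobs to leave the queue
        else:
            pending.popleft()
            submitted.append(command)
            attempts = 0
    return submitted
```

Leaving `retry_rejected` off by default keeps the behaviour of raising the scheduler's error, so waiting and resubmitting only happens when explicitly requested.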

In the example case, where too many --qos=short jobs are submitted, I would consider it a SLURM user error, and that is something I would like to have thrown at me as an error.

This is quite infrastructure-specific (i.e. an UPPMAX rule), so I also think throwing an error would be better. That way it is up to the user to know the rules of the infrastructure in question.

SLURM-specific functionality removed, closes issue.