clicumu / doepipeline

A Python package for optimizing processing pipelines using statistical design of experiments (DoE).

Restarting failed experiments during an iteration, and resuming from a failed iteration

danisven opened this issue

I'm currently trying to optimize parameters for Manta, a structural variant caller. A recurring problem is that individual experiments fail, as experiment 19 does below:

RunManta_exp_19 has failed. (exit code 1:0)
Traceback (most recent call last):
  File "manta_sv.py", line 31, in <module>
    results = executor.run_pipeline_collection(pipeline)
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\doepipeline-0.1-py3.5.egg\doepipeline\executor\base.py", line 199, in run_pipeline_collection
    self.run_jobs(job_steps, experiment_index, env_variables, **kwargs)
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\doepipeline-0.1-py3.5.egg\doepipeline\executor\mixins.py", line 349, in run_jobs
    self.wait_until_current_jobs_are_finished()
  File "C:\Users\dasw0002\AppData\Local\Continuum\Anaconda\envs\doepipeline\lib\site-packages\doepipeline-0.1-py3.5.egg\doepipeline\executor\base.py", line 246, in wait_until_current_jobs_are_finished
    raise PipelineRunFailed(msg)
doepipeline.executor.base.PipelineRunFailed: RunManta_exp_19 has failed. (exit code 1:0)

Checking the Manta log file I can see this:

[2016-09-27T13:24:09.175970] [m196.uppmax.uu.se] [61581_1] [TaskManager] Completed command task: 'generateCandidateSV_0066' launched from master workflow
[2016-09-27T13:24:52.698109] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR] Unhandled Exception in TaskManager-Thread
[2016-09-27T13:24:52.909386] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR] Traceback (most recent call last):
[2016-09-27T13:24:52.910425] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1660, in run
[2016-09-27T13:24:52.911376] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     self._startTasks()
[2016-09-27T13:24:52.912096] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 526, in wrapped
[2016-09-27T13:24:52.912850] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     return f(self, *args, **kw)
[2016-09-27T13:24:52.913684] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1818, in _startTasks
[2016-09-27T13:24:52.914829] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     self._launchTask(task)
[2016-09-27T13:24:52.916007] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1762, in _launchTask
[2016-09-27T13:24:52.917214] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     trun = self._getCommandTaskRunner(task)
[2016-09-27T13:24:52.918028] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1745, in _getCommandTaskRunner
[2016-09-27T13:24:52.918808] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     task.setRunstate)
[2016-09-27T13:24:52.919517] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1137, in __init__
[2016-09-27T13:24:52.920267] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     BaseTaskRunner.__init__(self, runStatus, taskStr, sharedFlowLog, setRunstate)
[2016-09-27T13:24:52.921161] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1041, in __init__
[2016-09-27T13:24:52.921949] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     self.setInitialRunstate()
[2016-09-27T13:24:52.922960] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1079, in setInitialRunstate
[2016-09-27T13:24:52.923728] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     self.setRunstate("running")
[2016-09-27T13:24:52.924557] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 1076, in setRunstate
[2016-09-27T13:24:52.925421] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     self._setRunstate(*args, **kw)
[2016-09-27T13:24:52.926591] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 526, in wrapped
[2016-09-27T13:24:52.927825] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     return f(self, *args, **kw)
[2016-09-27T13:24:52.928734] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 2110, in setRunstate
[2016-09-27T13:24:52.929669] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     self.tdag.writeTaskStatus()
[2016-09-27T13:24:52.930562] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 526, in wrapped
[2016-09-27T13:24:52.931612] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     return f(self, *args, **kw)
[2016-09-27T13:24:52.932358] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 2475, in writeTaskStatus
[2016-09-27T13:24:52.933449] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     forceRename(tmpFile, self.taskStateFile)
[2016-09-27T13:24:52.934858] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]   File "/sw/apps/bioinfo/manta/1.0.0/milou/lib/python/pyflow/pyflow.py", line 170, in forceRename
[2016-09-27T13:24:52.935823] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR]     os.rename(src,dst)
[2016-09-27T13:24:52.936617] [m196.uppmax.uu.se] [61581_1] [TaskManager] [ERROR] OSError: [Errno 2] No such file or directory
[2016-09-27T13:25:07.711376] [m196.uppmax.uu.se] [61581_1] [WorkflowRunner] [ERROR] Workflow terminated due to unhandled exception in TaskManager

I believe doepipeline should have some kind of error-checking feature that detects when an experiment has failed and restarts it. I think it is unlikely that this kind of spontaneous failure is restricted to Manta, so it could be a major issue for the usability of doepipeline across a range of different optimization problems.
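To make the request concrete, here is a minimal sketch of the retry behaviour I have in mind. It is not doepipeline code: `ExperimentFailed`, `run_with_retries` and its parameters are hypothetical names used only for illustration, and in practice the exception to catch would presumably be the `PipelineRunFailed` shown in the traceback above.

```python
import time


class ExperimentFailed(Exception):
    """Hypothetical stand-in for an error such as PipelineRunFailed."""


def run_with_retries(run_experiment, max_attempts=3, delay_seconds=0.0,
                     retry_on=(ExperimentFailed,)):
    """Call run_experiment() until it succeeds or max_attempts is reached.

    run_experiment is any zero-argument callable that raises one of the
    exceptions in retry_on when the experiment fails.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_experiment()
        except retry_on as error:
            last_error = error
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts:
                time.sleep(delay_seconds)

    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Only the failed experiment is rerun here, so experiments in the same iteration that already finished would not need to be repeated. A cap on the number of attempts keeps a genuinely broken parameter setting from being retried forever.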

For the other kind of problem, where the site you are running your experiments at (in this case Uppmax) becomes unavailable, whether due to connection trouble or planned downtime of the resource, I think there needs to be a feature that lets the user resume the optimization after the last completed iteration.
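As a rough illustration of what resuming could look like, the sketch below writes a small checkpoint file after every completed iteration and starts from it on the next run. Everything in it is an assumption made for the example: the function names, the JSON layout and the `run_iteration` callable are hypothetical and do not describe how doepipeline stores its state.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"completed_iterations": 0, "results": []}
    with open(path) as handle:
        return json.load(handle)


def save_checkpoint(path, state):
    """Write the state to a temporary file, then move it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # os.replace overwrites the old checkpoint in a single step.
    os.replace(tmp_path, path)


def optimize(run_iteration, n_iterations, checkpoint_path):
    """Run the iterations that are not yet recorded in the checkpoint.

    run_iteration(i) is a hypothetical callable that runs iteration i and
    returns a JSON-serializable result. If it raises, the checkpoint still
    describes the last iteration that completed.
    """
    state = load_checkpoint(checkpoint_path)
    for iteration in range(state["completed_iterations"], n_iterations):
        result = run_iteration(iteration)
        state["results"].append(result)
        state["completed_iterations"] = iteration + 1
        save_checkpoint(checkpoint_path, state)
    return state["results"]
```

If the connection drops during iteration 2, rerunning `optimize` with the same checkpoint path would skip iterations 0 and 1 and start again at iteration 2.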

/Daniel

@danisven can we close this?

We have implemented a restart function, so yes. Closing.