Stack trace when StartTarget action fails
degremont opened this issue · comments
On shine start, if the start action fails, shine crashes and the following stack trace is displayed:
# shine start -f fsiscsi -t mdt -L
Status of fsiscsi-MDT0000 (/dev/disk/by-id/scsi-1IET_00030001) ...
Status of fsiscsi-MDT0000 (/dev/disk/by-id/scsi-1IET_00030001) succeeded
Load module 'lustre' ...
Load module 'lustre': 'lustre' is already loaded
Load module 'ldiskfs' ...
Load module 'ldiskfs': 'ldiskfs' is already loaded
Start of fsiscsi-MDT0000 (/dev/disk/by-id/scsi-1IET_00030001) ...
Start of fsiscsi-MDT0000 (/dev/disk/by-id/scsi-1IET_00030001) failed
>> mount.lustre: mount /dev/sdb at /mnt/fsiscsi/mdt/0 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
Traceback (most recent call last):
File "/usr/sbin/shine", line 34, in <module>
sys.exit(Controller().run_command())
File "/usr/lib/python2.6/site-packages/Shine/Controller.py", line 254, in run_command
rc = command.filter_rc(command.execute())
File "/usr/lib/python2.6/site-packages/Shine/Commands/Base/FSLiveCommand.py", line 110, in execute
result = max(result, self.execute_fs(fs, fs_conf, eh, vlevel))
File "/usr/lib/python2.6/site-packages/Shine/Commands/Start.py", line 91, in execute_fs
tunings=Tune.get_tuning(fs_conf, fs.components))
File "/usr/lib/python2.6/site-packages/Shine/Lustre/FileSystem.py", line 531, in start
self._run_actions()
File "/usr/lib/python2.6/site-packages/Shine/Lustre/FileSystem.py", line 260, in _run_actions
task_self().resume()
File "/usr/lib/python2.6/site-packages/ClusterShell/Task.py", line 809, in resume
self._resume()
File "/usr/lib/python2.6/site-packages/ClusterShell/Task.py", line 772, in _resume
self._run(self.timeout)
File "/usr/lib/python2.6/site-packages/ClusterShell/Task.py", line 411, in _run
self._engine.run(timeout)
File "/usr/lib/python2.6/site-packages/ClusterShell/Engine/Engine.py", line 677, in run
self.runloop(timeout)
File "/usr/lib/python2.6/site-packages/ClusterShell/Engine/EPoll.py", line 178, in runloop
self.remove_stream(client, stream)
File "/usr/lib/python2.6/site-packages/ClusterShell/Engine/Engine.py", line 471, in remove_stream
self.remove(client)
File "/usr/lib/python2.6/site-packages/ClusterShell/Engine/Engine.py", line 454, in remove
self._remove(client, abort, did_timeout)
File "/usr/lib/python2.6/site-packages/ClusterShell/Engine/Engine.py", line 442, in _remove
client._close(abort=abort, timeout=did_timeout)
File "/usr/lib/python2.6/site-packages/ClusterShell/Worker/Popen.py", line 104, in _close
self.worker.eh.ev_close(self.worker)
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 425, in ev_close
self.set_status(ACT_ERROR)
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 156, in set_status
action.launch()
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 218, in launch
self._launch()
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 272, in _launch
if not self._graph_ok(self._members):
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 188, in _graph_ok
self.set_status(ACT_ERROR)
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 156, in set_status
action.launch()
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 218, in launch
self._launch()
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 272, in _launch
if not self._graph_ok(self._members):
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 177, in _graph_ok
dep.launch()
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 214, in launch
if not self._graph_ok(self.deps):
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Action.py", line 188, in _graph_ok
self.set_status(ACT_ERROR)
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Actions/Tune.py", line 109, in set_status
self._server.action_event(self, 'failed', result)
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Server.py", line 184, in action_event
self._del_action(act.NAME)
File "/usr/lib/python2.6/site-packages/Shine/Lustre/Server.py", line 167, in _del_action
self._running_actions.remove(act)
ValueError: list.remove(x): x not in list
This stack trace comes from the event handling mechanism brought into the Tune action by fix for ticket #50 (commit f1d5adaadf4faaa41f3f813d865dbc0951120f35)
Method set_status() calls self._server.action_event('failed') which in turn calls Server._del_action().
The problem is there has never been a call to Server.action_event('start') before,
so Server._running_actions is empty. Hence the exception.
A quick fix could be to test existence of the action before removing it, or enclose it in a try..except block.
However, I suspect a dirty behaviour in Tune.set_status() (indeed, it is the only class calling action_event() inside set_status()), but I can't find out what would be the correct behaviour here.
Regards.
Reported by: theryf
I made a mistake in the ticket title, this should be:
"Stack trace on shine start when StartTarget action fails"
Could you please change it for me, as it seems I don't have sufficient privileges to do it myself.
Thanks
Original comment by: theryf
- summary: Stack trace when executing shine update --> Stack trace when StartTarget action fails
- status: new --> open
Original comment by: degremont
The start event should have been sent by Tune._launch(). I'm trying to understand why this is not the case in your stack trace.
Reproduce this stack with -vv could be interesting.
Original comment by: degremont
The -vv option does not give more details except the command run by shine.
The stack trace is exactly the same, and does not provide any hints about why
Tune._launch() is not called before the call to self._run_actions() inside FileSystem.start().
It is quite hard to follow the recursive callflow of _graph_ok/launch/_launch triggered by the first actions.launch() in FileSystem.start().
Original comment by: theryf
The -vv option does not give more details except the command run by shine.
It was exactly what I was interested in.
Original comment by: degremont
Here is the fix. I will be working on a more complete version of this patch.
diff --git a/lib/Shine/Lustre/Actions/Tune.py b/lib/Shine/Lustre/Actions/Tune.py
index f52e5cc..14ab1f3 100644
--- a/lib/Shine/Lustre/Actions/Tune.py
+++ b/lib/Shine/Lustre/Actions/Tune.py
@@ -97,16 +97,17 @@ class Tune(ActionGroup):
If this is a final state, raise corresponding events.
"""
- if status == ACT_OK:
- self._server.action_event(self, 'done')
- elif status == ACT_ERROR:
- # Build an error string
- errors = []
- for act in self:
- if act.status() == ACT_ERROR:
- errors.append("'%s' failed" % act._command)
- result = ErrorResult("\n".join(errors))
- self._server.action_event(self, 'failed', result)
+ if self._init:
+ if status == ACT_OK:
+ self._server.action_event(self, 'done')
+ elif status == ACT_ERROR:
+ # Build an error string
+ errors = []
+ for act in self:
+ if act.status() == ACT_ERROR:
+ errors.append("'%s' failed" % act._command)
+ result = ErrorResult("\n".join(errors))
+ self._server.action_event(self, 'failed', result)
~~~~
Original comment by: degremont
- status: open --> closed
- assigned_to: Aurélien Degrémont
- Resolution: --> fixed
Original comment by: degremont
Fixed in [a136d1]
Original comment by: degremont