ipython / ipyparallel

IPython Parallel: Interactive Parallel Computing in Python

Home Page: https://ipyparallel.readthedocs.io/

More user-friendly errors and automatic restarts in case of engines crashing due to OOM

sahil1105 opened this issue

The errors we report in case of OOM and segmentation faults are now much better, but I was wondering whether there is a way to make them more "user-friendly".

  1. Currently, at least for the MPI case, we report the mpiexec output, which is great, but could we also report a cleaner error alongside it that clearly identifies this as an OOM error (or a seg-fault, if possible)?
  2. Is there something that packages (like Bodo) could do to make this experience better/easier?
  3. What's the best way to automate restarting engines in this case? Ideally, if enabled, when the engines crash we would clean up the processes, display a message (e.g. "engines crashed due to OOM, restarting engines..."), and then restart the engines.

I think it's hard to do this in general such that it fits in the base class, but Launchers have two relevant methods:

  1. _log_output, which is called on stop. This is what logs the MPI errors. You can override it in your custom Launcher to parse or post-process the output and change what is logged, instead of or in addition to the current MPI output.
  2. Launcher.on_stop, which allows registering arbitrary stop callbacks (see the example notebook).

If you already have a custom launcher, you can combine these to add self.on_stop(self.custom_log_message) at the end of .start() to always add your own custom stop handlers.
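As a rough illustration of the kind of post-processing a `_log_output` override or an `on_stop` callback could do, here is a minimal sketch that turns an exit code plus captured output into a friendlier message. It deliberately does not import ipyparallel. The helper names `signal_from_exit_code` and `describe_stop` are hypothetical, the assumption that the stop data is a dict with an `exit_code` key should be checked against your launcher, and the substrings searched for are guesses, since mpiexec and kernel messages differ between MPI implementations and platforms. Note that a SIGKILL only suggests OOM; it does not prove it.

```python
# Signal numbers as used on Linux; an OOM kill arrives as SIGKILL.
SIGKILL = 9
SIGSEGV = 11


def signal_from_exit_code(exit_code):
    """Return the signal number implied by an exit code, or None."""
    if exit_code is None:
        return None
    if exit_code < 0:
        # subprocess convention: -N means "terminated by signal N"
        return -exit_code
    if exit_code > 128:
        # shell convention: 128 + N means "terminated by signal N"
        return exit_code - 128
    return None


def describe_stop(stop_data, output=""):
    """Build a short, user-facing summary of why engines stopped.

    Assumes stop_data is a dict that may contain "exit_code"; the
    substrings searched for in `output` are illustrative guesses.
    """
    exit_code = (stop_data or {}).get("exit_code")
    signum = signal_from_exit_code(exit_code)
    text = output.lower()
    if signum == SIGKILL or "out of memory" in text or "oom-kill" in text:
        return "Engines were killed, most likely because they ran out of memory (OOM)."
    if signum == SIGSEGV or "segmentation fault" in text:
        return "Engines crashed with a segmentation fault."
    return f"Engines stopped with exit code {exit_code}."
```

In a custom launcher, a method such as the `custom_log_message` mentioned above could call `describe_stop` with the stop data it receives and log the result next to the raw mpiexec output.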

Thanks @minrk! Will try this out.

@minrk Any feedback on the automatic restart setup?

Sorry, missed that part. Automatic restart could possibly also be achieved through the on_stop callback. The question becomes whether it makes sense to restart the same engine set vs starting a new one. Restarting in-place would probably feel cleaner, but likely would also make debugging more challenging (e.g. losing handles on the logs for the crashed engines). Starting a new engine set is simpler, because you only need to call cluster.start_engines(n).
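The "start a new engine set" option could be sketched as a stop callback with a restart budget, so that a persistently crashing workload does not restart forever. This is only a sketch under assumptions: `make_restart_callback` and its parameters are hypothetical names, the stop data is assumed to be a dict with an `exit_code` key, and `start_engines` stands for whatever callable actually launches engines. With a real cluster that callable would wrap `cluster.start_engines(n)`, which may be a coroutine depending on the ipyparallel version and would then need to be scheduled on the event loop; that wiring is not shown here.

```python
def make_restart_callback(start_engines, n, max_restarts=3, log=print):
    """Build a stop callback that starts a fresh engine set after a crash.

    start_engines: callable taking the number of engines to launch
                   (hypothetical stand-in for the real cluster call).
    n:             number of engines to start on each restart.
    max_restarts:  give up after this many restarts.
    """
    state = {"restarts": 0}

    def on_engines_stopped(stop_data):
        exit_code = (stop_data or {}).get("exit_code")
        if exit_code == 0:
            # Clean shutdown: nothing to restart.
            return False
        if state["restarts"] >= max_restarts:
            log(f"engines crashed (exit code {exit_code}); restart limit reached")
            return False
        state["restarts"] += 1
        log(
            f"engines crashed (exit code {exit_code}); "
            f"restarting {n} engines (attempt {state['restarts']})"
        )
        start_engines(n)
        return True

    return on_engines_stopped
```

The callback would be registered with `on_stop` on the engine set launcher. Because each restart creates a new engine set with its own launcher, the new launcher would need the callback registered as well for repeated restarts to work.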

I think it's reasonable for restart-on-fail to be a built-in feature for Engine[Set]Launcher, but it should be possible now via on_stop.

Thanks @minrk! Will try out building restart in a custom launcher.
Will also open a separate issue for built-in restart support.


UPDATE: Opened this issue: #706