More user-friendly errors and automatic restarts in case of engines crashing due to OOM

Question

sahil1105 opened this issue 2 years ago · comments

The errors we report for OOM and segmentation faults are now much better, but I was wondering whether there is a way to make them more "user-friendly".

Currently, at least for the MPI case, we report the mpiexec output, which is great. Could we also report a cleaner error alongside it that clearly identifies the crash as an OOM error (or a segfault, if possible)?
Is there something that packages (like Bodo) could do to make this experience better or easier?
What's the best way to automate restarting the engines in this case? Ideally, if enabled, when the engines crash we would clean up the processes, display a message (e.g. "engines crashed due to OOM, restarting engines..."), and then restart the engines.

Min RK · Answer 1 · Thu Apr 21 2022 16:52:39 GMT+0800 (China Standard Time)

I think it's hard to do this in general such that it fits in the base class, but Launchers have two relevant methods:

_log_output, which is called on stop. This is what logs the MPI errors. You can override it in your custom Launcher to further process or parse the output, changing what is logged instead of, or in addition to, the current MPI output.
Launcher.on_stop, which allows registering arbitrary stop callbacks (see the example notebook).

If you already have a custom launcher, you can combine these: call self.on_stop(self.custom_log_message) at the end of .start() so that your own stop handler is always registered.
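
As a rough sketch of what such a stop handler could do, the helper below turns an exit code and the captured mpiexec output into a short message. Everything here is illustrative: describe_stop and the two signal_from_* helpers are hypothetical names, the sketch assumes the stop callback receives a dict with an "exit_code" key, and the "signal N" pattern is a guess at what MPI launchers print, so adjust it to your own logs. Also note that SIGKILL only suggests OOM, it does not prove it.

```python
import re

SIGKILL, SIGSEGV = 9, 11  # POSIX signal numbers


def signal_from_exit_code(exit_code):
    """Return the signal number implied by an exit code, or None."""
    if exit_code is None:
        return None
    if exit_code < 0:  # subprocess convention: -N means "killed by signal N"
        return -exit_code
    if 128 < exit_code <= 128 + 64:  # shell convention: 128 + N
        return exit_code - 128
    return None


def signal_from_output(output):
    """Look for text like 'signal 11' or 'SIGNAL: 9' (assumed log format)."""
    match = re.search(r"signal:?\s+(\d+)", output or "", flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def describe_stop(stop_data, output=""):
    """Build a short, user-facing message for a stopped engine set."""
    exit_code = (stop_data or {}).get("exit_code")
    # Prefer the launcher output, since mpiexec's own exit code may be generic.
    sig = signal_from_output(output) or signal_from_exit_code(exit_code)
    if sig == SIGKILL:
        return (
            "Engines were killed (SIGKILL). This is often the OOM killer, "
            "so the engines likely ran out of memory."
        )
    if sig == SIGSEGV:
        return "Engines crashed with a segmentation fault (SIGSEGV)."
    if exit_code in (0, None):
        return "Engines stopped."
    return f"Engines exited with code {exit_code}."
```

In a custom launcher, self.custom_log_message could then log describe_stop(stop_data, output), where output is whatever text your _log_output override has collected.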

Sahil Gupta · Answer 2 · Wed Apr 27 2022 09:13:28 GMT+0800 (China Standard Time)

Thanks @minrk! Will try this out.

Sahil Gupta · Answer 3 · Wed Apr 27 2022 09:31:03 GMT+0800 (China Standard Time)

@minrk Any feedback on the automatic restart setup?

Min RK · Answer 4 · Wed Apr 27 2022 15:27:55 GMT+0800 (China Standard Time)

Sorry, missed that part. Automatic restart could possibly also be achieved through the on_stop callback. The question becomes whether it makes sense to restart the same engine set vs starting a new one. Restarting in-place would probably feel cleaner, but likely would also make debugging more challenging (e.g. losing handles on the logs for the crashed engines). Starting a new engine set is simpler, because you only need to call cluster.start_engines(n).

I think it's reasonable for restart-on-fail to be a built-in feature for Engine[Set]Launcher, but it should be possible now via on_stop.
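
A minimal sketch of the "start a new engine set" approach, written as a callable that could be registered with on_stop. RestartOnCrash, CRASH_EXIT_CODES, and max_restarts are hypothetical names, not part of ipyparallel. The sketch assumes the callback receives a dict with an "exit_code" key and that SIGKILL/SIGSEGV show up as the listed codes, which may not hold when mpiexec reports its own exit code. It restarts only on those codes so that an intentional stop does not trigger a restart, and it caps the number of restarts to avoid a crash loop.

```python
# Exit codes treated as crashes (assumption): SIGKILL and SIGSEGV in both the
# subprocess (-N) and shell (128 + N) conventions.
CRASH_EXIT_CODES = {-9, 137, -11, 139}


class RestartOnCrash:
    """Stop callback that starts a fresh engine set after a crash."""

    def __init__(self, start_engines, n, max_restarts=3, log=print):
        self.start_engines = start_engines  # callable taking the engine count
        self.n = n
        self.max_restarts = max_restarts
        self.log = log
        self.restarts = 0

    def __call__(self, stop_data):
        exit_code = (stop_data or {}).get("exit_code")
        if exit_code not in CRASH_EXIT_CODES:
            return False  # clean or intentional stop: do nothing
        if self.restarts >= self.max_restarts:
            self.log("engines crashed again; restart limit reached, not restarting")
            return False
        self.restarts += 1
        self.log(f"engines crashed (exit code {exit_code}), restarting engines...")
        self.start_engines(self.n)
        return True


# Possible wiring (untested against a live cluster; engine_launcher stands for
# whichever engine set launcher instance you hold):
#   policy = RestartOnCrash(lambda n: cluster.start_engines(n), n=4)
#   engine_launcher.on_stop(policy)
```

Whether cluster.start_engines(n) can be called directly from the callback, or needs to be scheduled as a coroutine, depends on the ipyparallel version, which is why the sketch takes it as a plain callable. The new engine set would also need its own on_stop registration to be restarted in turn.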

Sahil Gupta · Answer 5 · Fri May 06 2022 11:12:06 GMT+0800 (China Standard Time)

Thanks @minrk! Will try out building restart in a custom launcher.
Will also open a separate issue for built-in restart support.

UPDATE: Opened this issue: #706