Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.

Home Page: https://lightning.ai

KeyboardInterrupt raises an exception which results in a zero exit code

amarckal opened this issue

Bug description

During training, whenever there is a keyboard interrupt, the fit loop raises a SIGTERMException:

if trainer.received_sigterm:
    raise SIGTERMException

This results in an exit code of 0. Other scripts that rely on the training script's exit code then treat the run as if it had exited normally.

The issue comes from here:

class SIGTERMException(SystemExit):
    """Exception used when a :class:`signal.SIGTERM` is sent to a process.

    This exception is raised by the loops at specific points. It can be used to write custom logic in the
    :meth:`lightning.pytorch.callbacks.callback.Callback.on_exception` method.

    For example, you could use the :class:`lightning.pytorch.callbacks.fault_tolerance.OnExceptionCheckpoint` callback
    that saves a checkpoint for you when this exception is raised.
    """

Raising a SystemExit in Python without specifying an exit code leaves the code set to None, which is converted to 0. The fix would be to have:

class SIGTERMException(SystemExit):
    """Exception used when a :class:`signal.SIGTERM` is sent to a process.

    This exception is raised by the loops at specific points. It can be used to write custom logic in the
    :meth:`lightning.pytorch.callbacks.callback.Callback.on_exception` method.

    For example, you could use the :class:`lightning.pytorch.callbacks.fault_tolerance.OnExceptionCheckpoint` callback
    that saves a checkpoint for you when this exception is raised.

    """
    code = 128 + 15  # see https://tldp.org/LDP/abs/html/exitcodes.html
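To check the claim without Lightning installed, here is a minimal sketch that compares the two variants in a child interpreter. The class names `BareExit` and `ExitWithCode` are hypothetical stand-ins for `SIGTERMException`, not Lightning code, and the behavior shown assumes CPython, which reads the `code` attribute of the exception when the interpreter exits.

```python
import subprocess
import sys

# Child program: a SystemExit subclass raised without any exit code,
# mirroring how SIGTERMException is currently defined.
BARE = """
class BareExit(SystemExit):  # hypothetical stand-in
    pass

raise BareExit
"""

# Child program: the same subclass with a class-level `code`,
# mirroring the proposed fix (128 + signal number 15 for SIGTERM).
WITH_CODE = """
class ExitWithCode(SystemExit):  # hypothetical stand-in
    code = 128 + 15

raise ExitWithCode
"""


def exit_code(source: str) -> int:
    """Run `source` in a fresh interpreter and return its exit code."""
    return subprocess.run([sys.executable, "-c", source]).returncode


bare_rc = exit_code(BARE)
coded_rc = exit_code(WITH_CODE)
print(bare_rc, coded_rc)  # expected on CPython: 0 143
```

The class attribute works because it shadows the `code` member that `SystemExit` defines, so every instance reports 143 unless the exception is constructed with an explicit argument in a subclass that overrides this again.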

What version are you seeing the problem on?

v2.0, v2.1, v2.2, master

How to reproduce the bug

Start a training run, send it a keyboard interrupt signal, and then run echo $? to see the exit code.
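For a reproduction that does not need a real training job, the following sketch imitates the pattern described above: a signal handler records SIGTERM in a flag, and a loop raises a bare `SystemExit` subclass once it sees the flag. Everything here is an assumption made for illustration: `StandInSigtermException`, `received_sigterm` as a module-level flag, and the loop itself are stand-ins rather than Lightning internals, and the self-sent signal assumes a POSIX system.

```python
import subprocess
import sys

# Stand-in for a training script (not Lightning code).
CHILD = """
import os
import signal
import time


class StandInSigtermException(SystemExit):  # hypothetical, no exit code set
    pass


received_sigterm = False


def handler(signum, frame):
    # Only record the signal; the loop decides when to stop.
    global received_sigterm
    received_sigterm = True


signal.signal(signal.SIGTERM, handler)
os.kill(os.getpid(), signal.SIGTERM)  # simulates an external `kill <pid>` (POSIX)

for step in range(1000):
    if received_sigterm:
        print("saw SIGTERM at step", step)
        raise StandInSigtermException
    time.sleep(0.01)
"""

result = subprocess.run(
    [sys.executable, "-c", CHILD], capture_output=True, text=True
)
print(result.stdout.strip())
print("exit code:", result.returncode)
```

The child was interrupted, yet the parent sees the same exit code as a clean run, which is what `echo $?` reports in the steps above.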