Patches for Lightning have not kept up with backwards-incompatible changes
a-gardner1 opened this issue
Describe the bug
While there has clearly been some effort to keep pace with changes to Lightning (see #1033), the patches have fallen behind since they were first created (64e10b2) as new versions of Lightning were released. Unfortunately, ClearML silently fails to apply its patches to model saving and restoration, which can hide the fact that model logging doesn't fully work as expected. One of the two related (and nearly duplicate) patch methods is shown below (linked here):
```python
@staticmethod
def _patch_pytorch_lightning_io():
    if PatchPyTorchModelIO.__patched_pytorch_lightning:
        return

    if 'pytorch_lightning' not in sys.modules:
        return

    PatchPyTorchModelIO.__patched_pytorch_lightning = True

    # noinspection PyBroadException
    try:
        import pytorch_lightning  # noqa

        pytorch_lightning.trainer.Trainer.save_checkpoint = _patched_call(
            pytorch_lightning.trainer.Trainer.save_checkpoint, PatchPyTorchModelIO._save)  # noqa
        pytorch_lightning.trainer.Trainer.restore = _patched_call(
            pytorch_lightning.trainer.Trainer.restore, PatchPyTorchModelIO._load_from_obj)  # noqa
    except ImportError:
        pass
    except Exception:
        pass

    # noinspection PyBroadException
    try:
        import pytorch_lightning  # noqa

        # noinspection PyUnresolvedReferences
        pytorch_lightning.trainer.connectors.checkpoint_connector.CheckpointConnector.save_checkpoint = \
            _patched_call(
                pytorch_lightning.trainer.connectors.checkpoint_connector.CheckpointConnector.save_checkpoint,
                PatchPyTorchModelIO._save)  # noqa
        # noinspection PyUnresolvedReferences
        pytorch_lightning.trainer.connectors.checkpoint_connector.CheckpointConnector.restore = \
            _patched_call(
                pytorch_lightning.trainer.connectors.checkpoint_connector.CheckpointConnector.restore,
                PatchPyTorchModelIO._load_from_obj)  # noqa
    except ImportError:
        pass
    except Exception:
        pass
```
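The silent failure mode can be illustrated with a self-contained sketch (the `Trainer` class and `_patched_call` below are simplified stand-ins for the real pytorch-lightning and ClearML objects, not their actual implementations). Because both assignments sit in one `try` block, `save_checkpoint` gets patched, the `Trainer.restore` lookup raises `AttributeError`, and the bare `except Exception: pass` makes the half-applied patch invisible:

```python
import functools


def _patched_call(original_fn, patched_fn):
    # Simplified stand-in for ClearML's _patched_call wrapper.
    @functools.wraps(original_fn)
    def _inner(*args, **kwargs):
        patched_fn(original_fn, *args, **kwargs)
        return original_fn(*args, **kwargs)
    return _inner


class Trainer:
    # Models pytorch-lightning >= 0.10.0: save_checkpoint exists, restore does not.
    def save_checkpoint(self, path):
        return path


def patch_trainer():
    log = []
    try:
        Trainer.save_checkpoint = _patched_call(
            Trainer.save_checkpoint, lambda fn, *a, **k: log.append("save"))
        # AttributeError: Trainer has no attribute 'restore' -- raised on the
        # right-hand side, so the line below never completes...
        Trainer.restore = _patched_call(
            Trainer.restore, lambda fn, *a, **k: log.append("load"))
    except Exception:
        pass  # ...and the AttributeError disappears here, hiding the broken patch
    return log
```

Running `patch_trainer()` raises nothing: `save_checkpoint` is wrapped, `restore` is not, and nothing reports the difference.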
Three AttributeErrors exist in `_patch_pytorch_lightning_io` with newer versions of pytorch-lightning:

- In pytorch-lightning 0.10.0, `Trainer.restore` was removed when `CheckpointConnector` was introduced and the `restore` method was no longer inherited from `TrainerIOMixin` (Lightning-AI/pytorch-lightning@4724cdf)
- In pytorch-lightning 2.0.0, `CheckpointConnector` was renamed to `_CheckpointConnector` (Lightning-AI/pytorch-lightning#17008)
- In pytorch-lightning 2.1.0, `_CheckpointConnector.save_checkpoint` was removed and inlined into `Trainer` (Lightning-AI/pytorch-lightning#17408 (comment))
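Since the patch target has moved across these three releases, one way to avoid the silent failures (a sketch only, not ClearML's actual fix; `FakeTrainer` and `_wrap` below are hypothetical stand-ins) is to probe the candidate attributes before patching and emit a warning when none of the known targets exists, so the next Lightning API change surfaces instead of being swallowed:

```python
import logging

logger = logging.getLogger(__name__)


def _wrap(fn, callback):
    # Minimal stand-in for ClearML's _patched_call.
    def _inner(*args, **kwargs):
        callback()
        return fn(*args, **kwargs)
    return _inner


def patch_first_existing(owner, method_names, callback):
    """Patch the first method in `method_names` that `owner` actually has.

    Returns the patched name, or None (with a warning) if nothing matched,
    so a Lightning API change is visible rather than silently ignored.
    """
    for name in method_names:
        original = getattr(owner, name, None)
        if callable(original):
            setattr(owner, name, _wrap(original, callback))
            return name
    logger.warning("none of %s found on %r; model auto-logging disabled",
                   method_names, owner)
    return None


# Dummy class emulating a modern Trainer: save_checkpoint exists, restore does not.
class FakeTrainer:
    def save_checkpoint(self, path):
        return path
```

With this shape, probing `["restore", "_restore"]` on a class that has neither returns `None` and logs a warning instead of raising (or hiding) an `AttributeError`.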
To reproduce
No reproduction is necessary. There are multiple clear `AttributeError`s that get caught by the `Exception` handler, depending on the `pytorch-lightning` version.
Expected behaviour
The checkpointing mechanism of `pytorch-lightning` should have been patched to enable automatic logging of models with ClearML.
Environment
- Server type: self hosted
- ClearML SDK Version: 1.15.1
- ClearML Server Version (Only for self hosted): WebApp: 1.11.0-000 • Server: 1.12.0- • API: 2.26
- Python Version: 3.11
- OS: Linux
Thanks for letting us know @a-gardner1. We'll take a look and update on fix availability.
I can open a PR with a proposed fix if you like. I've already implemented one.
Contributions are most welcome @a-gardner1 🙂