allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Home Page: https://clear.ml/docs


Patches for Lightning have not kept up with backwards-incompatible changes

a-gardner1 opened this issue

Describe the bug

While there has clearly been some effort to keep pace with changes to Lightning (see #1033), the patches have fallen behind the Lightning releases made since they were first introduced (64e10b2). ClearML now silently fails to patch model saving and restoration, which hides the fact that automatic model logging does not work as expected. One of the two related (and nearly duplicate) patch methods is shown below (linked here):

    @staticmethod
    def _patch_pytorch_lightning_io():
        if PatchPyTorchModelIO.__patched_pytorch_lightning:
            return

        if 'pytorch_lightning' not in sys.modules:
            return

        PatchPyTorchModelIO.__patched_pytorch_lightning = True

        # noinspection PyBroadException
        try:
            import pytorch_lightning  # noqa

            pytorch_lightning.trainer.Trainer.save_checkpoint = _patched_call(
                pytorch_lightning.trainer.Trainer.save_checkpoint, PatchPyTorchModelIO._save)  # noqa

            pytorch_lightning.trainer.Trainer.restore = _patched_call(
                pytorch_lightning.trainer.Trainer.restore, PatchPyTorchModelIO._load_from_obj)  # noqa
        except ImportError:
            pass
        except Exception:
            pass

        # noinspection PyBroadException
        try:
            import pytorch_lightning  # noqa

            # noinspection PyUnresolvedReferences
            pytorch_lightning.trainer.connectors.checkpoint_connector.CheckpointConnector.save_checkpoint = \
                _patched_call(
                    pytorch_lightning.trainer.connectors.checkpoint_connector.CheckpointConnector.save_checkpoint,
                    PatchPyTorchModelIO._save)  # noqa

            # noinspection PyUnresolvedReferences
            pytorch_lightning.trainer.connectors.checkpoint_connector.CheckpointConnector.restore = \
                _patched_call(
                    pytorch_lightning.trainer.connectors.checkpoint_connector.CheckpointConnector.restore,
                    PatchPyTorchModelIO._load_from_obj)  # noqa
        except ImportError:
            pass
        except Exception:
            pass
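A more defensive variant of the method above could resolve each dotted attribute path step by step and warn when a target no longer exists, instead of letting a bare except swallow everything. The sketch below is hypothetical, not ClearML's actual code: safe_patch is an invented helper, and _patched_call is a simplified stand-in for ClearML's real wrapper.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def _patched_call(original, extra):
    # Simplified stand-in for ClearML's helper: run `extra` before `original`.
    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        extra(*args, **kwargs)
        return original(*args, **kwargs)
    return wrapper


def safe_patch(root, dotted_path, extra):
    """Resolve `dotted_path` under `root` and wrap it with `extra`.

    Returns True on success; warns and returns False if any attribute
    along the path is missing (e.g. removed in a newer Lightning release),
    instead of silently passing.
    """
    *parents, attr = dotted_path.split(".")
    target = root
    for name in parents:
        target = getattr(target, name, None)
        if target is None:
            logger.warning("cannot patch %s: %r not found", dotted_path, name)
            return False
    original = getattr(target, attr, None)
    if original is None:
        logger.warning("cannot patch %s: attribute missing", dotted_path)
        return False
    setattr(target, attr, _patched_call(original, extra))
    return True
```

With such a helper, each target (e.g. "trainer.Trainer.save_checkpoint") would be patched independently, so one removed attribute could no longer abort the remaining patches in the same try block.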

Three AttributeErrors arise in _patch_pytorch_lightning_io when it runs against newer versions of pytorch-lightning, and all of them are silently swallowed by the bare except Exception handlers.

To reproduce

No reproduction steps are necessary: depending on the installed pytorch-lightning version, the code raises one or more obvious AttributeErrors, each of which is caught and discarded by the except Exception handler.
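The failure mode can be demonstrated without pytorch-lightning installed. The snippet below is a minimal stand-in, not ClearML's code: FakeTrainer mimics a Trainer class whose restore attribute was removed upstream, and apply_patches mirrors the structure of _patch_pytorch_lightning_io, where several patches share one try block.

```python
def _wrap(fn):
    # Minimal wrapper standing in for _patched_call.
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


class FakeTrainer:
    # Stand-in for a Trainer whose `restore` method no longer exists.
    def save_checkpoint(self, path):
        return "saved"


patched = []


def apply_patches(trainer_cls):
    # Mirrors _patch_pytorch_lightning_io: multiple patches in one try block.
    try:
        trainer_cls.save_checkpoint = _wrap(trainer_cls.save_checkpoint)
        patched.append("save_checkpoint")
        # AttributeError: `restore` is gone, so this line raises ...
        trainer_cls.restore = _wrap(trainer_cls.restore)
        patched.append("restore")
    except Exception:
        pass  # ... and the error (plus any later patches) vanishes silently


apply_patches(FakeTrainer)
```

After apply_patches runs, only save_checkpoint is patched; the AttributeError on restore is invisible to the user, which is exactly why the missing patches go unnoticed.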

Expected behaviour

The checkpointing mechanism of pytorch-lightning should have been patched to enable automatic logging of models with ClearML.

Environment

  • Server type: self hosted
  • ClearML SDK Version: 1.15.1
  • ClearML Server Version (Only for self hosted): WebApp: 1.11.0-000 • Server: 1.12.0- • API: 2.26
  • Python Version: 3.11
  • OS: Linux

Thanks for letting us know @a-gardner1. We'll take a look and update on fix availability.

I can open a PR with a proposed fix if you like; I've already implemented one.

Contributions are most welcome @a-gardner1 🙂