skops-dev / skops

skops is a Python library helping you share your scikit-learn based models and put them in production

Home Page: https://skops.readthedocs.io/en/stable/


[Question] Persistence of custom models

mbignotti opened this issue

Hi!
Thanks a lot for the nice library!

I see that, in the documentation, you write:

At the moment, skops cannot persist arbitrary Python code. This means if you have custom functions (say, a custom function to be used with sklearn.preprocessing.FunctionTransformer), it will not work.

I totally understand, since this is far from trivial.

However, I'm wondering if you have any plans on this topic.

I often find myself writing either custom sklearn models, by inheriting from sklearn.base.BaseEstimator, or wrappers around existing sklearn models. This, of course, complicates model persistence when you want to use the model in another environment without installing the package where the source code lives.

One solution is to bundle the model with the source code, but I have never found an easy and clean way to do it.

At the moment, I'm taking advantage of the experimental register_pickle_by_value function provided by cloudpickle. It works for simple use cases, but I don't fully understand what it's doing or all the cases where it might break.

I would like to ask if you think that, in the future, skops might help in this case as well.

Thanks a lot!

Persisting arbitrary Python code is slightly different from persisting a class defined in __main__. You can already use skops.io to persist those classes, but when you load the model, those classes need to be present in __main__ again. That means you need a script which first defines those classes and then loads the model, which should be doable if you're the only user; but if you're the only user, you can also just keep using cloudpickle as you already do.
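For illustration, a minimal sketch of that pattern (the exact skops.io API, in particular how trusted types are handled, may differ across versions):

```python
# Sketch: a class defined in __main__ can be persisted with skops.io, but it
# must be defined again in __main__ before the model can be loaded.
import skops.io as sio
from sklearn.base import BaseEstimator

class MyEstimator(BaseEstimator):  # lives in __main__
    def predict(self, X):
        return X

sio.dump(MyEstimator(), "model.skops")

# Load side (a script that defines MyEstimator exactly as above, first):
# types outside the default allow-list are untrusted; inspect the list
# before passing it to `trusted` rather than trusting it blindly.
untrusted = sio.get_untrusted_types(file="model.skops")
model = sio.load("model.skops", trusted=untrusted)
```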

The cleanest way for you would be to put your custom estimators in a package, and then use your models from there. That would spare you cloudpickle entirely, and, if you ever need to, it would also let you ship the estimators to others in a neat way.

But if your requirement is to load the model somewhere that you can't install any packages, then that's tricky. We definitely plan to investigate how we could support this, but it's a rather tricky subject since it opens the door to all sorts of exploits.

But if your requirement is to load the model somewhere that you can't install any packages, then that's tricky.

Unfortunately, at the moment, this is the requirement. Just to give a little bit of context, let me explain why I raised this issue.

Our current workflow is as follows:

  • The data scientist trains a model inside a project-related repository, where they might need to write their own custom model.
  • The model is then serialized with cloudpickle and packaged into a zip file.
  • Later on, the model is loaded inside a home-made inference engine, which might also be deployed on embedded devices.
  • We would like to avoid having to install the data scientist's project repository inside the runtime, as it might contain a lot of code unrelated to the model and could cause a dependency management nightmare (in some cases, we cannot use Docker). Also, this repository might be different for each project.
  • That's why, at the moment, the data scientist has to call register_pickle_by_value(my_package) before saving the pickle, where my_package is the package where the custom model is defined (see the sketch after this list). This allows us to load the pickle file inside the runtime without installing the package, but it's not very robust and might break in multiple ways.
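To make that step concrete, a minimal sketch of the save side; my_package, CustomModel, and the training data are hypothetical placeholders:

```python
# Sketch of the current save-side workflow; `my_package` / `CustomModel`
# stand in for the data scientist's actual package and model class, and
# X_train / y_train are assumed to exist.
import cloudpickle
import my_package  # hypothetical package defining the custom model

# Serialize the package's code by value instead of by reference, so the
# pickle can later be loaded in a runtime where my_package is not installed.
cloudpickle.register_pickle_by_value(my_package)

model = my_package.CustomModel().fit(X_train, y_train)
with open("model.pkl", "wb") as f:
    cloudpickle.dump(model, f)
```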

For these reasons, I've started looking for better alternatives.

In any case, thanks a lot for the quick answer!!

What you describe seems like a very non-trivial issue.

  • We would like to avoid having to install the data scientist's project repository inside the runtime, as it might contain a lot of code unrelated to the model and could cause a dependency management nightmare

It would be nice if the models could be factored out into their own package with only minimal dependencies. That package could be hosted on a private package index which is accessible from the embedded device. It would mean extra work for the data scientist, but maybe it's an option. However, in that case pickle would also be an option, as there isn't much benefit to using skops.
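As an illustration of what that factoring could look like, a model-only package might contain little more than the estimator itself (all names here are hypothetical):

```python
# Hypothetical contents of my_models/estimators.py, in a small package whose
# only dependency is scikit-learn; hosting it on a private index lets the
# runtime `pip install my_models` and resolve models by regular import path.
from sklearn.base import BaseEstimator, TransformerMixin

class ClippingTransformer(TransformerMixin, BaseEstimator):
    """Toy stand-in for a real custom estimator."""

    def __init__(self, low=0.0, high=1.0):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Clip values into [low, high]; works on numpy arrays and DataFrames.
        return X.clip(self.low, self.high)
```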

It would be nice if the models could be factored out into their own package with only minimal dependencies

Yes, we will probably start looking into this. The definitive solution would be getting rid of Python altogether with something like ONNX, but then it becomes very hard to support custom models written in pure Python.
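For reference, the ONNX route looks roughly like this for a supported estimator, using the skl2onnx converter; as noted above, it does not cover arbitrary custom Python code:

```python
# Sketch: export a supported sklearn estimator to ONNX via skl2onnx.
# A custom pure-Python estimator has no registered converter and would fail.
import numpy as np
from sklearn.linear_model import LogisticRegression
from skl2onnx import to_onnx

X = np.random.rand(20, 4).astype(np.float32)
y = (X.sum(axis=1) > 2.0).astype(np.int64)
model = LogisticRegression().fit(X, y)

onx = to_onnx(model, X[:1])  # input types are inferred from the sample
with open("model.onnx", "wb") as f:
    f.write(onx.SerializeToString())
```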

Thanks again for the answers and for the clarification!

So in this case, you're basically trusting your data scientists and allowing them to run whatever code they want in your production environment. If that's the case, I guess it would be useful for you to have a workflow where your DS would sign the pickle files, and then you'd at least check the signature before loading.
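One simple way to implement that signing step, sketched with the standard library's hmac module and a shared secret (all names here are illustrative; this is not a skops feature):

```python
# Illustrative sign/verify sketch using only the stdlib. Assumes a secret
# shared between the data scientist and the runtime, distributed out of band.
import hashlib
import hmac

SECRET = b"shared-secret-key"  # placeholder; keep real secrets out of code

def sign(path: str) -> str:
    with open(path, "rb") as f:
        return hmac.new(SECRET, f.read(), hashlib.sha256).hexdigest()

def verify(path: str, signature: str) -> bool:
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(sign(path), signature)

# The DS ships model.pkl together with sign("model.pkl"); the runtime only
# unpickles after verify("model.pkl", signature) returns True.
```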

The skops.io module is not really intended to support more than what pickle does. You have a somewhat dangerous workflow ;) and it goes in the opposite direction of what we're trying to achieve with this persistence format.

But I really appreciate you trying it out.

You have a somewhat dangerous workflow

We're at the early stages of development and our solution has been used only for demos so far, not real production environments. We're fully aware that the workflow is dangerous, and that's why we're looking for alternatives :)

If that's the case, I guess it would be useful for you to have a workflow where your DS would sign the pickle files, and then you'd at least check the signature before loading.

This is loosely checked. But as long as the data scientist's code respects the expected input and output, they can do whatever they want in between. Still, the code that lives in between must be persisted somehow.

In any case, I really appreciate your suggestions!

In terms of handling dependencies in production and in your Docker images/environments, it might be useful for you to have a look at how we do it: https://github.com/huggingface/api-inference-community/tree/main/docker_images/sklearn