CLASSIFICATION : Saved RF models are too big
jiho opened this issue · comments
From a recent test (zooscan_wp2) it seems that the saved models contain a lot of data (this one is 27 GB) and therefore take very long to read.
What is required is only:
- the definition of the RF trees (a couple thousand splits)
- the definition of the PCA projection space (a covariance matrix whose side is the number of features, i.e. ~60x60)
We should investigate what to discard and what to save.
There is a cheap solution, which is to use the `compress` option of the "joblib" library during save. https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html
The decompression would still take time (and may even have been activated
already).
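A minimal sketch of that cheap solution, on synthetic data and with hypothetical file names, comparing a plain `joblib.dump` with a compressed one:

```python
import os

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real training set (~60 features, as above).
X, y = make_classification(n_samples=500, n_features=60, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

joblib.dump(clf, "model.joblib")                  # no compression
joblib.dump(clf, "model.joblib.gz", compress=3)   # zlib, level 3

print("plain:     ", os.path.getsize("model.joblib"))
print("compressed:", os.path.getsize("model.joblib.gz"))
```

As noted, this only trades disk size for decompression time at load; it does not remove anything from the model.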
There is some question whether the basic `RandomForestClassifier` object
stores the training data, which would be large and would explain the size
of the saved objects. If so, it needs to be cut out so that only the model
is stored. It's also possible that our way of training results in very
large models. See scikit-learn/scikit-learn#6276 for a discussion.
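One way to check this question (a sketch on synthetic data, not the real pipeline): pickle each fitted attribute of the classifier separately and see which one dominates. If `estimators_` (the list of fitted trees) is the bulk of the size, the model itself is large, which points to limiting tree growth (e.g. `max_depth` or `min_samples_leaf`) rather than stripping stored data:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=60, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Size of each fitted attribute when pickled, largest first.
sizes = {name: len(pickle.dumps(value)) for name, value in vars(clf).items()}
for name, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {size} bytes")
```

This does not prove anything about the 27 GB file on its own, but it separates "the object drags the training set along" from "the unpruned trees are simply big".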
Worth a study or soon deprecated?
Linked to a now gone function.