CLASSIFICATION : Saved RF models are too big
jiho opened this issue · comments
From a recent test (zooscan_wp2) it seems that the saved models contain a lot of data (this one is 27 GB) and therefore take very long to read.
What is required is only:
- the definition of the RF trees (a couple thousand splits)
- the definition of the PCA projection space (a covariance matrix whose side is the number of features, i.e. ~60x60)
We should investigate what to discard and what to save.
There is a cheap solution, which is to use the `compress` option of the "joblib" library during save. https://joblib.readthedocs.io/en/latest/generated/joblib.dump.html
The decompression would still take time (and may even have been activated
already).
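A minimal sketch of that cheap solution, on synthetic data and with hypothetical file names, comparing a plain `joblib.dump` with a compressed one:

```python
import os

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real training set (~60 features, as above).
X, y = make_classification(n_samples=500, n_features=60, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

joblib.dump(clf, "model.joblib")                  # no compression
joblib.dump(clf, "model.joblib.gz", compress=3)   # zlib, level 3

print("plain:     ", os.path.getsize("model.joblib"))
print("compressed:", os.path.getsize("model.joblib.gz"))
```

As noted, this only trades disk size for decompression time at load; it does not remove anything from the model.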
There is some question whether the basic `RandomForestClassifier` object
stores the training data, which would be large and would explain the size
of the saved objects. If so, it needs to be cut out so that only the model
is stored. It's also possible that our way of training results in very
large models. See scikit-learn/scikit-learn#6276 for a discussion.
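One way to check this question (a sketch on synthetic data, not the real pipeline): pickle each fitted attribute of the classifier separately and see which one dominates. If `estimators_` (the list of fitted trees) is the bulk of the size, the model itself is large, which points to limiting tree growth (e.g. `max_depth` or `min_samples_leaf`) rather than stripping stored data:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=60, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Size of each fitted attribute when pickled, largest first.
sizes = {name: len(pickle.dumps(value)) for name, value in vars(clf).items()}
for name, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {size} bytes")
```

This does not prove anything about the 27 GB file on its own, but it separates "the object drags the training set along" from "the unpruned trees are simply big".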
Worth a study or soon deprecated?
Linked to a now gone function.