Design Model class that uses PandasDataset and encapsulates sklearn models.

Question

macks22 opened this issue 9 years ago · comments

Given a model that conforms to the scikit-learn estimator interface and a Dataset with well-defined pre-processing and train/test splitting, we should be able to produce predictions for some or all of the test sets.

More generally, given a model that conforms to a known interface and a Dataset as described above, we should be able to predict for some or all test sets: we simply train, then predict, using the known model interface. A runner method can let the caller specify which of the available test sets to predict for, and because each train/test pair is independent, the models can be trained in parallel. Later, we could extend this idea to sequential/online learning and prediction: train on the first train set, predict on its test set, then update the previously learned model on the next train set before predicting on the next test set.
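As a rough illustration of the train-then-predict idea, here is a minimal sketch. It assumes the splits arrive as a plain dict mapping a name to an (X_train, y_train, X_test) tuple, which is a stand-in and not the actual PandasDataset API. The function names (fit_predict_split, run_splits, run_sequential) are hypothetical. The only thing assumed about the model is the estimator interface (fit, predict, and partial_fit for the online variant).

```python
import copy
from concurrent.futures import ThreadPoolExecutor


def fit_predict_split(model, split):
    """Train a fresh copy of `model` on one split and predict on its test set."""
    X_train, y_train, X_test = split
    fitted = copy.deepcopy(model)  # leave the template estimator untouched
    fitted.fit(X_train, y_train)
    return fitted.predict(X_test)


def run_splits(model, splits, which=None, max_workers=None):
    """Predict for some or all test sets.

    `splits` is a hypothetical dict: name -> (X_train, y_train, X_test).
    `which` lists the split names to run; None means all of them.
    """
    names = list(splits) if which is None else list(which)
    # Each split is independent, so the fits can be submitted concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(fit_predict_split, model, splits[name])
            for name in names
        }
        return {name: future.result() for name, future in futures.items()}


def run_sequential(model, ordered_splits):
    """Online variant: update one model on each train set, then predict."""
    fitted = copy.deepcopy(model)
    predictions = []
    for X_train, y_train, X_test in ordered_splits:
        # Requires an estimator with partial_fit. Some scikit-learn
        # classifiers also need a `classes` argument on the first call.
        fitted.partial_fit(X_train, y_train)
        predictions.append(fitted.predict(X_test))
    return predictions
```

Threads are used here only to keep the sketch dependency-free. For CPU-bound estimators, a process pool would probably be the better fit.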

For now, to keep things simple and avoid over-engineering, we should implement a simple Model base class and a single SklearnModel subclass.

The Model class itself can reasonably be expected to deal with high-level operations such as:

save/load for model parameters
high-level runner method(s)
CLI parser method(s)

...while the SklearnModel should also specify the train/predict loop logic. This can be adapted from sklearn_model in the scaffold module.
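One possible shape for this split of responsibilities is sketched below. Everything here is an assumption for discussion: the method names (train, predict, run, save, load, make_parser), the CLI flags, and the dict-of-tuples stand-in for the Dataset are all hypothetical, and the code is not adapted from the existing sklearn_model in the scaffold module. Pickle is used for save/load only because it is in the standard library, and it stores the whole object instead of just the parameters.

```python
import argparse
import pickle
from abc import ABC, abstractmethod


class Model(ABC):
    """Base class: save/load, a high-level runner, and a CLI parser."""

    @abstractmethod
    def train(self, X, y):
        """Fit the model on one train set."""

    @abstractmethod
    def predict(self, X):
        """Return predictions for one test set."""

    def run(self, splits, which=None):
        """Train and predict for some or all splits.

        `splits` is a hypothetical dict: name -> (X_train, y_train, X_test).
        """
        names = list(splits) if which is None else list(which)
        results = {}
        for name in names:
            X_train, y_train, X_test = splits[name]
            self.train(X_train, y_train)
            results[name] = self.predict(X_test)
        return results

    def save(self, path):
        """Pickle the whole model object to `path`."""
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @classmethod
    def load(cls, path):
        """Load a model previously written by `save`."""
        with open(path, "rb") as f:
            obj = pickle.load(f)
        if not isinstance(obj, cls):
            raise TypeError(f"expected {cls.__name__}, got {type(obj).__name__}")
        return obj

    @classmethod
    def make_parser(cls):
        """Build a CLI parser; the flags are illustrative only."""
        parser = argparse.ArgumentParser(description=f"Run {cls.__name__}")
        parser.add_argument("--test-sets", nargs="*", default=None,
                            help="names of test sets to predict for (default: all)")
        parser.add_argument("--save-path", default=None,
                            help="where to save the trained model")
        return parser


class SklearnModel(Model):
    """Wraps any object exposing the scikit-learn fit/predict interface."""

    def __init__(self, estimator):
        self.estimator = estimator

    def train(self, X, y):
        # scikit-learn's fit() discards any previous fit, so each split
        # in run() starts from scratch.
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)
```

With scikit-learn installed, usage could look like `SklearnModel(LinearRegression()).run(splits, which=["fold0"])`, where `splits` and the fold name are placeholders.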