angus924 / minirocket

MINIROCKET: A Very Fast (Almost) Deterministic Transform for Time Series Classification


Can't set random_state when doing a GridSearchCV

StijnBr opened this issue · comments

Dependencies

import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from sktime.datasets import load_basic_motions
from sktime.transformations.panel.rocket import MiniRocketMultivariate

Make train/test split and set up pipeline

X_train, y_train = load_basic_motions(split="train", return_X_y=True)

model = Pipeline([
    ('minirocket', MiniRocketMultivariate(random_state=42)), 
    ('ridge_clf', RidgeClassifier(random_state=42)),
])

Fit 1 model

model.fit(X_train, y_train)
Works fine

Now do a gridsearch for alpha value

parameters = {
  'ridge_clf__alpha': [0.1, 1, 10],
}

model_cv = GridSearchCV(model, parameters)

model_cv.fit(X_train, y_train)

"RuntimeError: Cannot clone object MiniRocketMultivariate(random_state=42), as the constructor either does not set or modifies parameter random_state"
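This error comes from sklearn's clone contract: `get_params()` must return exactly the values that were passed to `__init__`, and a constructor that modifies `random_state` triggers this `RuntimeError`. A minimal reproduction (`BadEstimator` is a made-up class, purely for illustration):

```python
from sklearn.base import BaseEstimator, clone

class BadEstimator(BaseEstimator):
    def __init__(self, random_state=None):
        # Mutating a constructor argument violates sklearn's clone contract:
        # get_params() will no longer match what was passed in.
        self.random_state = None if random_state is None else random_state + 1

try:
    clone(BadEstimator(random_state=42))
except RuntimeError as err:
    print(err)  # Cannot clone object ... modifies parameter random_state
```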

Hi @StijnBr. Sorry, it has taken me a long time to return to this issue.

Are you still having trouble with this?

Currently we are limited in terms of what we can do with RandomState instances, as numba uses its own parallel random module. While it's more or less equivalent to the NumPy random module that sklearn uses, the two are not interoperable. You can pass an integer to the random_state parameter, and that will seed the numba internal random state.

I believe the sktime implementation may have been modified recently to accept sklearn random state instances, but internally these will be ignored. I think this was done just to make the constructor conform to the standard pattern. The only valid random state input is an integer, i.e., an integer will be passed on and used to seed the internal numba random state, and anything else will just be ignored.
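The integer-only behaviour can be illustrated with a toy stand-in (transform_sketch is hypothetical, not the real sktime/numba code):

```python
import numpy as np

def transform_sketch(X, random_state=None):
    # Toy stand-in for the numba-backed transform: only an integer can be
    # forwarded to seed numba's internal RNG; anything else (e.g. a
    # RandomState instance) is ignored, so the run is effectively unseeded.
    if isinstance(random_state, (int, np.integer)):
        np.random.seed(random_state)  # mirrors seeding the numba RNG
    return X + np.random.normal(size=X.shape)

X = np.zeros((2, 3))
a = transform_sketch(X, random_state=42)
b = transform_sketch(X, random_state=42)
assert np.array_equal(a, b)  # same integer seed -> identical output
```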

Having said that, grid search is redundant with RidgeClassifierCV: it uses its own specialised internal routine to choose the alpha hyperparameter, so you just need to specify a candidate range of values, e.g., model = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)). Using grid search will be much slower.
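A sketch of that suggestion, assuming the same pipeline as above (the minirocket step is commented out so the snippet stands alone):

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import Pipeline

# RidgeClassifierCV selects alpha internally via an efficient leave-one-out
# CV routine, so no GridSearchCV wrapper is needed.
model = Pipeline([
    # ('minirocket', MiniRocketMultivariate(random_state=42)),  # as above
    ('ridge_clf', RidgeClassifierCV(alphas=np.logspace(-3, 3, 10))),
])
```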

Hi @angus924. No problem, thanks for your reply.

The issue is still not fixed for me. As you say, the random state instance is just ignored.

I am currently writing a paper in which MiniRocket is used, and since the random state is ignored, the results are not 100% reproducible. The differences are minor, however, so it's not that big of a deal. Since we compare different TSC algorithms, we want to have the same workflow for each test, which is why we use GridSearchCV instead of RidgeClassifierCV (and time-wise this is not an issue).

Ok, I see.

Unfortunately, I don't have a great solution at present.

I can think of two potential workarounds: (1) change this line so that instead of ignoring a RandomState instance completely, you copy the seed out of the RandomState instance and pass it into the numba code as an integer seed; or (2) move np.random.randint(num_instances) from this line to _fit_biases(...).

For option (2), you'd need to generate an index in _fit(...), representing a set of randomly-sampled training examples (with replacement), and then pass that index to _fit_biases(...) as an additional parameter. This would mean that the random sampling is done outside the numba code, and the RandomState could be used "as is".