"Fit" method triggers "ValueError: output array is read-only" when using Dask
DanJonesOcean opened this issue · comments
I am attempting to use pyXpcm to carry out unsupervised classification on data from the UK Met Office UKESM model. I am accessing this data on ocean.pangeo.io, so I am using Dask.
Here is some of the output of the "fit" method before it errors:
----- START OF CODE AND OUTPUT -----
gmm.fit(training, features=features_in_ds, dim=features_zdim)
> Start preprocessing for action 'fit'
> Preprocessing xarray dataset 'TEMP' as PCM feature 'temperature'
[<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((415646,), (46,))] X RAVELED with success
Output axis is in the input axis, not need to interpolate, simple intersection
[<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((415646,), (46,))] X INTERPOLATED with success
(...)
/srv/conda/envs/notebook/lib/python3.7/site-packages/pyxpcm/models.py in preprocessing_this(self, da, dim, feature_name, action)
671 except ValueError:
672 if self._debug: print("\t\t Fail to scale.transform with copy, fall back on input copy")
--> 673 X.data = self._scaler[feature_name].transform(X.data.copy())
674 pass
675 except:
/srv/conda/envs/notebook/lib/python3.7/site-packages/sklearn/preprocessing/data.py in transform(self, X, copy)
767 else:
768 if self.with_mean:
--> 769 X -= self.mean_
770 if self.with_std:
771 X /= self.scale_
ValueError: output array is read-only
------- END OF CODE AND OUTPUT --------
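For anyone hitting the same trace: the failure mode can be reproduced with plain NumPy, independent of pyXpcm. A minimal sketch (the array here is illustrative, not from the notebook):

```python
import numpy as np

# sklearn's StandardScaler.transform does in-place ops like `X -= self.mean_`,
# which raise ValueError when the backing buffer is read-only -- as can happen
# with memory-mapped or shared-memory arrays handed back by Dask workers.
X = np.arange(6, dtype=float).reshape(3, 2)
X.setflags(write=False)  # simulate a read-only array

try:
    X -= X.mean(axis=0)  # in-place subtraction, as in StandardScaler
except ValueError as err:
    print(err)  # -> "output array is read-only"

# Copying restores writeability, which is what the X.data.copy() fallback
# in pyxpcm/models.py relies on:
Y = X.copy()
Y -= Y.mean(axis=0)  # succeeds on the writable copy
```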
Here is the full Jupyter notebook with output and errors.
For context, I was able to get the pyXpcm "Argo" example to work on ocean.pangeo.io.
Hi @DanJonesOcean
Could you indicate:
- the versions of pyxpcm, xarray, dask, and scikit-learn you're using?
- the output of
training['TEMP'].values.flags
?
Hi @gmaze. No — when I call the "fit" method without a dask.distributed client, there are no errors. So I suspect it is a Dask-related issue.
Hi @gmaze. Sorry for the delay. I'm subscribed to this comment thread, but apparently it doesn't inform me when a comment is edited. If you add new comments instead, I get emailed about them.
Here are the versions:
pyxpcm 0.4.0
xarray 0.14.1
dask 2.9.0
sklearn 0.22
And the output you requested (from a session that does not use a Dask distributed cluster):
[1]: training['TEMP'].values.flags
[1]: C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
The output appears to be the same after I activate a Dask cluster. But I'm not confident enough in my Dask knowledge to say whether Dask would actually be called to deal with loading a small NetCDF file.
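For reference, here is one way to run the same flags check on the concrete array behind a Dask computation (the array and names are illustrative, not from the notebook):

```python
import numpy as np
import dask.array as da

# Build a small chunked array, materialise it, then inspect the flags
# of the resulting concrete numpy array.
arr = da.from_array(np.random.rand(10, 4), chunks=(5, 4))
result = arr.compute()

print(result.flags)            # same style of block as the output above
print(result.flags.writeable)  # True here; a memory-mapped or shared-memory
                               # result could come back False

# A defensive copy guarantees a writable buffer either way:
writable = result if result.flags.writeable else result.copy()
```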
Hi @DanJonesOcean
Thanks for the output
I'm having trouble reproducing your error (which I have encountered before, sporadically)
I suspect this has to do with the max_nbytes parameter used by joblib, see here for instance:
scikit-learn/scikit-learn#5956 (comment)
Can you try the dask_ml library in place of sklearn?
(Simply use the backend='dask_ml' option when instantiating your PCM model.)
Hi @gmaze. Thanks for your continued efforts on this!
I instantiated the PCM model with the dask_ml backend as follows:
gmm = pcm(K=8, features=pcm_features, debug=True, backend='dask_ml')
but it produces the following error:
ValueError Traceback (most recent call last)
<ipython-input-7-2ff715a86919> in <module>
----> 1 gmm.fit(training, features=features_in_ds, dim=features_zdim)
/srv/conda/envs/notebook/lib/python3.7/site-packages/pyxpcm/models.py in fit(self, ds, features, dim)
856 with self._context('fit', self._context_args) :
857 # PRE-PROCESSING:
--> 858 X, sampling_dims = self.preprocessing(ds, features=features, dim=dim, action='fit')
859
860 # CLASSIFICATION-MODEL TRAINING:
/srv/conda/envs/notebook/lib/python3.7/site-packages/pyxpcm/models.py in preprocessing(self, ds, features, dim, action, mask)
782 dim=dim,
783 feature_name=feature_in_pcm,
--> 784 action=action)
785 xlabel = ["%s_%i"%(feature_in_pcm, i) for i in range(0, x.shape[1])]
786 if self._debug:
/srv/conda/envs/notebook/lib/python3.7/site-packages/pyxpcm/models.py in preprocessing_this(self, da, dim, feature_name, action)
694 # https://github.com/dask/dask-ml/issues/541
695 # https://github.com/dask/dask-ml/issues/542
--> 696 X.data = dask.array.from_array(X.data, chunks=X.shape)
697
698 if isinstance(X.data, dask.array.Array):
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/array/core.py in from_array(x, chunks, name, lock, asarray, fancy, getitem, meta)
2708 if isinstance(x, Array):
2709 raise ValueError(
-> 2710 "Array is already a dask array. Use 'asarray' or " "'rechunk' instead."
2711 )
2712 if isinstance(x, (list, tuple, memoryview) + np.ScalarType):
ValueError: Array is already a dask array. Use 'asarray' or 'rechunk' instead.
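This second error is also reproducible in isolation: dask.array.from_array refuses input that is already a dask array, whereas the two alternatives the message suggests, da.asarray and .rechunk, handle that case. A minimal sketch:

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(8.0), chunks=4)

# from_array rejects an existing dask array outright:
try:
    da.from_array(x, chunks=8)
except ValueError as err:
    print(err)  # -> "Array is already a dask array. Use 'asarray' or 'rechunk' instead."

# The alternatives the error message suggests:
y = da.asarray(x)  # passes a dask array through unchanged
z = x.rechunk(8)   # explicit rechunk to a single chunk of length 8
```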
BTW, since pyxpcm isn't installed on ocean.pangeo.io, I'm running
! pip install pyxpcm
each time, so at present I'm not pinned to a frozen version (though perhaps I should be).
Hi @DanJonesOcean
Indeed — could you try the latest dev version, which you can install with:
!pip install git+git://github.com/obidam/pyxpcm.git
I'm happy to report that when I install with:
!pip install git+git://github.com/obidam/pyxpcm.git
and call the fit command as follows:
gmm.fit(training, features=features_in_ds, dim=features_zdim)
then the "fit" step finishes without error. Great! Thanks very much for your efforts.
I'll go ahead and close this issue.