"Fit" method triggers "ValueError: output array is read-only" when using Dask
DanJonesOcean opened this issue · comments
I am attempting to use pyXpcm to carry out unsupervised classification on data from the UK Met Office UKESM model. I am accessing this data on ocean.pangeo.io, so I am using Dask.
Here is some of the output of the "fit" method before it errors:
----- START OF CODE AND OUTPUT -----
gmm.fit(training, features=features_in_ds, dim=features_zdim)
> Start preprocessing for action 'fit'
> Preprocessing xarray dataset 'TEMP' as PCM feature 'temperature'
[<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((415646,), (46,))] X RAVELED with success
Output axis is in the input axis, not need to interpolate, simple intersection
[<class 'xarray.core.dataarray.DataArray'>, <class 'dask.array.core.Array'>, ((415646,), (46,))] X INTERPOLATED with success
(...)
/srv/conda/envs/notebook/lib/python3.7/site-packages/pyxpcm/models.py in preprocessing_this(self, da, dim, feature_name, action)
671 except ValueError:
672 if self._debug: print("\t\t Fail to scale.transform with copy, fall back on input copy")
--> 673 X.data = self._scaler[feature_name].transform(X.data.copy())
674 pass
675 except:
/srv/conda/envs/notebook/lib/python3.7/site-packages/sklearn/preprocessing/data.py in transform(self, X, copy)
767 else:
768 if self.with_mean:
--> 769 X -= self.mean_
770 if self.with_std:
771 X /= self.scale_
ValueError: output array is read-only
------- END OF CODE AND OUTPUT --------
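For anyone hitting the same trace: the failure mode can be reproduced with plain NumPy, independent of pyXpcm. A minimal sketch (the array here is illustrative, not from the notebook):

```python
import numpy as np

# sklearn's StandardScaler.transform does in-place ops like `X -= self.mean_`,
# which raise ValueError when the backing buffer is read-only -- as can happen
# with memory-mapped or shared-memory arrays handed back by Dask workers.
X = np.arange(6, dtype=float).reshape(3, 2)
X.setflags(write=False)  # simulate a read-only array

try:
    X -= X.mean(axis=0)  # in-place subtraction, as in StandardScaler
except ValueError as err:
    print(err)  # -> "output array is read-only"

# Copying restores writeability, which is what the X.data.copy() fallback
# in pyxpcm/models.py relies on:
Y = X.copy()
Y -= Y.mean(axis=0)  # succeeds on the writable copy
```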
Here is the full Jupyter notebook with output and errors.
For context, I was able to get the pyXpcm "Argo" example to work on ocean.pangeo.io.
Hi @DanJonesOcean
Could you indicate:
- the versions of pyxpcm, xarray, dask, and scikit-learn you're using?
- the output of
training['TEMP'].values.flags
?
Hi @gmaze. No — when I call the "fit" method without a dask.distributed client, there are no errors. So I suspect it is a Dask-related issue.
Hi @gmaze. Sorry for the delay. I'm subscribed to this comment thread, but apparently it doesn't inform me when a comment is edited. If you add new comments instead, I get emailed about them.
Here are the versions:
pyxpcm 0.4.0
xarray 0.14.1
dask 2.9.0
sklearn 0.22
And the output you requested (from a session that does not use a Dask distributed cluster):
[1]: training['TEMP'].values.flags
[1]: C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
The output appears to be the same after I activate a Dask cluster. But I'm not confident enough in my Dask knowledge to say whether Dask would actually be called to deal with loading a small NetCDF file.
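For reference, here is one way to run the same flags check on the concrete array behind a Dask computation (the array and names are illustrative, not from the notebook):

```python
import numpy as np
import dask.array as da

# Build a small chunked array, materialise it, then inspect the flags
# of the resulting concrete numpy array.
arr = da.from_array(np.random.rand(10, 4), chunks=(5, 4))
result = arr.compute()

print(result.flags)            # same style of block as the output above
print(result.flags.writeable)  # True here; a memory-mapped or shared-memory
                               # result could come back False

# A defensive copy guarantees a writable buffer either way:
writable = result if result.flags.writeable else result.copy()
```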
Hi @DanJonesOcean
Thanks for the output
I'm having trouble reproducing your error (which I have encountered before, sporadically)
I suspect this has to do with the max_nbytes parameter used by joblib, see here for instance:
scikit-learn/scikit-learn#5956 (comment)
Can you try the dask_ml library in place of sklearn?
(Simply use the backend='dask_ml' option when instantiating your PCM model.)
Hi @gmaze. Thanks for your continued efforts on this!
I instantiated the PCM model with the dask_ml backend as follows:
gmm = pcm(K=8, features=pcm_features, debug=True, backend='dask_ml')
but it produces the following error:
ValueError Traceback (most recent call last)
<ipython-input-7-2ff715a86919> in <module>
----> 1 gmm.fit(training, features=features_in_ds, dim=features_zdim)
/srv/conda/envs/notebook/lib/python3.7/site-packages/pyxpcm/models.py in fit(self, ds, features, dim)
856 with self._context('fit', self._context_args) :
857 # PRE-PROCESSING:
--> 858 X, sampling_dims = self.preprocessing(ds, features=features, dim=dim, action='fit')
859
860 # CLASSIFICATION-MODEL TRAINING:
/srv/conda/envs/notebook/lib/python3.7/site-packages/pyxpcm/models.py in preprocessing(self, ds, features, dim, action, mask)
782 dim=dim,
783 feature_name=feature_in_pcm,
--> 784 action=action)
785 xlabel = ["%s_%i"%(feature_in_pcm, i) for i in range(0, x.shape[1])]
786 if self._debug:
/srv/conda/envs/notebook/lib/python3.7/site-packages/pyxpcm/models.py in preprocessing_this(self, da, dim, feature_name, action)
694 # https://github.com/dask/dask-ml/issues/541
695 # https://github.com/dask/dask-ml/issues/542
--> 696 X.data = dask.array.from_array(X.data, chunks=X.shape)
697
698 if isinstance(X.data, dask.array.Array):
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask/array/core.py in from_array(x, chunks, name, lock, asarray, fancy, getitem, meta)
2708 if isinstance(x, Array):
2709 raise ValueError(
-> 2710 "Array is already a dask array. Use 'asarray' or " "'rechunk' instead."
2711 )
2712 if isinstance(x, (list, tuple, memoryview) + np.ScalarType):
ValueError: Array is already a dask array. Use 'asarray' or 'rechunk' instead.
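This second error is also reproducible in isolation: dask.array.from_array refuses input that is already a dask array, whereas the two alternatives the message suggests, da.asarray and .rechunk, handle that case. A minimal sketch:

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(8.0), chunks=4)

# from_array rejects an existing dask array outright:
try:
    da.from_array(x, chunks=8)
except ValueError as err:
    print(err)  # -> "Array is already a dask array. Use 'asarray' or 'rechunk' instead."

# The alternatives the error message suggests:
y = da.asarray(x)  # passes a dask array through unchanged
z = x.rechunk(8)   # explicit rechunk to a single chunk of length 8
```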
BTW, since pyxpcm isn't installed on ocean.pangeo.io, I'm running
! pip install pyxpcm
each time, so at present I'm not pinned to a frozen version (though perhaps I should be).
Hi @DanJonesOcean
Indeed — could you try the latest dev version, which you can install with:
!pip install git+git://github.com/obidam/pyxpcm.git
I'm happy to report that when I install with:
!pip install git+git://github.com/obidam/pyxpcm.git
and call the fit command as follows:
gmm.fit(training, features=features_in_ds, dim=features_zdim)
then the "fit" step finishes without error. Great! Thanks very much for your efforts.
I'll go ahead and close this issue.