pydata / xarray

N-D labeled arrays and datasets in Python

Home Page: https://xarray.dev

Can't select time with str after converting the calendar to noleap

juliettelavoie opened this issue

What happened?

I am trying to select a time period after converting the calendar to noleap and I receive the following error:

TypeError: '<' not supported between instances of 'cftime._cftime.DatetimeNoLeap' and 'str'

What did you expect to happen?

I expected the slicing to work as normal.

Minimal Complete Verifiable Example

import xarray as xr
import pandas as pd
import numpy as np
time=xr.cftime_range('2000-01-01','2006-01-01', freq='D', calendar='noleap')
temperature = np.ones(len(time))
da = xr.DataArray(
    data=temperature,
    dims=["time"],
    coords=dict(time=time,),
)
out=da.sel(time=slice('2001','2002'))  # works if array is built with cftime
da1=da.convert_calendar('noleap')
out1=da1.sel(time=slice('2001','2002'))  # works if array is built with cftime and converted

time=pd.date_range('2000-01-01','2006-01-01', freq='D',)
temperature = np.ones(len(time))
da2 = xr.DataArray(
    data=temperature,
    dims=["time"],
    coords=dict(time=time,),
)
out2=da2.sel(time=slice('2001','2002'))  # works if array is built with pandas time
da3=da2.convert_calendar('noleap')
out3=da3.sel(time=slice('2001','2002'))  # fails if array is built with pandas time and converted to noleap

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

KeyError                                  Traceback (most recent call last)
File /srv/conda/envs/notebook/lib/python3.10/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '2001'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
File /srv/conda/envs/notebook/lib/python3.10/site-packages/pandas/core/indexes/base.py:6798, in Index.get_slice_bound(self, label, side)
   6797 try:
-> 6798     slc = self.get_loc(label)
   6799 except KeyError as err:

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.

KeyError: '2001'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[1], line 24
     22 out2=da2.sel(time=slice('2001','2002')) #  # works if array is built with pandas time
     23 da3=da2.convert_calendar('noleap')
---> 24 out3=da3.sel(time=slice('2001','2002')) # fails if array is built with pandas time and convert to noleap

File /srv/conda/envs/notebook/lib/python3.10/site-packages/xarray/core/dataarray.py:1644, in DataArray.sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   1528 def sel(
   1529     self,
   1530     indexers: Mapping[Any, Any] | None = None,
   (...)
   1534     **indexers_kwargs: Any,
   1535 ) -> Self:
   1536     """Return a new DataArray whose data is given by selecting index
   1537     labels along the specified dimension(s).
   1538 
   (...)
   1642     Dimensions without coordinates: points
   1643     """
-> 1644     ds = self._to_temp_dataset().sel(
   1645         indexers=indexers,
   1646         drop=drop,
   1647         method=method,
   1648         tolerance=tolerance,
   1649         **indexers_kwargs,
   1650     )
   1651     return self._from_temp_dataset(ds)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/xarray/core/dataset.py:3126, in Dataset.sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   3058 """Returns a new dataset with each array indexed by tick labels
   3059 along the specified dimension(s).
   3060 
   (...)
   3123 
   3124 """
   3125 indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 3126 query_results = map_index_queries(
   3127     self, indexers=indexers, method=method, tolerance=tolerance
   3128 )
   3130 if drop:
   3131     no_scalar_variables = {}

File /srv/conda/envs/notebook/lib/python3.10/site-packages/xarray/core/indexing.py:192, in map_index_queries(obj, indexers, method, tolerance, **indexers_kwargs)
    190         results.append(IndexSelResult(labels))
    191     else:
--> 192         results.append(index.sel(labels, **options))
    194 merged = merge_sel_results(results)
    196 # drop dimension coordinates found in dimension indexers
    197 # (also drop multi-index if any)
    198 # (.sel() already ensures alignment)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/xarray/core/indexes.py:758, in PandasIndex.sel(self, labels, method, tolerance)
    755 coord_name, label = next(iter(labels.items()))
    757 if isinstance(label, slice):
--> 758     indexer = _query_slice(self.index, label, coord_name, method, tolerance)
    759 elif is_dict_like(label):
    760     raise ValueError(
    761         "cannot use a dict-like object for selection on "
    762         "a dimension that does not have a MultiIndex"
    763     )

File /srv/conda/envs/notebook/lib/python3.10/site-packages/xarray/core/indexes.py:497, in _query_slice(index, label, coord_name, method, tolerance)
    493 if method is not None or tolerance is not None:
    494     raise NotImplementedError(
    495         "cannot use ``method`` argument if any indexers are slice objects"
    496     )
--> 497 indexer = index.slice_indexer(
    498     _sanitize_slice_element(label.start),
    499     _sanitize_slice_element(label.stop),
    500     _sanitize_slice_element(label.step),
    501 )
    502 if not isinstance(indexer, slice):
    503     # unlike pandas, in xarray we never want to silently convert a
    504     # slice indexer into an array indexer
    505     raise KeyError(
    506         "cannot represent labeled-based slice indexer for coordinate "
    507         f"{coord_name!r} with a slice over integer positions; the index is "
    508         "unsorted or non-unique"
    509     )

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6626 
   (...)
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
   6877 start_slice = None
   6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
   6880 if start_slice is None:
   6881     start_slice = 0

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pandas/core/indexes/base.py:6801, in Index.get_slice_bound(self, label, side)
   6799 except KeyError as err:
   6800     try:
-> 6801         return self._searchsorted_monotonic(label, side)
   6802     except ValueError:
   6803         # raise the original KeyError
   6804         raise err

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pandas/core/indexes/base.py:6733, in Index._searchsorted_monotonic(self, label, side)
   6731 def _searchsorted_monotonic(self, label, side: Literal["left", "right"] = "left"):
   6732     if self.is_monotonic_increasing:
-> 6733         return self.searchsorted(label, side=side)
   6734     elif self.is_monotonic_decreasing:
   6735         # np.searchsorted expects ascending sort order, have to reverse
   6736         # everything for it to work (element ordering, search side and
   6737         # resulting value).
   6738         pos = self[::-1].searchsorted(
   6739             label, side="right" if side == "left" else "left"
   6740         )

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pandas/core/base.py:1352, in IndexOpsMixin.searchsorted(self, value, side, sorter)
   1348 if not isinstance(values, np.ndarray):
   1349     # Going through EA.searchsorted directly improves performance GH#38083
   1350     return values.searchsorted(value, side=side, sorter=sorter)
-> 1352 return algorithms.searchsorted(
   1353     values,
   1354     value,
   1355     side=side,
   1356     sorter=sorter,
   1357 )

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pandas/core/algorithms.py:1329, in searchsorted(arr, value, side, sorter)
   1325     arr = ensure_wrapped_if_datetimelike(arr)
   1327 # Argument 1 to "searchsorted" of "ndarray" has incompatible type
   1328 # "Union[NumpyValueArrayLike, ExtensionArray]"; expected "NumpyValueArrayLike"
-> 1329 return arr.searchsorted(value, side=side, sorter=sorter)

TypeError: '<' not supported between instances of 'cftime._cftime.DatetimeNoLeap' and 'str'

Anything else we need to know?

This was working with xarray version 2024.05.0.

Environment

INSTALLED VERSIONS

commit: None
python: 3.11.5 | packaged by conda-forge | (main, Aug 27 2023, 03:34:09) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.105.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: ('fr_CA', 'UTF-8')
libhdf5: 1.14.1
libnetcdf: 4.9.2
xarray: 2024.6.0
pandas: 2.2.1
numpy: 1.24.4
scipy: 1.11.2
netCDF4: 1.6.4
pydap: None
h5netcdf: 1.2.0
h5py: 3.9.0
zarr: 2.16.1
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.3.7
dask: 2023.9.2
distributed: 2023.9.2
matplotlib: 3.8.0
cartopy: 0.22.0
seaborn: 0.13.1
numbagg: None
fsspec: 2023.9.2
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.7.2
numpy_groupies: 0.10.1
setuptools: 68.2.2
pip: 23.2.1
conda: installed
pytest: 7.4.2
mypy: None
IPython: 8.15.0
sphinx: 7.2.6

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

Thanks for the well-written report @juliettelavoie—a git bisect suggests this issue was introduced in #9002. I'll try and dig into it more when I get a chance.

@hmaarrfk do you have time to take a look here?

I think this is maybe the crux of the issue: convert_calendar initially assigns an array of times that includes NaNs to the time coordinate, which cannot be converted to a CFTimeIndex. In a subsequent step it drops those values. Previously, after the drop, the index would automatically be converted to a CFTimeIndex, but now it is not:

>>> import cftime; import numpy as np; import xarray as xr
>>> times_with_nan = [cftime.DatetimeNoLeap(2000, 1, 1), np.nan]
>>> da = xr.DataArray(range(2), dims=["time"], coords=[times_with_nan])
>>> da.indexes
Indexes:
    time     Index([2000-01-01 00:00:00, nan], dtype='object', name='time')
>>> da = da.where(da.time.notnull(), drop=True)
>>> da.indexes
Indexes:
    time     Index([2000-01-01 00:00:00], dtype='object', name='time')

It does seem a bit surprising that this worked before. A workaround within convert_calendar would be to reassign the time coordinate after dropping the NaN values, which causes xarray to re-determine whether it can be converted to a CFTimeIndex:

>>> da["time"] = da.time
>>> da.indexes
Indexes:
    time     CFTimeIndex([2000-01-01 00:00:00],
            dtype='object', length=1, calendar='noleap', freq=None)
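
Applied to the reproducer from the report, that reassignment should, I believe, restore string slicing (a rough sketch, not re-run against this exact example):

>>> da3 = da2.convert_calendar('noleap')
>>> da3["time"] = da3.time  # reassign so xarray rebuilds the time index as a CFTimeIndex
>>> out3 = da3.sel(time=slice('2001', '2002'))  # no longer raises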

Or, perhaps clearer in this context, we could explicitly cast the index to a CFTimeIndex after the drop, which would raise an error instead of potentially letting a non-convertible index pass through silently:

>>> da["time"] = xr.CFTimeIndex(da.indexes["time"])
>>> da.indexes
Indexes:
    time     CFTimeIndex([2000-01-01 00:00:00],
            dtype='object', length=1, calendar='noleap', freq=None)

Sorry, I didn't reply. I'll try to take a look (sometimes GitHub mentions explode).

My first feeling is that, given the very high quality of the reproducer, one could include it as a test and remove the offending optimization in #9002. While it does hurt me, I understand that this is likely the right thing to do to avoid larger breakages.
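
For what it's worth, a regression test based on the reproducer could look roughly like this (a sketch only; the test name and its placement in the suite are made up, and it has not been run):

import numpy as np
import pandas as pd
import xarray as xr


def test_sel_string_slice_after_convert_calendar():
    # Reproducer from this issue: build with a pandas DatetimeIndex,
    # convert to noleap, then select a period with string labels.
    time = pd.date_range('2000-01-01', '2006-01-01', freq='D')
    da = xr.DataArray(np.ones(len(time)), dims=["time"], coords=dict(time=time))
    converted = da.convert_calendar('noleap')
    out = converted.sel(time=slice('2001', '2002'))
    # two full noleap years of daily data
    assert out.sizes["time"] == 2 * 365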

Thanks @hmaarrfk, it looks like your PR may have only uncovered some implicit assumptions in the code base.

No worries @hmaarrfk. It is somewhat odd that previously xarray would magically convert an Index to a CFTimeIndex once all NaNs were removed. I agree with @dcherian—this might just be something we need to fix within convert_calendar itself.
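
For illustration, here is a self-contained sketch of the pattern such a fix could use, based on the reassignment idea above (a hypothetical helper, not xarray's actual convert_calendar internals):

import cftime
import numpy as np
import xarray as xr


def _drop_missing_and_rebuild_index(obj, dim="time"):
    # Hypothetical helper: drop dates that do not exist in the target calendar
    # (encoded as NaN) and reassign the coordinate so that xarray re-checks
    # whether the remaining index can be promoted to a CFTimeIndex.
    obj = obj.where(obj[dim].notnull(), drop=True)
    obj[dim] = obj[dim]
    return obj


times_with_nan = [cftime.DatetimeNoLeap(2000, 1, 1), np.nan]
da = xr.DataArray(range(2), dims=["time"], coords=[times_with_nan])
da = _drop_missing_and_rebuild_index(da)
print(da.indexes["time"])  # now a CFTimeIndex rather than a plain object Index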