ecmwf / cfgrib

A Python interface to map GRIB files to the NetCDF Common Data Model following the CF Convention using ecCodes

xarray_to_grib.expand_dims too aggressively tries to expand dimensions that are already represented?

milly-troller opened this issue · comments

What happened?

I'm trying to write an xarray dataset (which I believe is already more or less canonically defined) to GRIB, but it crashes with this traceback:

  File ".../lib/python3.10/site-packages/cfgrib/xarray_to_grib.py", line 272, in canonical_dataset_to_grib
    canonical_dataarray_to_grib(data_var, file, grib_keys=real_grib_keys, **kwargs)
  File "".../lib/python3.10/site-packages/cfgrib/xarray_to_grib.py", line 221, in canonical_dataarray_to_grib
    coords_names, data_var = expand_dims(data_var)
  File "".../lib/python3.10/site-packages/cfgrib/xarray_to_grib.py", line 171, in expand_dims
    data_var = data_var.expand_dims(coord_name)
  File "".../lib/python3.10/site-packages/xarray/core/dataarray.py", line 2535, in expand_dims
    ds = self._to_temp_dataset().expand_dims(dim, axis)
  File "".../lib/python3.10/site-packages/xarray/core/dataset.py", line 4243, in expand_dims
    raise ValueError(f"Dimension {d} already exists.
 ValueError: Dimension time already exists.")

Now, the issue seems to be that I do already define the time dimension; either I don't understand the whole paradigm of defining the dataset to match the wanted output, or cfgrib is a bit too aggressive in adding dimensions that it expects, even when they are already represented, if the size of the dimension coordinate happens to be 1.
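For reference, here is a minimal sketch (mine, not cfgrib code) of the underlying xarray behaviour: expand_dims refuses to add a dimension that is already present, even when it has length 1.

import numpy as np
import xarray as xr

# A DataArray that already has a length-1 'time' dimension
da = xr.DataArray(np.zeros((1, 3)), dims=["time", "x"])

try:
    da.expand_dims("time")
except ValueError as exc:
    print(exc)  # prints: Dimension time already exists.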

If I simply insert the additional condition if coord_name not in data_var.dims: on the line before the expand_dims call, the GRIB file is written and does seem to contain all the anticipated data.
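Roughly, the guard I have in mind looks like this (the surrounding cfgrib code is paraphrased; only the added condition is the actual proposal):

# Around cfgrib/xarray_to_grib.py line 171, inside expand_dims
if coord_name not in data_var.dims:  # proposed guard: skip dimensions that already exist
    data_var = data_var.expand_dims(coord_name)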

Did I miss something, or is this the way to go?

What are the steps to reproduce the bug?

import numpy as np
import xarray as xr
from datetime import datetime, timedelta
from cfgrib.xarray_to_grib import to_grib


data = np.zeros((1,1,1,10,10))
latspace = np.linspace(0,10,10)
lonspace = np.linspace(0,10,10)
model_time = datetime(2001, 9, 11)
valid_time = datetime(2001, 9, 11)
step = timedelta(413)

data_dict = {'temperature':(('time', 'valid_time', 'step', 'latitude', 'longitude'), data)}
smallset = xr.Dataset(
    data_vars=data_dict,
    coords={
    'time': (['time'], np.atleast_1d(np.datetime64(model_time.replace(tzinfo=None),'ns'))),
    'valid_time': (['valid_time'], np.atleast_1d(np.datetime64(valid_time.replace(tzinfo=None),'ns'))),
    'step': (['step'], np.atleast_1d(np.timedelta64(step,'ns'))),
    'latitude': (['latitude'], latspace), 'longitude':(['longitude'], lonspace)
    })

print(smallset)
to_grib(smallset, 'lil.grib')

This will print what I believe to be a fairly simple and plausible xarray dataset, and then crash when attempting to save it.

Version

0.9.10.4

Platform (OS and architecture)

Python 3.10.12 on Debian

Relevant log output

No response

Accompanying data

No response

Organisation

No response

Hi @milly-troller,

I think you're right; I've tried this fix and it makes sense to me. More importantly, all the existing tests still pass, along with your test case :)

So I'll add that correction to the code, along with a test. Many thanks for your contribution to tracking down the cause of the error and suggesting a solution!

Iain

I will add that there is no need to specify 'valid_time' here, as it is derived from 'time' and 'step'.
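For illustration, here is a trimmed variant of the reproduction snippet without the explicit 'valid_time' coordinate (my own sketch, not code from the thread):

import numpy as np
import xarray as xr
from datetime import datetime, timedelta

# Same single-field dataset as above, but without 'valid_time';
# cfgrib derives the valid time from 'time' + 'step' when writing.
data = np.zeros((1, 1, 10, 10))
smallset = xr.Dataset(
    data_vars={"temperature": (("time", "step", "latitude", "longitude"), data)},
    coords={
        "time": (["time"], np.atleast_1d(np.datetime64(datetime(2001, 9, 11), "ns"))),
        "step": (["step"], np.atleast_1d(np.timedelta64(timedelta(413), "ns"))),
        "latitude": (["latitude"], np.linspace(0, 10, 10)),
        "longitude": (["longitude"], np.linspace(0, 10, 10)),
    },
)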

Having said that, the following code, adapted from the test suite, does work without this modification:

import xarray as xr
import cfgrib
import pandas as pd
import numpy as np
from cfgrib.xarray_to_grib import to_grib


coords = [
    pd.date_range("2018-01-01T00:00", "2018-01-02T12:00", periods=4),
    pd.timedelta_range(0, "12h", periods=2),
    [1000.0, 850.0, 500.0],
    np.linspace(90.0, -90.0, 5),
    np.linspace(0.0, 360.0, 6, endpoint=False),
]
da = xr.DataArray(
    np.zeros((4, 2, 3, 5, 6)),
    coords=coords,
    dims=["time", "step", "isobaricInhPa", "latitude", "longitude"],
)

ds = da.to_dataset(name="t")
print(ds)

to_grib(ds, 'test.grib')

I think I will need to dig deeper into it, but that will be next week!

I've pushed that change now. The significant difference between the two examples was that yours had a single-valued dimension, whereas the one in the tests had only multi-valued dimensions, and so was not triggering this bad behaviour. Thanks again @milly-troller for reporting the issue and suggesting the solution!