pydata / xarray

N-D labeled arrays and datasets in Python

Home Page: https://xarray.dev


DataArray.mean drops coordinates

derhintze opened this issue

What happened?

Averaging the data variables along some dimension drops coordinates that also have that dimension.

What did you expect to happen?

I would expect the coordinates not to be dropped, but to be averaged along that dimension as well.

Minimal Complete Verifiable Example

import numpy as np
import xarray as xr

data = xr.DataArray(
    np.ones((3, 2)),
    dims=["dim0", "dim1"],
    coords={"foo": (("dim0", "dim1"), np.zeros((3, 2)))},
)

print(data.mean(dim="dim0"))

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

<xarray.DataArray (dim1: 2)> Size: 16B
array([1., 1.])
Dimensions without coordinates: dim1

Anything else we need to know?

I had a look at #1470 and #3510, but those appear unrelated?

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.7 (default, Jan 16 2024, 12:46:10)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2024.5.0
pandas: 2.2.2
numpy: 1.26.2
scipy: 1.13.1
netCDF4: 1.6.4
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.9.0
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 57.4.0
pip: 21.2.3
conda: None
pytest: 8.2.2
mypy: 1.10.0
IPython: 8.16.1
sphinx: None

Can confirm that the output is the same with xarray 2024.6.0.

I believe this may be intentional (I may be wrong, though): it is often not so useful to reduce the coordinates with the same operation as the data, and so xarray drops them instead.

If you really need this, you can convert them to data variables first using .reset_coords(names), do the reduction, then use .set_coords(names).
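
A minimal sketch of that round trip, applied to the example above (note that .reset_coords with drop=False requires a named DataArray, so a hypothetical name="signal" is added here):

import numpy as np
import xarray as xr

data = xr.DataArray(
    np.ones((3, 2)),
    dims=["dim0", "dim1"],
    coords={"foo": (("dim0", "dim1"), np.zeros((3, 2)))},
    name="signal",  # reset_coords(drop=False) needs a named DataArray
)

# Promote "foo" to a data variable (this returns a Dataset), reduce
# everything together, then demote "foo" back to a coordinate.
ds = data.reset_coords("foo")
reduced = ds.mean(dim="dim0").set_coords("foo")
result = reduced["signal"]  # DataArray with an averaged "foo" coordinate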

@keewis Thanks! I'm not sure it's "often" not so useful, tho ;) I can't come up with a reasonable example from our field (2D sensor data processing), but I get the point. I did what you suggest as a workaround, but I had hoped for a better solution; it's a bit tedious. The thing is, coarsen indeed does average coords by default. So some contraption like

data.coarsen({"dim0": data.sizes["dim0"]}).mean().squeeze("dim0")

would work. But that, imho, suggests that data.mean(dim="dim0") should do the same... but well, that's subjective ;)

This is indeed intentional — the role of coordinates is to hold things which aren't computed along. That's particularly the case when doing something like .shift — we don't want the coords shifting — but it's also the case with a reduction.

Are there cases where xarray is inconsistent there? Is there an example of something that "should" be a coordinate but should also be reduced over?

Maybe we could add an option to the reductions that allows changing this behavior?
Something like data.mean(dim="dim0", coords="mean"), with a default value of "drop".

But the workaround could be sufficient here.
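
Pending such an option, a small wrapper could emulate the proposed coords="mean" behavior by packaging the reset_coords/set_coords round trip. This is only a sketch: reduce_with_coords is a hypothetical helper, not part of xarray, and it assumes the DataArray is named:

import xarray as xr

def reduce_with_coords(da: xr.DataArray, dim: str, how: str = "mean") -> xr.DataArray:
    # Hypothetical helper, not part of xarray: reduce a *named* DataArray
    # along `dim`, applying the same reduction to any non-index coordinates
    # that share that dimension instead of dropping them.
    names = [k for k, v in da.coords.items() if dim in v.dims and k not in da.indexes]
    ds = da.reset_coords(names)          # promote those coords to data variables
    reduced = getattr(ds, how)(dim=dim)  # e.g. "mean", "sum", ...
    return reduced.set_coords(names)[da.name]

With the example above, reduce_with_coords(data, "dim0") would return the averaged data with an averaged "foo" coordinate.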

@max-sixty

Are there times which xarray is inconsistent there?

Well, if you set aside the coarsen behaviour I described above, where coarsen does reduce over coords, then no, not that I'm aware of. To be fair, though, it's documented that coarsen averages coords by default (see the snippet at the end of this comment).

Is there an example of where something "should" be a coordinate but should also be reduced over?

That's a hard question, since it depends on conventions of what people put into coords. We have time series of 2D sensor images as data variables that we want to operate on, plus coordinates containing metadata such as temperatures, time stamps, and measurement-specific inputs like light-source wavelength or power. In all of those cases, when averaging over the time series of 2D sensor data, we'd like to average the coordinates, too.

Granted, given there are workarounds, and we can implement our own wrapper for this sort of thing, it's not a big deal.
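
For reference, the documented coarsen default mentioned above is the coord_func argument, which defaults to "mean"; spelling it out explicitly with the example array (a sketch, equivalent to the contraption further up):

# coord_func="mean" is the documented default: coordinates sharing the
# coarsened dimension are averaged instead of dropped.
averaged = data.coarsen({"dim0": data.sizes["dim0"]}, coord_func="mean").mean()
print(averaged.squeeze("dim0"))  # "foo" survives as an averaged coordinate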

Yes, very reasonable, @derhintze!

Good point around coarsen. I do think that's somewhat specific to coarsen, where it's applying a transformation to coords / labels. I agree it makes the separation a bit fuzzier.

I would vote to retain the behavior around coords — data.mean(dim="dim0", coords="mean") seems not much simpler than moving coords to vars, and it introduces more surface area to the API...

Closing as unlikely to inspire change; please reopen if anyone disagrees.