observingClouds / slkspec

fsspec filesystem for stronglink tape archive

Retrievals in combination with xarray depend on the cluster in use

observingClouds opened this issue

xarray seems to request different numbers of files concurrently depending on the cluster configuration:

levante interactive

import xarray as xr
ds = xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")
ds.air.mean().compute()
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.0.0|0.0.1"}}]}'
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.1.0|0.1.1"}}]}'
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"1.0.0|1.0.1"}}]}'
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"1.1.0|1.1.1"}}]}'

levante compute

import xarray as xr
ds = xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")
ds.air.mean().compute()
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.0.1
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/0.1.1
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.0.1
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.0
/scratch/m/m300408/arch/mh0010/m300408/showcase/dataset.zarr/air/1.1.1
slk search '{"$and":[{"path":{"$gte":"/arch/mh0010/m300408/showcase/dataset.zarr/air","$max_depth":1}},{"resources.name":{"$regex":"0.0.0|0.0.1|0.1.0|0.1.1|1.0.0|1.0.1|1.1.0|1.1.1"}}]}'

On the interactive node the eight chunks are requested in four pairs, each pair triggering its own slk search, whereas on the compute node all eight chunks end up in a single search. Again, this is related to my comment in #10.

A few more insights. The difference is likely caused by Dask and the shape of its task graph. Depending on the available resources, the task graph is created differently: if more resources are available, more data is requested at once.
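
This resource dependence can be reproduced directly. A minimal sketch, assuming the threaded scheduler (num_workers is a standard Dask setting; the path is the dataset from above):

import dask
import xarray as xr

ds = xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")

# Fewer workers mean fewer chunks are requested concurrently, i.e. smaller
# batches per slk search, as in the interactive-node log above.
with dask.config.set(scheduler="threads", num_workers=2):
    ds.air.mean().compute()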

#21 scans the task graph and gathers all open-dataset requests up front; it is therefore independent of the available resources.
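
Conceptually, the approach can be sketched like this (illustrative only, not the actual slkspec implementation; it assumes the loading tasks carry "open_dataset" in their key names):

import xarray as xr

ds = xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")
result = ds.air.mean()

# Inspect the task graph before executing it and collect every task that
# opens a chunk. All corresponding files can then be staged with a single
# slk search, no matter how many workers later execute the graph.
open_tasks = [key for key in result.__dask_graph__() if "open_dataset" in str(key)]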

With #21 merged, the recommended way to retrieve files is the ds.slk.stage() command, which operates independently of the available resources.
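
A minimal usage sketch (assuming that importing slkspec registers the slk accessor; the accessor and method names are taken from the comment above):

import xarray as xr
import slkspec  # assumed to register the `slk` accessor on import

ds = xr.open_mfdataset("slk:///arch/mh0010/m300408/showcase/dataset.zarr", engine="zarr")
ds.slk.stage()           # request all files needed by the dataset in one go
ds.air.mean().compute()  # the computation then reads from the staged copies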

Unfortunately, this issue seems to remain: Dask still schedules the retrievals depending on the resources available on the cluster.