pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.

Home Page: https://pangeo-forge.readthedocs.io/

ValueError: Region (...) does not align with Zarr chunks ().

ghislainp opened this issue

I'm trying to merge netCDF files into a single Zarr store.

The recipe is:

import apache_beam as beam

from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type, xarray_open_kwargs={"decode_coords": "all"})
    | StoreToZarr(
        combine_dims=pattern.combine_dim_keys,
        target_root='.',
        store_name='out.zarr',
        # target_chunks=chunks,
    )
)

and the pattern is created with pattern_from_file_sequence(ncfiles, 'time').

The structure of the ncfiles is:

<xarray.Dataset>
Dimensions:                             (y: 402, x: 462, time: 365, nv: 4)
Coordinates:
  * time                                (time) datetime64[ns] 2005-04-01 ... ...
  * x                                   (x) float64 -2.975e+06 ... 2.788e+06
  * y                                   (y) float64 2.625e+06 ... -2.388e+06
Dimensions without coordinates: nv
Data variables:
    lat                                 (y, x) float32 ...
    lon                                 (y, x) float32 ...
    bounds_lat                          (y, x, nv) float32 ...
    bounds_lon                          (y, x, nv) float32 ...
    spatial_ref                         int64 ...
    snow_status_wet_dry_19H_ASC_raw     (time, y, x) float32 ...
    snow_status_wet_dry_19H_ASC_filter  (time, y, x) float32 ...
    snow_status_wet_dry_19H_DSC_raw     (time, y, x) float32 ...
    snow_status_wet_dry_19H_DSC_filter  (time, y, x) float32 ...

I get the error: ValueError: Region (slice(0, 365, None), slice(None, None, None), slice(None, None, None)) does not align with Zarr chunks (402, 462).

It seems that StoreToZarr tries to use the 'time' dimension to merge variables that do not depend on time.
When I remove the variables lat, lon, bounds_lat, and bounds_lon, it works fine.

How can I solve this problem? I did not have this problem with XarrayZarrRecipe.

Thanks for raising an issue @ghislainp. Any chance you can share the input list of netCDF files used to create the file pattern?

Sure, you can download the data from here: https://filesender.renater.fr/?s=download&token=17666c2e-d738-4447-b338-406315b08aae The link is valid for 2 weeks.

This is almost certainly due to the presence of coordinates in the data variables. I know there are other similar issues but I can't find them. Anything in the data variables without time in its dims will trigger this error.
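
For illustration, here is a quick way to spot the offending variables in a fragment (a standalone sketch with made-up array shapes matching the dataset printed above):

import numpy as np
import xarray as xr

# Mimic one input file: one variable with a time dim, one without
ds = xr.Dataset(
    {
        "snow_status_wet_dry_19H_ASC_raw": (("time", "y", "x"), np.zeros((365, 402, 462), "f4")),
        "lat": (("y", "x"), np.zeros((402, 462), "f4")),
    }
)

# Data variables lacking the concat dim are the ones that break the region write
print([name for name, v in ds.data_vars.items() if "time" not in v.dims])  # ['lat']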

Thanks for the files @ghislainp. I moved them to a temp S3 bucket and added a transform to drop the offending dims/vars to get a working example. Hope this helps.

import apache_beam as beam

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.transforms import (
    Indexed,
    OpenURLWithFSSpec,
    OpenWithXarray,
    StoreToZarr,
    T,
)

year_list = [2004, 2005, 2006]


def make_url(time):
    return f"s3://carbonplan-scratch/pgf/melt-AMSRU-Antarctic-{time}-12km.nc"


concat_dim = ConcatDim("time", year_list)
pattern = FilePattern(make_url, concat_dim)

class DropDims(beam.PTransform):
    """Drop the time-independent variables that break the region write."""

    @staticmethod
    def _drop_dims(item: Indexed[T]) -> Indexed[T]:
        index, ds = item
        ds = ds.drop_dims("nv")
        # Keep only the variables that have a time dimension
        ds = ds[
            [
                "snow_status_wet_dry_19H_ASC_raw",
                "snow_status_wet_dry_19H_ASC_filter",
                "snow_status_wet_dry_19H_DSC_raw",
                "snow_status_wet_dry_19H_DSC_filter",
            ]
        ]
        return index, ds

    def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
        return pcoll | beam.Map(self._drop_dims)

recipe = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type, xarray_open_kwargs={"decode_coords": "all"})
    | DropDims()
    | StoreToZarr(
        combine_dims=pattern.combine_dim_keys,
        target_root='.',
        store_name='out.zarr',
    )
)

with beam.Pipeline() as p:
    p | recipe

Thank you. I also obtained the same effect by removing the variables manually with NCO...
However, is there an elegant way to re-add the missing variables after the StoreToZarr? A parallel flow, perhaps? I'm a complete novice with Beam...
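
For reference, the time-independent variables could also be appended to the finished store outside of Beam (a sketch, assuming out.zarr has already been written and that ncfiles is the same list used to build the pattern):

import xarray as xr

# Grab the time-independent variables from any one input file
static = xr.open_dataset(ncfiles[0], decode_coords="all")[
    ["lat", "lon", "bounds_lat", "bounds_lon"]
]

# mode="a" adds new variables to an existing Zarr store, leaving the rest untouched
static.to_zarr("out.zarr", mode="a")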

However, is there a way to improve StoreToZarr to recover the previous behavior of XarrayZarrRecipe, which dealt correctly with these variables that do not depend on the combine dim?

I don't think you have to drop all these variables. Just move them to coords instead of data variables.
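
For example, a transform in the same shape as DropDims above could promote them to coordinates (a sketch, untested, reusing the Indexed/T annotations from the working example; the variable names come from the dataset printed earlier):

class SetCoords(beam.PTransform):
    """Move the time-independent variables to coords instead of dropping them."""

    @staticmethod
    def _set_coords(item: Indexed[T]) -> Indexed[T]:
        index, ds = item
        ds = ds.set_coords(["lat", "lon", "bounds_lat", "bounds_lon"])
        return index, ds

    def expand(self, pcoll: beam.PCollection) -> beam.PCollection:
        return pcoll | beam.Map(self._set_coords)

It would slot into the pipeline in place of DropDims().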

is there a way to improve StoreToZarr to recover the previous behavior of XarrayZarrRecipe, which dealt correctly with these variables that do not depend on the combine dim?

Are you sure about that? In the previous version, would lon and lat gain a time dimension? It's pretty ambiguous how to handle the presence of these coordinate variables in each dataset fragment.

I think I also ran into the same issue over at LEAP.
I have not confirmed that it works on Dataflow, but I think a minimal solution here could be to do something like:

...
| OpenWithXarray(xarray_open_kwargs={'preprocess':lambda ds: ds.set_coords(['list', 'of', 'offending', 'coords'])})
...

In these relatively simple cases, I wonder if we can provide a much more helpful error message by catching the ValueError: Region ... does not align with Zarr chunks ... and performing a quick test (a sketch follows the list):

  • Are there data_vars that do not include concat_dim?
    • If yes, give a more useful warning and a suggestion for how to fix it.
    • If not, just re-raise the original exception.
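
A minimal sketch of that check (hypothetical helper, not part of the current pangeo-forge-recipes API; ds is one fragment, region the target region, concat_dim the combine dimension name):

def write_fragment_with_hint(ds, store, region, concat_dim):
    # Wrap the region write so the opaque alignment error becomes actionable
    try:
        ds.to_zarr(store, region=region)
    except ValueError as e:
        if "does not align with Zarr chunks" not in str(e):
            raise
        offending = [name for name, v in ds.data_vars.items() if concat_dim not in v.dims]
        if offending:
            raise ValueError(
                f"Variables {offending} do not include the concat dim {concat_dim!r}. "
                f"Consider dropping them or moving them to coords, e.g. ds.set_coords({offending})."
            ) from e
        raise  # data_vars look fine; re-raise the original error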