intake / intake

Intake is a lightweight package for finding, investigating, loading and disseminating data.

Home Page: https://intake.readthedocs.io/

Intake 2.0.0: ValueError: storage_options passed with non-fsspec path

observingClouds opened this issue

I am trying to open an intake catalog that worked without problems before intake release 2.0.0. I know that intake 2 is currently in beta and that I could pin an older version of intake, but I wanted to raise this issue anyway. I couldn't find any documentation on whether this is still a valid catalog under intake 2 (and should remain compatible) or whether it has to be adapted.

>>> import intake
>>> cat = intake.open_catalog("https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/subcatalogs/conus404-catalog.yml")
>>> cat['conus404-hourly-osn']
sources:
  conus404-hourly-osn:
    args:
      consolidated: true
      storage_options:
        anon: true
        client_kwargs:
          endpoint_url: https://usgs.osn.mghpcc.org/
        requester_pays: false
      urlpath: s3://hytest/conus404/conus404_hourly.zarr
    description: 'CONUS404 Hydro Variable subset, 40 years of hourly values. These
      files were created wrfout model output files (see ScienceBase data release for
      more details: https://www.sciencebase.gov/catalog/item/6372cd09d34ed907bf6c6ab1).
      You can work with this data for free in any environment (there are no egress
      fees).'
    driver: intake_xarray.xzarr.ZarrSource
    metadata:
      catalog_dir: https://raw.githubusercontent.com/hytest-org/hytest/main/dataset_catalog/subcatalogs
>>> cat['conus404-hourly-osn'].to_dask()

With intake==2.0.0:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/haukeschulz/mambaforge/envs/test/lib/python3.12/site-packages/intake_xarray/base.py", line 69, in to_dask
    return self.read_chunked()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/haukeschulz/mambaforge/envs/test/lib/python3.12/site-packages/intake_xarray/base.py", line 44, in read_chunked
    self._load_metadata()
  File "/Users/haukeschulz/mambaforge/envs/test/lib/python3.12/site-packages/intake/source/base.py", line 84, in _load_metadata
    self._schema = self._get_schema()
                   ^^^^^^^^^^^^^^^^^^
  File "/Users/haukeschulz/mambaforge/envs/test/lib/python3.12/site-packages/intake_xarray/base.py", line 18, in _get_schema
    self._open_dataset()
  File "/Users/haukeschulz/mambaforge/envs/test/lib/python3.12/site-packages/intake_xarray/xzarr.py", line 46, in _open_dataset
    self._ds = xr.open_dataset(self.urlpath, **kw)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/haukeschulz/mambaforge/envs/test/lib/python3.12/site-packages/xarray/backends/api.py", line 572, in open_dataset
    backend_ds = backend.open_dataset(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/haukeschulz/mambaforge/envs/test/lib/python3.12/site-packages/xarray/backends/zarr.py", line 1011, in open_dataset
    store = ZarrStore.open_group(
            ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/haukeschulz/mambaforge/envs/test/lib/python3.12/site-packages/xarray/backends/zarr.py", line 464, in open_group
    zarr_group = zarr.open_consolidated(store, **open_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/haukeschulz/mambaforge/envs/test/lib/python3.12/site-packages/zarr/convenience.py", line 1334, in open_consolidated
    store = normalize_store_arg(
            ^^^^^^^^^^^^^^^^^^^^
  File "/Users/haukeschulz/mambaforge/envs/test/lib/python3.12/site-packages/zarr/storage.py", line 197, in normalize_store_arg
    return normalize_store(store, storage_options, mode)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/haukeschulz/mambaforge/envs/test/lib/python3.12/site-packages/zarr/storage.py", line 169, in _normalize_store_arg_v2
    raise ValueError("storage_options passed with non-fsspec path")
ValueError: storage_options passed with non-fsspec path

Opening the dataset with xarray or zarr directly was not an issue:

import xarray as xr
xr.open_zarr(
    "s3://hytest/conus404/conus404_hourly.zarr",
    storage_options={
        "anon": True,
        "requester_pays": False,
        "client_kwargs": {"endpoint_url": "https://usgs.osn.mghpcc.org"},
    },
)
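
The error message suggests that the path reaching zarr no longer looked like an fsspec URL, even though the catalog entry lists an `s3://` path. As a rough illustration only, here is a minimal sketch of that kind of check. The function name `check_store_path` is hypothetical and this is not zarr's actual code; it assumes that a string containing `://` or `::` counts as an fsspec path, and that a plain path combined with `storage_options` is rejected.

```python
from typing import Optional


def check_store_path(path: str, storage_options: Optional[dict] = None) -> str:
    """Hypothetical stand-in for the validation hinted at by the traceback."""
    # Assumption: URLs such as "s3://bucket/key" or chained "zip::s3://..."
    # are treated as fsspec-style paths and may carry storage_options.
    if "://" in path or "::" in path:
        return "fsspec"
    # A plain path cannot make use of storage_options, so the
    # combination is rejected with the message seen in the traceback.
    if storage_options:
        raise ValueError("storage_options passed with non-fsspec path")
    return "local"


opts = {"anon": True}
print(check_store_path("s3://hytest/conus404/conus404_hourly.zarr", opts))
try:
    # Same path with the protocol stripped: the check now fails.
    check_store_path("hytest/conus404/conus404_hourly.zarr", opts)
except ValueError as err:
    print(err)
```

If this reading is right, it would explain why passing the full `s3://` URL directly works while the same entry fails once something upstream alters the path before it reaches zarr.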

I believe #782 fixes this. I would appreciate it if you could confirm.

For reference, here is how you would build the entry the new (intake 2) way:

import intake
data = intake.datatypes.Zarr(
    "s3://hytest/conus404/conus404_hourly.zarr",
    storage_options={"anon": True, "endpoint_url": "https://usgs.osn.mghpcc.org/"},
    metadata={"description": "CONUS404 Hydro Variable subset, 40 years of hourly values"},
)
reader = data.to_reader("xarray", consolidated=False)
cat = intake.readers.entry.Catalog()
cat["conus404-hourly-osn"] = reader
cat.to_yaml_file("cat.yaml")

producing

aliases:
  conus404-hourly-osn: conus404-hourly-osn
data:
  95ffa5d13fb47748:
    datatype: intake.readers.datatypes:Zarr
    kwargs:
      root: ''
      storage_options:
        anon: true
        endpoint_url: https://usgs.osn.mghpcc.org/
      url: s3://hytest/conus404/conus404_hourly.zarr
    metadata:
      description: CONUS404 Hydro Variable subset, 40 years of hourly values
    user_parameters: {}
entries:
  conus404-hourly-osn:
    kwargs:
      consolidated: false
      data: '{data(95ffa5d13fb47748)}'
    metadata:
      description: CONUS404 Hydro Variable subset, 40 years of hourly values
    output_instance: xarray:Dataset
    reader: intake.readers.readers:XArrayDatasetReader
    user_parameters: {}
metadata: {}
user_parameters: {}
version: 2
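
In the YAML above, the entry does not repeat the Zarr definition; its `data` keyword holds the string `'{data(95ffa5d13fb47748)}'`, which matches a key in the top-level `data` section. The sketch below shows one way such a reference could be followed, using a trimmed copy of the catalog as a plain dict. The helper `resolve_data_ref` and the regular expression are hypothetical and only illustrate the layout; this is not how intake itself resolves references.

```python
import re

# Trimmed copy of the catalog above as a plain dict (assumption: the YAML
# loads into this shape; unrelated keys are left out for brevity).
catalog = {
    "data": {
        "95ffa5d13fb47748": {
            "datatype": "intake.readers.datatypes:Zarr",
            "kwargs": {"url": "s3://hytest/conus404/conus404_hourly.zarr"},
        }
    },
    "entries": {
        "conus404-hourly-osn": {
            "kwargs": {"consolidated": False, "data": "{data(95ffa5d13fb47748)}"},
            "reader": "intake.readers.readers:XArrayDatasetReader",
        }
    },
}

# Matches strings of the form "{data(<token>)}" and captures the token.
DATA_REF = re.compile(r"^\{data\((\w+)\)\}$")


def resolve_data_ref(cat: dict, entry_name: str) -> dict:
    """Hypothetical helper: follow an entry's data reference to its data block."""
    ref = cat["entries"][entry_name]["kwargs"]["data"]
    match = DATA_REF.match(ref)
    if match is None:
        raise ValueError(f"not a data reference: {ref!r}")
    return cat["data"][match.group(1)]


print(resolve_data_ref(catalog, "conus404-hourly-osn")["kwargs"]["url"])
```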

cc @rsignell-usgs

(the perfectly valid alternative, of course, is to pin intake<2.0)
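
If you go the pinning route, a small guard can make the assumption explicit at run time. The following is a minimal sketch using only the standard library; the helper name `is_pre_v2` is hypothetical, and it looks only at the leading major version number.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def is_pre_v2(version_string: str) -> bool:
    """Return True if the leading major version number is below 2."""
    match = re.match(r"\d+", version_string)
    if match is None:
        raise ValueError(f"cannot parse version: {version_string!r}")
    return int(match.group(0)) < 2


try:
    installed = version("intake")
except PackageNotFoundError:
    # intake is not installed in this environment; nothing to check.
    installed = None

if installed is not None and not is_pre_v2(installed):
    print(f"intake {installed} is installed; catalogs written for intake 1 may need adapting")
```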

Thanks @martindurant for the quick response and the fix. It works with the current HEAD!
I also appreciate the additional documentation.