fsspec / kerchunk

Cloud-friendly access to archival data

Home Page: https://fsspec.github.io/kerchunk/

kerchunking zarr from OSN, bucket not found

rsignell opened this issue

I must be doing something dumb here:

  • I'm trying to kerchunk an existing zarr dataset from OSN
  • I can successfully open the zarr dataset from OSN with xarray
  • kerchunk complains that it cannot find the same data in the bucket! Why?
import fsspec
import xarray as xr
import kerchunk.combine
import kerchunk.zarr

fs_read = fsspec.filesystem('s3', anon=True, skip_instance_cache=True, use_listings_cache=False,
                            client_kwargs={'endpoint_url': 'https://usgs.osn.mghpcc.org'})

zarr_dataset = 'genoatest/aloarca/hindcast_unstr_med_zarr_10d_15kn/WW3_medunstr_197901.zarr'

# this works:
ds = xr.open_dataset(fs_read.get_mapper(zarr_dataset), engine='zarr')
print(ds)

# this fails:
ref1 = kerchunk.zarr.single_zarr(fs_read.get_mapper(zarr_dataset), inline=0)

The kerchunk call fails with:

...
ReferenceNotReachable: Reference "MAPSTA/.zarray" failed to fetch target ['s3://genoatest/aloarca/hindcast_unstr_med_zarr_10d_15kn/WW3_medunstr_197901.zarr/MAPSTA/.zarray']

but in fact that file exists:

fs_read.info('s3://genoatest/aloarca/hindcast_unstr_med_zarr_10d_15kn/WW3_medunstr_197901.zarr/MAPSTA/.zarray')

produces:

{'ETag': '"5e26d87da53f93073033bf4c55634a29"',
 'LastModified': datetime.datetime(2024, 4, 8, 13, 50, 26, tzinfo=tzutc()),
 'size': 320,
 'name': 'genoatest/aloarca/hindcast_unstr_med_zarr_10d_15kn/WW3_medunstr_197901.zarr/MAPSTA/.zarray',
 'type': 'file',
 'StorageClass': 'STANDARD',
 'VersionId': None,
 'ContentType': 'application/octet-stream'}

Notebook here: https://gist.github.com/rsignell/b6b5639afd130f4c3287c6d1a0cc265a

It seems kerchunk.zarr.single_zarr does not correctly use the storage options when you pass in a ready-made store. It does work like this, though:

storage_options = dict(anon=True, skip_instance_cache=True, use_listings_cache=False, client_kwargs={'endpoint_url': 'https://usgs.osn.mghpcc.org'})
ref1 = kerchunk.zarr.single_zarr("s3://genoatest/aloarca/hindcast_unstr_med_zarr_10d_15kn/WW3_medunstr_197901.zarr", storage_options=storage_options, inline=0)
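For anyone landing here later, the workaround can be wrapped up roughly as below. This is only a sketch under a few assumptions: the helper names `osn_kerchunk_args` and `make_and_open_refs` are made up for illustration, the bucket allows anonymous access, and the return value of `single_zarr` is passed straight to fsspec's "reference" filesystem with the same options given as `remote_options`, so that the s3:// targets are looked up at the custom endpoint. The `consolidated=False` argument is the usual kerchunk pattern and may need adjusting for your xarray/zarr versions.

```python
def osn_kerchunk_args(path, endpoint_url):
    """Hypothetical helper: return (url, storage_options) for an
    anonymous, S3-compatible endpoint such as OSN."""
    # Accept both 'bucket/key' and 's3://bucket/key'.
    key = path[len("s3://"):] if path.startswith("s3://") else path
    storage_options = {
        "anon": True,
        "client_kwargs": {"endpoint_url": endpoint_url},
    }
    return "s3://" + key, storage_options


def make_and_open_refs(path, endpoint_url):
    """Hypothetical helper: build references from a URL plus
    storage_options (not a ready-made store), then open them."""
    # Imported here so the helper above has no third-party dependencies.
    import fsspec
    import kerchunk.zarr
    import xarray as xr

    url, storage_options = osn_kerchunk_args(path, endpoint_url)

    # Same call as in the workaround above: pass the URL and the options.
    refs = kerchunk.zarr.single_zarr(
        url, storage_options=storage_options, inline=0
    )

    # Assumption: the reference targets are s3:// URLs, so the reader
    # needs the same endpoint settings via remote_options.
    fs = fsspec.filesystem(
        "reference",
        fo=refs,
        remote_protocol="s3",
        remote_options=storage_options,
    )
    ds = xr.open_dataset(
        fs.get_mapper(""),
        engine="zarr",
        backend_kwargs={"consolidated": False},
    )
    return refs, ds
```

Calling `make_and_open_refs(zarr_dataset, 'https://usgs.osn.mghpcc.org')` with the `zarr_dataset` path from the original post should then return both the references and an xarray dataset, network access permitting.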

This is awesome @martindurant !