kerchunking zarr from OSN, bucket not found

Question

rsignell opened this issue 3 months ago

I must be doing something dumb here:

I'm trying to kerchunk an existing Zarr dataset on OSN.
I can successfully open the Zarr dataset from OSN with xarray.
But kerchunk complains that it can't find the same data in the bucket. Why?

import fsspec
import xarray as xr
import kerchunk.combine
import kerchunk.zarr

fs_read = fsspec.filesystem('s3', anon=True, skip_instance_cache=True, use_listings_cache=False,
                            client_kwargs={'endpoint_url': 'https://usgs.osn.mghpcc.org'})

zarr_dataset = 'genoatest/aloarca/hindcast_unstr_med_zarr_10d_15kn/WW3_medunstr_197901.zarr'

# this works:
ds = xr.open_dataset(fs_read.get_mapper(zarr_dataset), engine='zarr')
print(ds)

# this fails:
ref1 = kerchunk.zarr.single_zarr(fs_read.get_mapper(zarr_dataset), inline=0)

The second call fails with:

...
ReferenceNotReachable: Reference "MAPSTA/.zarray" failed to fetch target ['s3://genoatest/aloarca/hindcast_unstr_med_zarr_10d_15kn/WW3_medunstr_197901.zarr/MAPSTA/.zarray']

but in fact that file exists:

fs_read.info('s3://genoatest/aloarca/hindcast_unstr_med_zarr_10d_15kn/WW3_medunstr_197901.zarr/MAPSTA/.zarray')

produces:

{'ETag': '"5e26d87da53f93073033bf4c55634a29"',
 'LastModified': datetime.datetime(2024, 4, 8, 13, 50, 26, tzinfo=tzutc()),
 'size': 320,
 'name': 'genoatest/aloarca/hindcast_unstr_med_zarr_10d_15kn/WW3_medunstr_197901.zarr/MAPSTA/.zarray',
 'type': 'file',
 'StorageClass': 'STANDARD',
 'VersionId': None,
 'ContentType': 'application/octet-stream'}

Notebook here: https://gist.github.com/rsignell/b6b5639afd130f4c3287c6d1a0cc265a

Martin Durant · Answer 1 · Sat May 04 2024 00:50:00 GMT+0800 (China Standard Time)

It seems kerchunk.zarr.single_zarr does not correctly use the storage options when you pass in a ready-made store (here, the mapper from fs_read.get_mapper). Passing the URL as a string together with storage_options does work, though:

storage_options = dict(anon=True, skip_instance_cache=True, use_listings_cache=False, client_kwargs={'endpoint_url': 'https://usgs.osn.mghpcc.org'})
ref1 = kerchunk.zarr.single_zarr("s3://genoatest/aloarca/hindcast_unstr_med_zarr_10d_15kn/WW3_medunstr_197901.zarr", storage_options=storage_options, inline=0)
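
For anyone landing here later, this is a rough sketch of how the references could then be opened with xarray. It is not from the thread: `make_references` and `open_references` are hypothetical helper names, and the sketch assumes the usual fsspec "reference" filesystem pattern, with the same storage options handed over as `remote_options` so the chunk reads go to the OSN endpoint too. The imports sit inside the functions so that defining them needs neither network access nor kerchunk.

```python
# Storage options from the answer above: anonymous access to the OSN endpoint.
storage_options = dict(
    anon=True,
    skip_instance_cache=True,
    use_listings_cache=False,
    client_kwargs={"endpoint_url": "https://usgs.osn.mghpcc.org"},
)

zarr_url = (
    "s3://genoatest/aloarca/hindcast_unstr_med_zarr_10d_15kn/"
    "WW3_medunstr_197901.zarr"
)


def make_references(url, options):
    """Build kerchunk references from a URL string plus storage options.

    Hypothetical helper; it only wraps the call shown in the answer above.
    """
    import kerchunk.zarr  # lazy import: only needed when the function is called

    return kerchunk.zarr.single_zarr(url, storage_options=options, inline=0)


def open_references(refs, options):
    """Open a reference dict with xarray (hypothetical helper, an assumption)."""
    import fsspec
    import xarray as xr

    fs = fsspec.filesystem(
        "reference",
        fo=refs,                 # the dict returned by single_zarr
        remote_protocol="s3",
        remote_options=options,  # same anon/endpoint settings for chunk reads
    )
    return xr.open_dataset(
        fs.get_mapper(""),
        engine="zarr",
        backend_kwargs={"consolidated": False},
    )
```

Calling `open_references(make_references(zarr_url, storage_options), storage_options)` should then give a dataset comparable to the one from the direct `xr.open_dataset` call in the question, though I have not run this against the OSN bucket.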

rsignell · Answer 2 · Sat May 04 2024 02:52:49 GMT+0800 (China Standard Time)

This is awesome, @martindurant!