observingClouds / slkspec

fsspec filesystem for stronglink tape archive

Combination of retrievals is not working when using url chaining

observingClouds opened this issue · comments

import xarray as xr
import fsspec
m=fsspec.open_files(['simplecache::slk:///arch/mh0010/m300408/showcase/dataset*.nc'])
with m as f:
    xr.open_mfdataset(f, engine="h5netcdf")

Regarding your simplecache chain: do you think it would somehow be possible to use something other than a Levante filesystem as SLK_CACHE?

slk retrieve takes a directory as an argument which makes it pretty hard, I guess.

I don't think that is possible right now. Ideally, slk would offer a function that returns byte ranges, which could then be piped wherever one likes: an object store, a file system, ...
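To illustrate what "returning byte ranges" means, here is a minimal sketch on an ordinary local file. The helper `read_byte_range` is hypothetical and purely illustrative; slk does not provide such a function, and this code does not talk to the tape archive at all.

```python
import os
import tempfile


def read_byte_range(path, start, length):
    """Hypothetical helper: return `length` bytes of `path`, starting at offset `start`."""
    with open(path, "rb") as f:
        f.seek(start)          # jump to the requested offset
        return f.read(length)  # read only the requested part, not the whole file


# Demo on a throwaway local file (a stand-in for an archived file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789abcdef")
    demo_path = tmp.name

chunk = read_byte_range(demo_path, 4, 6)
print(chunk)  # b'456789'
os.remove(demo_path)
```

A backend that can serve such ranges lets the caller decide where the bytes end up, instead of always materialising the full file in a cache directory.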

@neumannd, this might be something of interest for you.

@observingClouds The way slk retrieve works does not allow such a thing. It copies complete files using a multi-stream technique, and the whole StrongLink system always copies full files. Support for copying parts of a file would probably go against its design. I have an unofficial slk_helpers command that can also retrieve files. It is very slow and is meant for retrieving the index files of packems; I would not suggest using it for large files (above a few MB). It could be adapted for certain special purposes. However, copying file parts would still not be possible, because the API only allows full files to be copied.

There might be some alternatives in the future. Maybe we could have a chat about this next week in the evening or at the beginning of 2023 (Hauke, Fabi, Flo, me, and possibly Martin if he is back).

Can you quickly explain what you are trying to achieve with:

import xarray as xr
import fsspec
m=fsspec.open_files(['simplecache::slk:///arch/mh0010/m300408/showcase/dataset*.nc'])
with m as f:
    xr.open_mfdataset(f, engine="h5netcdf")

I want to open all files that match the pattern dataset*.nc and pipe them through additional protocols via URL chaining. As a very simple example I use the additional simplecache protocol, which creates additional copies of the files. This protocol does not make much sense here, since the files are already copied to the local folder SLK_CACHE, but you can think of other protocols such as zip or tar that help you open, e.g., compressed files on the fly.
fsspec.open_files opens all these files and returns a list of file objects that xr.open_mfdataset can then read and return to the user as one merged dataset.
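As a rough illustration of the same chaining idea without access to the tape archive, the sketch below chains the zip protocol onto the local filesystem. The archive, the file names, and their text contents are made up for the example; the files stand in for dataset*.nc, and the local file protocol stands in for slk.

```python
import os
import tempfile
import zipfile

import fsspec

# Build a small zip archive with two matching files and one non-matching file.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "archive.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("dataset1.txt", "first")
    zf.writestr("dataset2.txt", "second")
    zf.writestr("other.txt", "ignored")

# URL chaining: the glob is applied inside the zip, which lives on the local filesystem.
url = f"zip://dataset*.txt::file://{zip_path}"
files = fsspec.open_files(url, mode="rt")

# Entering the context opens every matched file and yields a list of file objects.
with files as handles:
    contents = [h.read() for h in handles]

print(contents)
```

The pattern is the same as in the snippet from the issue: the left-most protocol is applied last, and each `::` hands the result of the protocol on its right to the one on its left.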

Just as a comment: this worked in my initial implementation if you would like to see this in action.

Sorry, closing this for now as the MRE is working.