HSF / PyHEP.dev-workshops

PyHEP Developer workshops

Home Page: https://indico.cern.ch/e/PyHEP2023.dev

Data access and formats

jpivarski opened this issue

Integrating big data services, disk and wire formats, and analysis interface clients for streamlined input data flow.

I am planning to talk about offloading queries into storage systems to accelerate Parquet access (Skyhook), better protocols (RDMA-based) for transporting columnar data, and new query interfaces (Substrait.io) and query languages (Malloy) for easier data access.
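For concreteness, this is the kind of pushdown involved (not Skyhook itself, which executes these operations inside the storage layer): pyarrow can already push column projections and row filters down into a Parquet scan. A minimal sketch, with a hypothetical file and column names:

```python
import pyarrow.dataset as ds

# Hypothetical Parquet dataset of events; the projection and the filter are
# applied during the scan, so only the selected columns (and row groups that
# can possibly match the predicate) are read from storage.
events = ds.dataset("events.parquet", format="parquet")
table = events.to_table(
    columns=["pt", "eta", "phi"],      # column projection
    filter=ds.field("pt") > 30.0,      # predicate pushdown
)
print(table.num_rows)
```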

+1 Looking forward to this discussion in general, and perhaps to getting some feedback and perspective on a hepfile project I've been working on with some students.

https://hepfile.readthedocs.io/en/latest/introduction.html

At FCC we store event data in the EDM4hep format, and I would like to discuss the hurdles that arise with such a minimalistic data format at the analysis level.

Interested too, and one thing to throw in that has bothered me for a long time as a "missing feature": how do you modify a dataset, reasonably? Sure, we shouldn't. Sure, it's space-inefficient. But I think this is very often needed in the prototyping phase of an analysis, where we would like to just modify, try again, and modify again (no matter if it takes more space) instead of copying everything over. Say in a ROOT file, how do you replace a branch (yes, I'm aware of the technical challenges in terms of memory layout etc., but the need from the user side still stands)?

More generally, how can I manage a) a single filename and b) the ability to easily replace single columns (this could be multiple files hidden behind one name, or whatever; I just never came across a decent solution), even if it is inefficient?
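For reference, the closest thing to "replacing a branch" today is to rewrite the whole tree, which is exactly the copying-over described above. A minimal sketch with uproot (file, tree, and branch names are hypothetical):

```python
import uproot

# Read all branches of the (hypothetical) tree into memory.
with uproot.open("input.root") as fin:
    arrays = fin["events"].arrays()

# "Replace" one branch by overwriting it in memory...
arrays["weight"] = arrays["weight"] * 2.0

# ...then write everything back out as a new file.
with uproot.recreate("output.root") as fout:
    fout["events"] = {name: arrays[name] for name in arrays.fields}
```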

@kjvbrt you might be interested in CoffeaTeam/coffea#822

@jonas-eschle a fun place to dig into that one would be the caching policies and analysis data lifecycles in #20. Nominally you don't modify the original dataset; you just add new columns that you can join back in efficiently (i.e. they are keyed the same way). You then also need to define how long this new data lives.
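A minimal sketch of that pattern: write the derived column to a sidecar Parquet file keyed the same way as the original (here simply by row order), and join it back in at read time instead of modifying the original. All file and column names are assumptions:

```python
import awkward as ak

# Original (immutable) dataset and a derived column computed during prototyping.
events = ak.from_parquet("events.parquet")
corrected_pt = events["pt"] * 1.01

# Write only the new column to a sidecar file; row order is the implicit key,
# so it must match the original dataset exactly.
ak.to_parquet(ak.Array({"corrected_pt": corrected_pt}), "events_corrected_pt.parquet")

# Later: read both and attach the sidecar column to the original records.
sidecar = ak.from_parquet("events_corrected_pt.parquet")
events = ak.with_field(events, sidecar["corrected_pt"], "corrected_pt")
```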

Column writeback has been something we've wanted for a long time! It's sort of possible with S3 and some other things, but it doesn't quite have the semantics we'd like for users.

@mattbellis, not to offend, but you are aware of Parquet, right? It is a very portable (multi-language) file format used throughout industry data science. I don't think you need to set your sights on HDF5 specifically, since Parquet exists and is already well adopted. Part of your project seems quite similar, and using it may help you refine the specific user-interface properties you want rather than fabricating a whole structured serialization system yourself. It is already fully interoperable with Awkward and pandas and is easy to use with dask(-awkward, -dataframe).
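If it helps, here is a minimal sketch of that interoperability (hypothetical file name, assuming awkward 2.x and dask-awkward are installed):

```python
import awkward as ak
import dask_awkward as dak

# Awkward <-> Parquet round trip, including jagged (event/particle) structure.
events = ak.Array({"pt": [[20.1, 35.2], [], [50.0]], "nhits": [[7, 12], [], [3]]})
ak.to_parquet(events, "events.parquet")
back = ak.from_parquet("events.parquet")

# The same file can be read lazily and in partitions with dask-awkward.
lazy = dak.from_parquet("events.parquet")
print(lazy["pt"].compute())
```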

I am, of course, happy to shelve the discussion of this for the workshop.

@lgray Yup, aware of Parquet, and tbh, part of our approach is inertia from when I started playing with HDF5 before Parquet was a thing. There's nothing stopping hepfile from being rewritten with Parquet as the underlying file format (or ASCII text... or JSON... or whatever), particularly because we've tried to take care to think about the overall schema and the API, rather than just how to implement it in HDF5. HDF5 is fairly well adopted in other scientific fields as well, so when we started really trying to implement this (2017-2018), it seemed as good a format as any. The maturity and ease of use of the h5py library was a big factor too.

I'm not trying to start a flame war between file formats, lord knows. :) And I'm always interested in learning more. Hopefully we can chat more f2f at the workshop!

This might be an opportunity to discuss reliable data access as mentioned by @nsmith- this morning, in particular using XRootD (which seems to be causing quite some pain).
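As a concrete starting point for that discussion: the usual stopgap today is a blunt retry loop around opening the remote file, since the pain is mostly transient open/read errors. This is only a sketch; the URL, tree name, and the exact exception type raised by the XRootD backend are assumptions:

```python
import time
import uproot

def open_with_retries(url, retries=3, wait=5.0):
    """Retry transient failures when opening a remote file over XRootD with uproot."""
    for attempt in range(retries):
        try:
            return uproot.open(url)
        except OSError as err:  # the exact exception type depends on the XRootD backend
            if attempt == retries - 1:
                raise
            print(f"attempt {attempt + 1} failed ({err}), retrying in {wait}s")
            time.sleep(wait)

# Hypothetical remote file served over XRootD.
file = open_with_retries("root://eospublic.cern.ch//eos/experiment/some/file.root")
tree = file["events"]
```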