HSF / PyHEP.dev-workshops

PyHEP Developer workshops

Home Page: https://indico.cern.ch/e/PyHEP2023.dev

Virtual datasets via task graph serialization

nsmith- opened this issue

If a task graph that transforms the input data is available (as it would be with uproot.lazy + dask-awkward), the following are interchangeable representations of a reduced output dataset for the purposes of caching:

  • Serialized task graph + input data
  • Serialized partial task graph + partially-reduced input data
  • Output dataset
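To make the equivalence concrete, here is a minimal standard-library sketch, not tied to dask or uproot. The graph format (a JSON mapping from key to an operation name and its dependencies), the `OPS` registry, and the `materialize` helper are all hypothetical names invented for illustration; a real dask graph would be serialized differently.

```python
import json

# Hypothetical registry of named operations, so the graph can be
# serialized as plain JSON instead of pickled callables.
OPS = {
    "select": lambda events: [e for e in events if e["pt"] > 20.0],
    "pts": lambda events: [e["pt"] for e in events],
}


def execute(graph, key, data):
    """Return the value of `key`: stored data wins, otherwise run its task."""
    if key in data:
        return data[key]
    task = graph[key]
    args = [execute(graph, dep, data) for dep in task["deps"]]
    return OPS[task["op"]](*args)


def materialize(serialized_graph, data, key="output"):
    """Rebuild the dataset from a serialized graph plus whatever data is stored."""
    return execute(json.loads(serialized_graph), key, data)


events = [{"pt": 10.0}, {"pt": 25.0}, {"pt": 40.0}]

full_graph = {
    "selected": {"op": "select", "deps": ["input"]},
    "output": {"op": "pts", "deps": ["selected"]},
}
partial_graph = {"output": full_graph["output"]}

# 1. Serialized task graph + input data
from_full = materialize(json.dumps(full_graph), {"input": events})

# 2. Serialized partial task graph + partially-reduced input data
selected = OPS["select"](events)
from_partial = materialize(json.dumps(partial_graph), {"selected": selected})

# 3. Output dataset (nothing left to compute)
from_output = materialize(json.dumps({}), {"output": from_full})
```

All three calls yield the same list, so a cache is free to store whichever representation is cheapest and recompute the rest on demand.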

What do we need to make a caching system that can optimize between these?

Yes - we do.

Dask distributed's dataset interface is an interesting thread to pull on here (when mixed with dask.persist and company).
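As a rough sketch of what that could look like, assuming "dataset interface" here means the published-datasets API on `dask.distributed.Client` (`publish_dataset`, `list_datasets`, `get_dataset`, `unpublish_dataset`): a partially-reduced collection is persisted on the cluster and then published under a name that other clients can look up. The bag contents, the `reduced_events` name, and the `run_example` function are made up for illustration.

```python
import dask.bag as db
from dask.distributed import Client


def run_example():
    # In-process scheduler and worker, so the sketch needs no external cluster.
    client = Client(processes=False, dashboard_address=None)
    try:
        # Stand-in for a lazily loaded input dataset.
        events = db.from_sequence(range(10), npartitions=2)
        reduced = events.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

        # Compute the reduction and keep the results in cluster memory.
        persisted = client.persist(reduced)

        # Publish under a name; any client connected to this scheduler can fetch it.
        client.publish_dataset(reduced_events=persisted)
        names = list(client.list_datasets())
        result = client.get_dataset("reduced_events").compute()

        client.unpublish_dataset("reduced_events")
        return names, result
    finally:
        client.close()


if __name__ == "__main__":
    print(run_example())
```

Published datasets live only as long as the scheduler does, so this covers the in-memory end of the caching spectrum; the serialized-graph representations would still need separate, durable storage.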