HSF / PyHEP.dev-workshops

PyHEP Developer workshops

Home Page: https://indico.cern.ch/e/PyHEP2023.dev


Python bindings for RNTuple, implementation of "uproot-cpp"

lgray opened this issue

Presently, the pure-Python ROOT I/O implementation, uproot, is an extremely effective tool connecting the ROOT file format to the data-science and wider scientific Python ecosystems.

However, uproot performs many heavy GIL-bound computations that quickly limit its scaling in multithreaded environments, where we want multiple data streams feeding downstream processing code. This forbids interesting compute topologies like large thread-reentrant histogram filling and imposes the small tax of needing to spawn processes, each with its own Python interpreter (as opposed to threads sharing a single interpreter), to achieve parallel data processing.
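For a sense of what that tax looks like in practice, here is a minimal sketch comparing a thread pool to a process pool for reading branches with uproot; the file name and branch names are hypothetical:

```python
# Minimal sketch of the thread-vs-process trade-off described above.
# Assumptions: a file "events.root" containing a TTree "Events" with the
# branches below -- all names are hypothetical.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

import uproot

BRANCHES = ["Muon_pt", "Muon_eta", "Muon_phi"]


def read_branch(branch):
    # Much of uproot's decompression and array assembly is GIL-bound
    # Python, so threads largely serialize in this call.
    with uproot.open("events.root") as f:
        return f["Events"][branch].array()


if __name__ == "__main__":
    # Threads share one interpreter but contend for the GIL ...
    with ThreadPoolExecutor(max_workers=len(BRANCHES)) as pool:
        thread_results = list(pool.map(read_branch, BRANCHES))

    # ... which is why parallel reading ends up paying for separate
    # processes, each with its own interpreter.
    with ProcessPoolExecutor(max_workers=len(BRANCHES)) as pool:
        process_results = list(pool.map(read_branch, BRANCHES))
```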

Looking to the future: with RNTuple, Feather (which already has a Python-bound C++ implementation for this reason), and other similar high-throughput formats, it seems prudent to develop GIL-friendly Python packages for these HEP-specific data sources.

  • Achieving this would require a whole new implementation of uproot (perhaps focusing only on array I/O at first) with a Cython or C(++) backend
  • RNTuple is being implemented such that its core functionality can be built independently of ROOT and exposed via Python bindings

We should find people interested in pursuing and completing these critical tasks.

FYI, there's also the per-interpreter GIL being introduced in CPython 3.12. That would allow the launching of sub-interpreters each with their own GIL, but without creating separate processes. It doesn't have a Python API in 3.12, but there will be a PyPI package allowing this to be used from Python code. (The current draft of that module is at https://pypi.org/project/interpreters-3-12/)
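For reference, a minimal sketch of what driving sub-interpreters from threads could look like, assuming CPython 3.12 and its low-level private _xxsubinterpreters module (which the PyPI package above wraps); the names and signatures in the published package may well differ:

```python
# Sketch only: the private low-level module in CPython 3.12; the public
# interpreters-3-12 package exposes a friendlier API on top of it.
import threading

import _xxsubinterpreters as subinterpreters

SCRIPT = """
# Pure-Python work; with a per-interpreter GIL this can run in parallel
# across threads without spawning separate processes.
total = sum(i * i for i in range(1_000_000))
"""


def run_in_own_interpreter():
    interp_id = subinterpreters.create()  # isolated interpreter, own GIL in 3.12
    try:
        subinterpreters.run_string(interp_id, SCRIPT)
    finally:
        subinterpreters.destroy(interp_id)


threads = [threading.Thread(target=run_in_own_interpreter) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```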

Don't know if that changes anything here, but something to keep in mind.

Thanks - that's good to know, but we'll need to deal with people using earlier interpreter versions for quite some time (basically until Numba supports Python 3.12).

I've been in favor of a compiled-but-Python-friendly Uproot for some time, but it's always been too large of a task—this will require dedicated effort and coordination (because I'm assuming more than one developer).

Some questions to ask about such a thing:

  • Perhaps the compiled language should be Julia: UnROOT.jl already exists. Can its Python bindings be developed more?
  • For common use-cases, precompiled is better, and scientific-python/cookie gives us the options of Scikit-Build/pybind11 for C++ and maturin for Rust.
  • We also shouldn't disregard the possibility of doing it in Numba, since that can be partially compiled, partially not, and it has more affinity with Python types, as well as prior expertise among likely developers. In terms of JIT technology, it's no better or worse than the Julia option (it's all LLVM).

The main difference between these three options is which people you want to, or are able to, bring together for this. Option 1 pulls Julia developers more into the Coffea world, option 2 is for people who like blank pages, starting from scratch¹, and option 3 is for pulling it together quickly with the Python + Numba expertise that's already in this area.
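For a rough feel of option 1 as it stands today, here is a sketch that drives UnROOT.jl from Python through the juliacall bridge; it assumes UnROOT.jl and juliacall are installed, and the file, tree, and branch names are hypothetical:

```python
# Sketch of option 1 via juliacall (PythonCall.jl's Python-side package).
# Assumptions: Julia with UnROOT.jl installed, and a hypothetical file
# "nano.root" containing a TTree "Events".
from juliacall import Main as jl

jl.seval("using UnROOT")

f = jl.ROOTFile("nano.root")                   # UnROOT.jl file handle
tree = jl.LazyTree(f, "Events", ["Muon_pt"])   # lazily-read branch selection

# Columns come back as wrapped Julia arrays; whether the Julia-side reads
# can proceed without holding the Python GIL is the open question here.
pts = tree.Muon_pt
```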

Footnotes

  1. Unfortunately, I'm one of those people who likes to start things from scratch, and the Rust option appeals to me. But it's more important to pull together things that already have some momentum. If the end result of this is that the Python and Julia HEP tools get more interchangeable, that's probably the best long-term win.

Oh, I forgot one (or two) more bullet points:

  • Attempt to do TTree versus jumping right to RNTuple? Or maybe
  • Only cover NanoAOD-like TTrees ("-like" means primitives and dynamically sized arrays of primitives, sketched below) and RNTuple?
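To make "NanoAOD-like" concrete, a small sketch of the data model it covers: flat primitives plus dynamically sized (jagged) arrays of primitives, written here as an Awkward Array with made-up field names:

```python
# Illustration of "primitives and dynamically sized arrays of primitives";
# the field names are made up.
import awkward as ak

events = ak.Array(
    [
        {"run": 1, "nMuon": 2, "Muon_pt": [31.2, 12.0]},
        {"run": 1, "nMuon": 0, "Muon_pt": []},
        {"run": 1, "nMuon": 1, "Muon_pt": [54.9]},
    ]
)

print(events.type)  # e.g. 3 * {run: int64, nMuon: int64, Muon_pt: var * float64}
```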

If UnROOT can drop the GIL, then we're mostly good, FWIW.

@tamasgal and @Moelf (Jerry will be attending): we should learn more about the scope of UnROOT's reading (and writing?) capabilities—what data types does it cover?—and how easy it would be to use it in Python. Can we, for instance, read NanoAOD-like TTrees into Awkward Arrays, possibly through Arrow, in a process controlled by Python?
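Concretely, the Arrow hand-off being asked about already exists on the Python side; here is a sketch with today's uproot and Awkward (the file and branch names are hypothetical), where a Julia reader producing Arrow batches could slot into the same path:

```python
# Sketch of the Awkward <-> Arrow hand-off in today's Python stack.
# Assumptions: a hypothetical file "nanoaod.root" with a TTree "Events".
import awkward as ak
import uproot

events = uproot.open("nanoaod.root")["Events"]
muons = events.arrays(["Muon_pt", "Muon_eta"])   # Awkward record array

# Awkward <-> Arrow conversions are zero-copy where the layouts allow it,
# so any reader that emits Arrow batches can feed the same downstream code.
table = ak.to_arrow_table(muons)
roundtrip = ak.from_arrow(table)
```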

I talked this out a bit with @Moelf at CHEP and at zeroth order it seems possible but we both had a lot of questions about GIL-friendliness.

I still think that by the time something is worked out, 3.12 will be out, a 3.12-compatible Numba will probably be out, and you might be able to solve this with current uproot + interpreters-3-12, without rewriting much of anything. It might at least be worth testing with interpreters-3-12 and a 3.12 beta now (assuming you could make an interesting test without Numba and maybe NumPy).

RNTuple is being implemented such that its core functionality can be built independently of ROOT and exposed via Python bindings

From my limited personal experience with the people involved, I don't think this is happening soon, and regardless, a ~librntupleio.so won't come with writing capability, so I think we'd better just roll our own.


Regarding what UnROOT.jl can deliver: technology-wise, I am optimistic about covering ~100% of reading (at least for the features that currently exist in the RNTuple spec).¹

  • Having uproot read RNTuple and then send Arrow batches to Julia is viable (it requires one extra allocation, which is no big deal if the computation is non-trivial)
  • UnROOT.jl doesn't have a function for writing .root files and lacks the infrastructure (for chunk and TKey allocation, etc.) that uproot has.

From the analysis-adjacent user perspective, once we move to RNTuple (which will be ~100% compatible with Arrow, logically speaking), I see little need for writing out to .root files if the output flows downstream; in fact, there is a huge Arrow ecosystem² ³ that people can leverage if they do that.

Footnotes

  1. We already deal with complex RNTuple schemas and with NanoAOD converted using ROOT

  2. https://arrow.apache.org/blog/2023/06/26/our-journey-at-f5-with-apache-arrow-part-2/

  3. https://arrow.apache.org/docs/python/api/cuda.html

Yeah - just switching to Parquet / Feather after reading the files in is perfectly viable IMO.
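For instance, something like this sketch, with hypothetical file and branch names, covers the read-then-convert step with today's tools:

```python
# Sketch: read with uproot, then hand off to Parquet/Feather via Arrow.
# Assumptions: a hypothetical file "nanoaod.root" with a TTree "Events".
import awkward as ak
import pyarrow.feather as feather
import uproot

arrays = uproot.open("nanoaod.root")["Events"].arrays(["Muon_pt", "Muon_eta"])

ak.to_parquet(arrays, "muons.parquet")                              # Parquet
feather.write_feather(ak.to_arrow_table(arrays), "muons.feather")   # Feather
```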

It's just a familiarity thing (and a convenience thing); people love TBrowser.