HSF / PyHEP.dev-workshops

PyHEP Developer workshops

Home Page: https://indico.cern.ch/e/PyHEP2023.dev


Python bindings for RNTuple, implementation of "uproot-cpp"

lgray opened this issue

Presently, the pure-Python ROOT I/O implementation, uproot, is an extremely effective tool connecting the ROOT file format to the data-science and wider scientific Python ecosystems.

However, uproot performs many heavy GIL-bound computations that quickly limit its scaling in multithreaded environments, where we want multiple data streams feeding downstream processing code. This forbids interesting compute topologies like large thread-reentrant histogram filling and imposes the small tax of needing to spawn processes, each with its own Python interpreter (as opposed to threads sharing a single interpreter), to achieve parallel data processing.
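For a sense of what that tax looks like in practice, here is a minimal sketch comparing a thread pool to a process pool for reading branches with uproot; the file name and branch names are hypothetical:

```python
# Minimal sketch of the thread-vs-process trade-off described above.
# Assumptions: a file "events.root" containing a TTree "Events" with the
# branches below -- all names are hypothetical.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

import uproot

BRANCHES = ["Muon_pt", "Muon_eta", "Muon_phi"]


def read_branch(branch):
    # Much of uproot's decompression and array assembly is GIL-bound
    # Python, so threads largely serialize in this call.
    with uproot.open("events.root") as f:
        return f["Events"][branch].array()


if __name__ == "__main__":
    # Threads share one interpreter but contend for the GIL ...
    with ThreadPoolExecutor(max_workers=len(BRANCHES)) as pool:
        thread_results = list(pool.map(read_branch, BRANCHES))

    # ... which is why parallel reading ends up paying for separate
    # processes, each with its own interpreter.
    with ProcessPoolExecutor(max_workers=len(BRANCHES)) as pool:
        process_results = list(pool.map(read_branch, BRANCHES))
```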

Looking to the future: with RNTuple, Feather (which already has a Python-bound C++ implementation for this reason), and other similar high-throughput formats, it seems prudent to develop GIL-friendly Python packages for these HEP-specific data sources.

  • Achieving this would require a whole new implementation of uproot (perhaps focusing only on array I/O at first) with a Cython or C(++) backend
  • RNTuple is being implemented such that its core functionality can be built independently of ROOT and exposed via Python bindings

We should find people interested in pursuing and completing these critical tasks.

FYI, there's also the per-interpreter GIL being introduced in CPython 3.12. That would allow the launching of sub-interpreters each with their own GIL, but without creating separate processes. It doesn't have a Python API in 3.12, but there will be a PyPI package allowing this to be used from Python code. (The current draft of that module is at https://pypi.org/project/interpreters-3-12/)
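For reference, a minimal sketch of what driving sub-interpreters from threads could look like, assuming CPython 3.12 and its low-level private _xxsubinterpreters module (which the PyPI package above wraps); the names and signatures in the published package may well differ:

```python
# Sketch only: the private low-level module in CPython 3.12; the public
# interpreters-3-12 package exposes a friendlier API on top of it.
import threading

import _xxsubinterpreters as subinterpreters

SCRIPT = """
# Pure-Python work; with a per-interpreter GIL this can run in parallel
# across threads without spawning separate processes.
total = sum(i * i for i in range(1_000_000))
"""


def run_in_own_interpreter():
    interp_id = subinterpreters.create()  # isolated interpreter, own GIL in 3.12
    try:
        subinterpreters.run_string(interp_id, SCRIPT)
    finally:
        subinterpreters.destroy(interp_id)


threads = [threading.Thread(target=run_in_own_interpreter) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```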

Don't know if that changes anything here, but something to keep in mind.

Thanks - that's good to know, but we'll need to deal with people using earlier interpreter versions for quite some time (basically until Numba supports Python 3.12).

I've been in favor of a compiled-but-Python-friendly Uproot for some time, but it's always been too large of a task—this will require dedicated effort and coordination (because I'm assuming more than one developer).

Some questions to ask about such a thing:

  • Perhaps the compiled language should be Julia: UnROOT.jl already exists. Can its Python bindings be developed more?
  • For common use-cases, precompiled is better, and scientific-python/cookie gives us the options of Scikit-Build/pybind11 for C++ and maturin for Rust.
  • We also shouldn't disregard the possibility of doing it in Numba, since that can be partially compiled, partially not, and it has more affinity with Python types, as well as prior expertise among likely developers. In terms of JIT technology, it's no better or worse than the Julia option (it's all LLVM).

The main difference between these three options is which people you want to, or are able to, bring together for this. Option 1 pulls Julia developers more into the Coffea world, option 2 is for people who like blank pages, starting from scratch¹, and option 3 is for pulling it together quickly with the Python + Numba expertise that's already in this area.
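For a rough feel of option 1 as it stands today, here is a sketch that drives UnROOT.jl from Python through the juliacall bridge; it assumes UnROOT.jl and juliacall are installed, and the file, tree, and branch names are hypothetical:

```python
# Sketch of option 1 via juliacall (PythonCall.jl's Python-side package).
# Assumptions: Julia with UnROOT.jl installed, and a hypothetical file
# "nano.root" containing a TTree "Events".
from juliacall import Main as jl

jl.seval("using UnROOT")

f = jl.ROOTFile("nano.root")                   # UnROOT.jl file handle
tree = jl.LazyTree(f, "Events", ["Muon_pt"])   # lazily-read branch selection

# Columns come back as wrapped Julia arrays; whether the Julia-side reads
# can proceed without holding the Python GIL is the open question here.
pts = tree.Muon_pt
```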

Footnotes

  1. Unfortunately, I'm one of those people who likes to start things from scratch, and the Rust option appeals to me. But it's more important to pull together things that already have some momentum. If the end result of this is that the Python and Julia HEP tools get more interchangeable, that's probably the best long-term win.

Oh, I forgot one (or two) more bullet points:

  • Attempt to do TTree versus jumping right to RNTuple? Or maybe
  • Only cover NanoAOD-like TTrees ("-like" means primitives and dynamically sized arrays of primitives, sketched below) and RNTuple?
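To make "NanoAOD-like" concrete, a small sketch of the data model it covers: flat primitives plus dynamically sized (jagged) arrays of primitives, written here as an Awkward Array with made-up field names:

```python
# Illustration of "primitives and dynamically sized arrays of primitives";
# the field names are made up.
import awkward as ak

events = ak.Array(
    [
        {"run": 1, "nMuon": 2, "Muon_pt": [31.2, 12.0]},
        {"run": 1, "nMuon": 0, "Muon_pt": []},
        {"run": 1, "nMuon": 1, "Muon_pt": [54.9]},
    ]
)

print(events.type)  # e.g. 3 * {run: int64, nMuon: int64, Muon_pt: var * float64}
```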

If UnROOT can drop the GIL, then we're mostly good, FWIW.

@tamasgal and @Moelf (Jerry will be attending): we should learn more about the scope of UnROOT's reading (and writing?) capabilities—what data types does it cover?—and how easy it would be to use it in Python. Can we, for instance, read NanoAOD-like TTrees into Awkward Arrays, possibly through Arrow, in a process controlled by Python?
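Concretely, the Arrow hand-off being asked about already exists on the Python side; here is a sketch with today's uproot and Awkward (the file and branch names are hypothetical), where a Julia reader producing Arrow batches could slot into the same path:

```python
# Sketch of the Awkward <-> Arrow hand-off in today's Python stack.
# Assumptions: a hypothetical file "nanoaod.root" with a TTree "Events".
import awkward as ak
import uproot

events = uproot.open("nanoaod.root")["Events"]
muons = events.arrays(["Muon_pt", "Muon_eta"])   # Awkward record array

# Awkward <-> Arrow conversions are zero-copy where the layouts allow it,
# so any reader that emits Arrow batches can feed the same downstream code.
table = ak.to_arrow_table(muons)
roundtrip = ak.from_arrow(table)
```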

I talked this out a bit with @Moelf at CHEP and at zeroth order it seems possible but we both had a lot of questions about GIL-friendliness.

I still think that by the time something is worked out, 3.12 will be out, a 3.12-compatible Numba will probably be out, and you might be able to solve this with current uproot + interpreters-3-12, without rewriting much of anything. It might at least be worth testing with interpreters-3-12 and a 3.12 beta now (assuming you could make an interesting test without Numba and maybe NumPy).

RNTuple is being implemented such that its core functionality can be built independently of ROOT and exposed via Python bindings

From my limited personal experience with the people involved, I don't think this is happening soon, and regardless, a ~librntupleio.so won't come with writing capability, so I think we'd better just roll our own.


Regarding what UnROOT.jl can deliver: technology-wise, I am optimistic about covering ~100% of reading (at least for the features that currently exist in the RNTuple spec).¹

  • Having uproot read RNTuple and then send Arrow batches to Julia is viable (it requires one extra allocation, which is no big deal if the computation is non-trivial)
  • UnROOT.jl doesn't have a function for writing .root files and lacks the infrastructure (for chunk and TKey allocation, etc.) that uproot has.

From the analysis-adjacent user perspective, once we move to RNTuple (which will be ~100% compatible with Arrow, logically speaking), I see little need for writing out to .root files if the output flows downstream; in fact, there is a huge Arrow ecosystem² ³ that people can leverage if they do that.

Footnotes

  1. We already deal with complex RNTuple schemas and with NanoAOD converted using ROOT

  2. https://arrow.apache.org/blog/2023/06/26/our-journey-at-f5-with-apache-arrow-part-2/

  3. https://arrow.apache.org/docs/python/api/cuda.html

Yeah - just switching to Parquet / Feather after reading the files in is perfectly viable IMO.
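For instance, something like this sketch, with hypothetical file and branch names, covers the read-then-convert step with today's tools:

```python
# Sketch: read with uproot, then hand off to Parquet/Feather via Arrow.
# Assumptions: a hypothetical file "nanoaod.root" with a TTree "Events".
import awkward as ak
import pyarrow.feather as feather
import uproot

arrays = uproot.open("nanoaod.root")["Events"].arrays(["Muon_pt", "Muon_eta"])

ak.to_parquet(arrays, "muons.parquet")                              # Parquet
feather.write_feather(ak.to_arrow_table(arrays), "muons.feather")   # Feather
```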

It's just a familiarity thing (and a convenience thing); people love TBrowser.