scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.

Home Page: https://uproot.readthedocs.io

`uproot.dask` fails for TTrees with duplicate TBranch names

jpivarski opened this issue

Posted by @nsmith- on Slack:

>>> import uproot, skhep_testdata
>>> tree = uproot.open(skhep_testdata.data_path("uproot-metadata-performance.root"))["Events"]
>>> seen = set()
>>> for x in tree.branches:
...     if x.name in seen:
...         print(x.name, x.typename)
...     seen.add(x.name)
... 
FatJet_btagDDBvLV2 float[]
FatJet_btagDDCvBV2 float[]
FatJet_btagDDCvLV2 float[]
FatJet_nBHadrons uint8_t[]
FatJet_nCHadrons uint8_t[]
SubJet_nBHadrons uint8_t[]
SubJet_nCHadrons uint8_t[]

>>> lazy = uproot.dask("uproot-metadata-performance.root:Events")
>>> lazy.FatJet_btagDDBvLV2
dask.awkward<FatJet, npartitions=1>

>>> materialized = lazy.FatJet_btagDDBvLV2.compute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/mambaforge/lib/python3.10/site-packages/dask/base.py", line 375, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/jpivarski/mambaforge/lib/python3.10/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 1232, in __call__
    result, _ = self._call_impl(i, start, stop)
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 1236, in _call_impl
    return self.read_tree(
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 1008, in read_tree
    keys_for_buffer = self.form_mapping_info.keys_for_buffer_keys(
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 899, in keys_for_buffer_keys
    keys.add(self._form_key_to_key[form_key])
KeyError: 'None'

The following file:TTree pairs in scikit-hep-testdata have duplicate TBranch names:

  • uproot-issue404.root:Event
  • uproot-issue494.root:Event
  • uproot-issue513.root:Delphes
  • uproot-issue399.root:Event
  • uproot-issue403.root:Event
  • uproot-issue371.root:Event
  • uproot-issue443.root:muonDataTree
  • uproot-metadata-performance.root:Events
  • uproot-issue468.root:Event

Take your pick!
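
To check any other tree for the same condition, the loop from the session above can be condensed into a small helper. This is only a sketch: `duplicate_names` is a hypothetical name, not part of uproot, and it works on a plain list of strings. With an open tree, the input would be `[b.name for b in tree.branches]`.

```python
from collections import Counter


def duplicate_names(names):
    """Return {name: count} for every name that occurs more than once."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


# Plain strings standing in for TBranch names taken from a tree.
example = ["FatJet_nBHadrons", "SubJet_nBHadrons", "FatJet_nBHadrons"]
print(duplicate_names(example))
```

An empty result means every TBranch name in the list is unique.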

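A workaround is to pass a `filter_branch` function to `uproot.dask` that keeps only the first TBranch with each name, e.g. with

import uproot

seen = set()

def keep_first(branch):
    # Reject any TBranch whose name has already been seen.
    if branch.name in seen:
        print(f"Duplicate branch: {branch.name}")
        return False
    seen.add(branch.name)
    return True

lazy = uproot.dask("uproot-metadata-performance.root:Events", filter_branch=keep_first)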
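
One caveat with the workaround above is that `seen` lives at module level, so reusing the same function for a second tree would reject every name already recorded from the first. A possible variation, sketched here under assumptions, builds a fresh predicate with its own private set each time. The name `make_first_occurrence_filter` is hypothetical, and the example uses stand-in objects with a `.name` attribute rather than real TBranches so that it runs without a ROOT file.

```python
from types import SimpleNamespace


def make_first_occurrence_filter():
    """Return a new predicate that accepts each branch name only once."""
    seen = set()  # private to this predicate, so separate filters do not interfere

    def keep(branch):
        if branch.name in seen:
            return False
        seen.add(branch.name)
        return True

    return keep


# Stand-ins for TBranch objects; only the .name attribute is used.
fake_branches = [
    SimpleNamespace(name=n)
    for n in ["FatJet_btagDDBvLV2", "FatJet_nBHadrons", "FatJet_nBHadrons"]
]

keep = make_first_occurrence_filter()
kept = [b.name for b in fake_branches if keep(b)]
print(kept)
```

With uproot, the predicate would be passed as `filter_branch=make_first_occurrence_filter()`. This assumes the predicate is evaluated once per TBranch. If it were evaluated more than once for the same tree, a stateful filter like this one would reject branches on the later passes.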