`uproot.dask` fails for TTrees with duplicate TBranch names
jpivarski opened this issue
Jim Pivarski commented
Posted by @nsmith- on Slack:
>>> import uproot, skhep_testdata
>>> tree = uproot.open(skhep_testdata.data_path("uproot-metadata-performance.root"))["Events"]
>>> seen = set()
>>> for x in tree.branches:
...     if x.name in seen:
...         print(x.name, x.typename)
...     seen.add(x.name)
...
FatJet_btagDDBvLV2 float[]
FatJet_btagDDCvBV2 float[]
FatJet_btagDDCvLV2 float[]
FatJet_nBHadrons uint8_t[]
FatJet_nCHadrons uint8_t[]
SubJet_nBHadrons uint8_t[]
SubJet_nCHadrons uint8_t[]
>>> lazy = uproot.dask("uproot-metadata-performance.root:Events")
>>> lazy.FatJet_btagDDBvLV2
dask.awkward<FatJet, npartitions=1>
>>> materialized = lazy.FatJet_btagDDBvLV2.compute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/mambaforge/lib/python3.10/site-packages/dask/base.py", line 375, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/jpivarski/mambaforge/lib/python3.10/site-packages/dask/base.py", line 661, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 1232, in __call__
    result, _ = self._call_impl(i, start, stop)
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 1236, in _call_impl
    return self.read_tree(
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 1008, in read_tree
    keys_for_buffer = self.form_mapping_info.keys_for_buffer_keys(
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 899, in keys_for_buffer_keys
    keys.add(self._form_key_to_key[form_key])
KeyError: 'None'
The following file/TTrees in scikit-hep-testdata have duplicate TBranch names:
- uproot-issue404.root:Event
- uproot-issue494.root:Event
- uproot-issue513.root:Delphes
- uproot-issue399.root:Event
- uproot-issue403.root:Event
- uproot-issue371.root:Event
- uproot-issue443.root:muonDataTree
- uproot-metadata-performance.root:Events
- uproot-issue468.root:Event
Take your pick!
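A list like the one above can be produced by counting branch names per tree. Here is a minimal sketch, assuming only that each branch object exposes a `.name` attribute as in the session above. The helper name `duplicate_branch_names` is hypothetical, and the stand-in objects replace real TBranch objects so the example runs without a ROOT file.

```python
from collections import Counter
from types import SimpleNamespace


def duplicate_branch_names(branches):
    """Return the names that occur more than once, in order of first appearance."""
    counts = Counter(b.name for b in branches)
    # Counter keeps insertion order, so the result follows the tree's branch order.
    return [name for name, n in counts.items() if n > 1]


# Stand-ins for TBranch objects (hypothetical; only .name is used).
fake_branches = [
    SimpleNamespace(name=n)
    for n in ["FatJet_pt", "FatJet_nBHadrons", "SubJet_pt", "FatJet_nBHadrons"]
]
duplicates = duplicate_branch_names(fake_branches)
print(duplicates)
```

With uproot installed, the same helper should work on a real tree as `duplicate_branch_names(tree.branches)`.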
Nicholas Smith commented
A workaround is to use the `filter_branch` argument to `uproot.dask`, e.g. with a filter that skips any branch whose name has already been seen:
seen = set()
def filter(branch):
    if branch.name in seen:
        print(f"Duplicate branch: {branch.name}")
        return False
    seen.add(branch.name)
    return True
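The filter above keeps its state in a module-level `seen` set, so it has to be reset before reuse, and it would reject every branch if it were called a second time on the same tree. Below is a sketch of a variant that avoids both issues by building a fresh closure per call and comparing object identity. The factory name `make_first_occurrence_filter` is hypothetical, and the stand-in objects only mimic the `.name` attribute of real TBranch objects. How often `uproot.dask` invokes the filter per branch, and how it behaves across several files, is not verified here.

```python
from types import SimpleNamespace


def make_first_occurrence_filter():
    """Return a callable that keeps only the first branch object seen for each name."""
    first_by_name = {}

    def keep(branch):
        # setdefault stores the branch only if its name is new,
        # then returns whichever branch is stored under that name.
        return first_by_name.setdefault(branch.name, branch) is branch

    return keep


# Stand-ins for TBranch objects (hypothetical; only .name is used).
branches = [
    SimpleNamespace(name=n)
    for n in ["FatJet_pt", "FatJet_nBHadrons", "FatJet_nBHadrons"]
]
keep = make_first_occurrence_filter()
kept = [b.name for b in branches if keep(b)]
print(kept)

# With uproot installed, the callable would be passed as:
# lazy = uproot.dask("uproot-metadata-performance.root:Events",
#                    filter_branch=make_first_occurrence_filter())
```

Because the check is by identity, asking about the same branch object twice gives the same answer, while a later branch with the same name is still rejected.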