scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.

Home Page: https://uproot.readthedocs.io


Issue with parallelization for 5.2.2

torresramiro350 opened this issue

In uproot 5.1.2, I was able to parallelize the reading of ROOT files with the example below. With the new release, uproot 5.2.2, the same code fails with this error:

AttributeError("'FSSpecSource' object has no attribute '_fh'")

import numpy as np
import uproot
import yaml
from numpy.typing import NDArray

from dask.base import compute
from dask.delayed import delayed
from dask.distributed import Client

cli = Client(n_workers=5)

@delayed
def get_array(bin_id: str, inmap: uproot.ReadOnlyDirectory) -> tuple[str, NDArray[np.float64]]:
    outdata_bin = inmap[f"nHit{bin_id.zfill(2)}/data/count"].array().to_numpy()
    return bin_id, outdata_bin

@delayed
def get_bkg(bin_id: str, inmap: uproot.ReadOnlyDirectory) -> tuple[str, NDArray[np.float64]]:
    outdata_bin = inmap[f"nHit{bin_id.zfill(2)}/bkg/count"].array().to_numpy()
    return bin_id, outdata_bin

maptree = "./maptree.root"

with uproot.open(maptree) as inmap:
    analysis_bin_names = inmap["BinInfo/name"].array().to_numpy().astype(str)

    tasks = [get_array(bin_id, inmap) for bin_id in analysis_bin_names]
    tasks_bkg = [get_bkg(bin_id, inmap) for bin_id in analysis_bin_names]

    counts_info = compute(*tasks)
    bkg_info = compute(*tasks_bkg)
    data = dict(counts_info)
    bkg = dict(bkg_info)
cli.close()

An example of the file can be retrieved here: https://data.hawc-observatory.org/datasets/geminga2017/geminga2017-download/maptree.root

Additionally, with the new release, parallel reading of larger ROOT files (over 1 GB) that used to take less than 1 minute now runs for more than 3 minutes.

One way to get an AttributeError when introducing a new default like FSSpecSource is for calling code to expect an attribute on all Sources when the new Source doesn't have it. That's not what happened here: _fh is only on FSSpecSource, so it's an implementation detail of that particular class. For the attribute to be missing, there must be more than one way to create an instance of the class, and the case that broke for you, @torresramiro350, is one that took an alternative path that failed to create the attribute.

Reading the code, it looks like _fh is a transient file handle, and it's allowed to be None. (That's its initial state in __init__.) In fact, it looks like it's missing from __setstate__, which reconstitutes an object during unpickling:

def __getstate__(self):
    self._fh = None
    state = dict(self.__dict__)
    state.pop("_executor")
    state.pop("_file")
    state.pop("_fh")
    return state

def __setstate__(self, state):
    self.__dict__ = state
    self._open()
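
To make the failure mode concrete, here is a minimal sketch of how dropping an attribute in __getstate__ without restoring it in __setstate__ leads to this kind of AttributeError. ToySource, its read method, and the placeholder strings are hypothetical stand-ins for illustration, not uproot's actual FSSpecSource. The key point is that unpickling never calls __init__, so anything __init__ set up and __getstate__ removed is simply absent afterwards.

```python
import pickle


class ToySource:
    """Hypothetical stand-in for a file source; not uproot's actual class."""

    def __init__(self, path):
        self._path = path
        self._fh = None    # transient handle, initially None
        self._file = None
        self._open()

    def _open(self):
        # Stand-in for opening the real file.
        self._file = f"<open {self._path}>"

    def __getstate__(self):
        # Drop the transient attributes before pickling.
        state = dict(self.__dict__)
        state.pop("_file")
        state.pop("_fh")
        return state

    def __setstate__(self, state):
        # __init__ is not called during unpickling, so _fh is never recreated.
        self.__dict__ = state
        self._open()

    def read(self):
        # On an unpickled copy, this attribute access raises AttributeError.
        if self._fh is None:
            self._fh = f"<handle {self._path}>"
        return self._fh


original = ToySource("maptree.root")
copy = pickle.loads(pickle.dumps(original))

try:
    copy.read()
    error = None
except AttributeError as err:
    error = str(err)

print(error)
```

This would also explain why the problem only shows up with dask.distributed: arguments passed to tasks are serialized before being sent to the workers, so the workers only ever see the unpickled copy.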

Since _fh and _file are both introduced with an initial value of None by __init__ (just before _open), it seems like they ought to be reintroduced by __setstate__, so I did that in #1118.

I have not tested it, so let me know, @torresramiro350, if that's right.

@lobis, is this correct? I'll add you as a reviewer to the PR so that you can weigh in there, but do you see anything else that might be a problem here?

Introducing self._fh = None in __setstate__ does seem to fix the issue.
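
For anyone reading along, the shape of that fix can be sketched on a toy class. FixedToySource and its placeholder strings are hypothetical and only mirror the pattern described in this thread (the actual change is in #1118): __setstate__ reintroduces the transient attributes with the same initial values __init__ gives them, then reopens.

```python
import pickle


class FixedToySource:
    """Hypothetical stand-in for a file source; not uproot's actual class."""

    def __init__(self, path):
        self._path = path
        self._fh = None
        self._file = None
        self._open()

    def _open(self):
        self._file = f"<open {self._path}>"  # stand-in for a real open()

    def __getstate__(self):
        # Transient attributes are not pickled.
        state = dict(self.__dict__)
        state.pop("_file")
        state.pop("_fh")
        return state

    def __setstate__(self, state):
        self.__dict__ = state
        # Reintroduce the transient attributes exactly as __init__ does.
        self._file = None
        self._fh = None
        self._open()


restored = pickle.loads(pickle.dumps(FixedToySource("maptree.root")))
print(restored._fh, restored._file)
```

After the round trip, the copy has every attribute a freshly constructed object would have, so code that checks whether _fh is None keeps working on the workers.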

When @lobis approves it, I'll merge it. Thanks for testing!

I also tried running the example above with uproot 5.2.2, and reading a ROOT file of about 1 GB seems to take considerably longer than in version 5.1.2. I'll try to generate a file with fake numbers that can help diagnose it further.

I realize that will have to be on another issue. Thanks for the help!

That should be a different issue. I've been hearing a few things about local file access being slower; maybe we should revert to MemoryMappedSource for local files and use fsspec only for remote files...
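
As a rough sketch of what "memory-map local files, fsspec for remote files" could look like as a selection rule: the helper below is hypothetical (choose_source_kind and its return strings are not part of uproot) and only illustrates dispatching on the URL scheme. It deliberately ignores uproot's "file.root:object" path suffix, which would need to be split off first.

```python
from urllib.parse import urlparse


def choose_source_kind(path: str) -> str:
    """Hypothetical helper: "memmap" for local paths, "fsspec" otherwise."""
    scheme = urlparse(path).scheme
    # No scheme, an explicit file:// URL, or a one-letter Windows drive ("C:")
    # all point at the local filesystem.
    if scheme in ("", "file") or len(scheme) == 1:
        return "memmap"
    # Anything else (https://, root://, s3://, ...) is treated as remote.
    return "fsspec"


for example in ("./maptree.root", "https://example.org/maptree.root"):
    print(example, "->", choose_source_kind(example))
```

In the meantime, a user who wants the memory-mapped behavior for a local file may be able to request it explicitly when opening the file. I believe recent uproot versions accept an option for choosing the source class, but treat that as an assumption and check the uproot.open documentation for your version.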

I linked this issue to the PR after merging the PR. Closing the issue manually...