Issue with parallelization for 5.2.2
torresramiro350 opened this issue · comments
In uproot 5.1.2, I was able to parallelize the reading of ROOT files with the following example below. However, with the new release uproot 5.2.2
, I get this error:
AttributeError("'FSSpecSource' object has no attribute '_fh'")
import numpy as np
import uproot
import yaml
from numpy.typing import NDArray
from dask.base import compute
from dask.delayed import delayed
from dask.distributed import Client
cli = Client(n_workers=5)
@delayed
def get_array( bin_id: str, inmap: uproot.ReadOnlyDirectory) -> tuple[str, NDArray[np.float64]]:
outdata_bin = inmap[f"nHit{bin_id.zfill(2)}/data/count"].array().to_numpy()
return bin_id, outdata_bin
@delayed
def get_bkg( bin_id: str, inmap: uproot.ReadOnlyDirectory) -> tuple[str, NDArray[np.float64]]:
outdata_bin = inmap[f"nHit{bin_id.zfill(2)}/bkg/count"].array().to_numpy()
return bin_id, outdata_bin
maptree="./maptree.root"
with uproot.open(maptree) as inmap:
analysis_bin_names = inmap["BinInfo/name"].array().to_numpy().astype(str)
tasks = [get_array(bin_id, inmap) for bin_id in analysis_bin_names]
tasks_bkg = [get_bkg(bin_id, inmap) for bin_id in analysis_bin_names]
counts_info = compute(*tasks)
bkg_info = compute(*tasks_bkg)
data = dict(counts_info)
bkg = dict(bkg_info)
cli.close()
An example of the file can be retrieved here: https://data.hawc-observatory.org/datasets/geminga2017/geminga2017-download/maptree.root
Additionally, now with the new release I try to parallelize reading larger of +1GB ROOT files and what it used to take < 1 minute now runs for +3 minutes.
One way we can get AttributeErrors when introducing a new default like FSSpecSource is by expecting it on all Sources but the new Source doesn't have it. That's not what happened here: _fh
is only on FSSpecSource, so it's an implementation detail of that particular class. For the attribute to be missing, there must be multiple ways to create the class, and the case that broke for you, @torresramiro350, is one that took an alternative that failed to make the attribute.
Reading the code, it looks like _fh
is a transient file handle, and it's allowed to be None
. (That's its initial state in __init__
.) In fact, it looks like it's missing from __setstate__
, which reconstitutes an object during unpickling:
uproot5/src/uproot/source/fsspec.py
Lines 63 to 73 in 1cee960
Since _fh
and _file
are both introduced with an initial value of None
by __init__
(just before _open
), it seems like they ought to be reintroduced by __setstate__
, so I did that in #1118.
I have not tested it, so let me know, @torresramiro350, if that's right.
@lobis, is this correct? I'll add you as a reviewer to the PR so that you can weigh in there, but do you see anything else that might be a problem here?
Introducing self._fh = None
in __setstate__
does seem to fix the issue.
When @lobis approves it, I'll merge it. Thanks for testing!
I also tried running the example above with uproot 5.2.2 and it seems to take considerably longer to read a ROOT file of ~ 1GB than in version 5.1.2. I'll try to generate a file with fake numbers that can somehow help to diagnose it further.
I realize that will have to be on another issue. Thanks for the help!
That should be a different issue. I've been hearing a few things about local file access being slower; maybe we should revert to MemoryMappedSource for local files and use fsspec only for remote files...
I linked this issue to the PR after merging the PR. Closing the issue manually...