scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.

Home Page: https://uproot.readthedocs.io

Performance degradation reading files via root://

chrisburr opened this issue · comments

Since the move to using fsspec-xrootd, the performance of reading files via XRootD has dropped. For example, here is a script that pulls a few bytes of metadata from an LHCb file:

uproot-perf.py

#!/usr/bin/env python
import argparse
import time

import uproot
import uproot.source.xrootd


def run_test(pfn, handler):
    f = uproot.open(pfn, handler=handler)

    start = time.perf_counter()
    f["Refs"]["Params"].array()
    print("Took", time.perf_counter() - start, "seconds")


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("pfn")
    parser.add_argument("handler", choices=["default", "xrootd"])
    args = parser.parse_args()

    handler = {
        "default": None,
        "xrootd": uproot.source.xrootd.XRootDSource,
    }[args.handler]
    run_test(args.pfn, handler)


if __name__ == "__main__":
    parse_args()

# Use some LHCb open data as an example
$ export pfn='root://eospublic.cern.ch//eos/opendata/lhcb/Collision12/CHARM/LHCb_2012_Beam4000GeV_VeloClosed_MagDown_RealData_Reco14_Stripping21_CHARM_MDST/00041836/0000/00041836_00009718_1.charm.mdst'

# Using the default
$ ./uproot-perf.py $pfn default
Took 0.17654120799999995 seconds

# Using the old XRootDSource
$ ./uproot-perf.py $pfn xrootd                                                                                                                                                                    
Took 0.07812191700000004 seconds

The effect is less obvious for me with eospublic as I'm only a few milliseconds away from it. If I use data stored further away from me, I get 0.69s vs 2.7s.

I can also see the overhead when pulling larger amounts of data, though it's a smaller effect.

I suspect this is because "vector reads" are no longer being used, so each read pays the round-trip latency and overhead.

From a bit more debugging, it looks like it's reading far more data than it needs to; see the size: parts of the logs below:

$ XRD_LOGLEVEL=Debug python /tmp/uproot-perf.py $pfn default 2>&1 | rg '(read command|Took )' -A 1
[2024-03-03 14:56:23.136216 +0100][Debug  ][File              ] [0x25f98828@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=59BD3460-5F19-483C-ADC9-0CC973D9CB0C] Sending a read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:23.136351 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x6e06a20 (message: kXR_read (handle: 0x00000000, offset: 0, size: 5243283) ).
--
[2024-03-03 14:56:29.381762 +0100][Debug  ][File              ] [0x25f98828@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=59BD3460-5F19-483C-ADC9-0CC973D9CB0C] Sending a read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:29.381919 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x4e12b50 (message: kXR_read (handle: 0x00000000, offset: 3872048242, size: 413) ).
--
[2024-03-03 14:56:29.673917 +0100][Debug  ][File              ] [0x25f98828@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=59BD3460-5F19-483C-ADC9-0CC973D9CB0C] Sending a read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:29.674115 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x6d04340 (message: kXR_read (handle: 0x00000000, offset: 3867679172, size: 4369483) ).
--
[2024-03-03 14:56:43.445820 +0100][Debug  ][File              ] [0x14d5d98@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=8DF1FEF1-9E06-419B-BE92-F5EF459EDA32] Sending a read command for handle 0x1 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:43.445890 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x6e06810 (message: kXR_read (handle: 0x01000000, offset: 3867678939, size: 168) ).
--
Took 14.426180082999053 seconds
[2024-03-03 14:56:44.102932 +0100][Debug  ][File              ] [0x25f98828@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=59BD3460-5F19-483C-ADC9-0CC973D9CB0C] Sending a close command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095



$ XRD_LOGLEVEL=Debug python /tmp/uproot-perf.py $pfn xrootd 2>&1 | rg '(read command|Took )' -A 1
[2024-03-03 14:56:59.208793 +0100][Debug  ][File              ] [0x6a2cb98@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=4BD91220-0CC2-473F-A6C9-838F342FC691] Sending a read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:59.208916 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x75054d0 (message: kXR_read (handle: 0x00000000, offset: 0, size: 403) ).
--
[2024-03-03 14:56:59.499091 +0100][Debug  ][File              ] [0x6a2cb98@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=4BD91220-0CC2-473F-A6C9-838F342FC691] Sending a read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:59.499166 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x75054d0 (message: kXR_read (handle: 0x00000000, offset: 3872048242, size: 313) ).
--
[2024-03-03 14:56:59.789551 +0100][Debug  ][File              ] [0x6a2cb98@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=4BD91220-0CC2-473F-A6C9-838F342FC691] Sending a read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:59.789725 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x14806d00 (message: kXR_read (handle: 0x00000000, offset: 3867679172, size: 681) ).
--
[2024-03-03 14:57:00.085071 +0100][Debug  ][File              ] [0x6a2cb98@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=4BD91220-0CC2-473F-A6C9-838F342FC691] Sending a vector read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:57:00.085096 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x4f066c0 (message: kXR_readv (handle: 0x00000000, chunks: [(offset: 3867678939, size: 168); ], total size: 168) ).
--
Took 0.6477262080006767 seconds
[2024-03-03 14:57:00.440007 +0100][Debug  ][File              ] [0x6a2cb98@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=4BD91220-0CC2-473F-A6C9-838F342FC691] Sending a close command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095

That's right, vector reads are no longer the default as a result of a study that @lobis did in his fellowship last fall. The change in behavior went in at the boundary between 5.2.x and 5.3.0, when FSSpecSource became the new default backend.

You can still get the other backends by explicitly setting the handler option (see uproot.open and similar). The original Source subclasses haven't been removed.
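For example, a minimal sketch (the URL is just a placeholder) that selects the old XRootD source for one call:

import uproot
import uproot.source.xrootd

# placeholder path; any root:// URL works the same way
pfn = "root://eospublic.cern.ch//eos/opendata/lhcb/some-file.root"

# fall back to the pre-5.3 XRootD source instead of the fsspec-based default
with uproot.open(pfn, handler=uproot.source.xrootd.XRootDSource) as f:
    print(f.keys())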

In @lobis's tests, sending and receiving independent requests and responses for each TBasket and handling them with Python async outperformed a single request with a streamed response. I can imagine that it depends on the server and the network, and that neither method would be the winner in all scenarios.

0.69s and 2.7s are both rather short times and likely don't reveal the asymptotic behavior... Hey, wait a minute! I remember our conversation back in Australia: you were motivated to introduce vector reads because you were working interactively with students who had small files, for which latency was the important metric. I can see how a 2.7s wait is considerably more noticeable than a 0.7s wait.

How would it be to pass handler=uproot.HTTPSource to every uproot.open call? Do we need a mechanism for global open-options? Actually, I just remembered that you can set

uproot.reading.open.defaults["handler"] = uproot.HTTPSource

to set this globally, once at the top of a script. Would that work?

I've done some very crude testing with a high-latency connection (Australia <-> CERN) and it seems like the new fsspec implementation does well in comparison to the old XRootD source for larger amounts of data. It might even be faster, if not for my next question.

Is it expected that the new fsspec based source reads more data than the old XRootD source in some cases? In #1157 (comment) it's reading over 6000x more data (9.16 MiB vs 1.53 KiB).

No, I expect it to read exactly the same amount of data. 6000× sounds like reading all TBranches when only a few are desired.

Does Uproot agree with XRD_LOGLEVEL=Debug in how many chunks (roughly, TBaskets) and bytes are requested?

self._num_requests += 1
self._num_requested_chunks += len(ranges)
self._num_requested_bytes += sum(stop - start for start, stop in ranges)

They can be accessed through uproot.Source.num_requested_chunks and uproot.Source.num_requested_bytes.
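For example, a rough sketch of checking them after a read (assuming the source is reachable as f.file.source, with pfn as in the script above):

import uproot

f = uproot.open(pfn)
f["Refs"]["Params"].array()

source = f.file.source  # the underlying uproot.Source
print("requested chunks:", source.num_requested_chunks)
print("requested bytes: ", source.num_requested_bytes)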

fsspec-xrootd is converting the requested ranges into bigger blocks: https://github.com/CoffeaTeam/fsspec-xrootd/blob/551692a472fcff2cdba3799ebc75670f0199ddf2/src/fsspec_xrootd/xrootd.py#L709-L742

If I force self.blocksize = 0, the 9.6 seconds becomes 1.6 seconds. That's still twice as long as XRootDSource, but I haven't looked into the remaining difference yet.

self.blocksize also seems fishy: it seems to be added to the reads rather than used to chunk them (i.e. self.blocksize = 1 results in one extra byte being requested).

The rest of the difference is coming from the fsspec source opening the file twice:

[2024-03-05 23:00:45.735325 +0100][Debug  ][ExDbgMsg          ] [ccsrm.ihep.ac.cn:1094] MsgHandler created: 0x15e5ffa0 (message: kXR_open (file: /dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) ).
[2024-03-05 23:00:50.783995 +0100][Debug  ][ExDbgMsg          ] [ccsrm.ihep.ac.cn:1094] MsgHandler created: 0x15f25490 (message: kXR_stat (path: /dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst, flags: none) ).
FSSpecSource.chunk(start=0, stop=403)
[2024-03-05 23:00:51.071420 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x1781f720 (message: kXR_read (handle: 0x00000000, offset: 0, size: 403) ).
FSSpecSource.chunk(start=3872048242, stop=3872048555)
[2024-03-05 23:00:51.376154 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x1781f510 (message: kXR_read (handle: 0x00000000, offset: 3872048242, size: 313) ).
FSSpecSource.chunk(start=3867679172, stop=3867679853)
[2024-03-05 23:00:51.682223 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x15f25740 (message: kXR_read (handle: 0x00000000, offset: 3867679172, size: 681) ).
FSSpecSource.chunks(ranges=[(3867678939, 3867679107)])
[2024-03-05 23:00:51.995422 +0100][Debug  ][ExDbgMsg          ] [ccsrm.ihep.ac.cn:1094] MsgHandler created: 0x23f04350 (message: kXR_open (file: /dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst, mode: 074, flags: kXR_open_read kXR_async kXR_retstat ) ).
[2024-03-05 23:00:52.588776 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x15f25490 (message: kXR_read (handle: 0x01000000, offset: 3867678939, size: 168) ).
[2024-03-05 23:00:52.890174 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x17826850 (message: kXR_close (handle: 0x01000000) ).
Took 7.651638916009688 seconds
[2024-03-05 23:00:53.264742 +0100][Debug  ][ExDbgMsg          ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x27b2c1c0 (message: kXR_close (handle: 0x00000000) ).

The second one is caused by FSSpecSource.chunks calling _cat_file:

self._fs._cat_file(self._file_path, start=start, end=stop)

which opens a new file behind the scenes for every call:

https://github.com/CoffeaTeam/fsspec-xrootd/blob/b12503eb852f82f0b6bf85e1df02b2b683c1f819/src/fsspec_xrootd/xrootd.py#L392-L399

When we were evaluating a single streamed request versus many requests in flight in parallel, it was mostly with HTTP as the protocol (easier to set up instances for testing). The issues we were concerned about, overloading the server with too many requests versus funneling all of a multi-node server's output through a single serializer, are unrelated to this additional cost of opening XRootD File objects.

That's connected to the reason the file is opened twice, right? Does a File need to access the beginning of a file before it can extract a fixed byte range from the middle?

> That's connected to the reason the file is opened twice, right? Does a File need to access the beginning of a file before it can extract a fixed byte range from the middle?

No, it looks like this is just a limitation of the implementation in fsspec-xrootd.

Unlike POSIX-style file reading, XRootD has no concept of a current position within a file: every read() includes both the offset and the size [1]. This effectively makes file handles stateless once opened, so there could probably be a cache at the "fsspec filesystem" level that lets a single file handle serve any number of requests.
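For illustration only, a sketch with the pyxrootd client directly (placeholder URL; sizes borrowed from the logs above): each read names its own offset and length, so one opened handle can serve unrelated requests.

from XRootD import client
from XRootD.client.flags import OpenFlags

f = client.File()
f.open("root://eospublic.cern.ch//eos/opendata/lhcb/some-file.root", OpenFlags.READ)

# no shared file position: every read carries an explicit offset and size
status, header = f.read(offset=0, size=403)          # ROOT file header
status, blob = f.read(offset=3867679172, size=681)   # an arbitrary range deep in the file

f.close()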

I think the overhead of opening the file comes from the combination of:

  • authenticating
  • redirecting to the server which actually has the file

What a coincidence, I've also been looking at this in the last few days.
Regarding reading more bytes than necessary, I think the best course of action there is to set self._fs.open(self._file_path, block_size=BETTER_DEFAULT) in

self._file = self._fs.open(self._file_path)

because in fsspec the default is 5MB:
https://github.com/fsspec/filesystem_spec/blob/8a542c8c555614ab91fe07ce27a2d3dd3fd6700f/fsspec/spec.py#L1577
We usually read through a File handle only a few times (the file header, then the TTree header, which is usually at the end), while the bulk of the reads (a TBranch fetching all of its baskets) go through the stateless FileSystem interface (I'll say more on this in a bit), so that handle might deserve a smaller size.
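A minimal sketch of that change, assuming block_size is honoured as a read-ahead/buffer size by fsspec-xrootd (BETTER_DEFAULT here is only illustrative):

# inside FSSpecSource, instead of self._fs.open(self._file_path):
BETTER_DEFAULT = 403  # e.g. the smallest a ROOT file can be; anything far below fsspec's 5 MB default
self._file = self._fs.open(self._file_path, block_size=BETTER_DEFAULT)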

So I've been running a skim benchmark where I access the first 50k events out of a few hundred files hosted throughout the US, comparing the new and old xrootd handlers, and write out the data for 96 columns from the input into local parquet files. I find a pretty sizable degradation, here in the report.performance_counters.num_requested_bytes / report.duration from the from_uproot dask report (c/o #1156):
[image: requested bytes per second (num_requested_bytes / duration) by site]
As is apparent, there is a lot of variation by site in how well servers handle the large number (about 1000 per chunk) of requests. I added back the xrootd vector read support through the cat_ranges fsspec interface in #1162 and this improved the performance slightly:
[image: the same per-site rates after adding vector reads via #1162]
On a single file hosted at FNAL I can more reproducibly see a difference, with a summary as follows:

| Setup | from_uproot report duration |
| --- | --- |
| Original xrootd handler | 5s |
| FSSpec handler as of v5.3.2 | 13-15s (varies) |
| FSSpec handler with #1162 | 7.5s |
| FSSpec handler with #1162 and CoffeaTeam/fsspec-xrootd#54 (prevent multiple file opens) | 5s |

So we can eventually recover the old performance with known solutions. CoffeaTeam/fsspec-xrootd#54 will take some more thought on how to best cache file handles in a way that doesn't cause trouble. That said, all this is rather terrible performance. So I dumped the ranges passed to each source.chunks() call and found there are 128 calls (96 branches plus some other metadata?), each with about 10-15 ranges (baskets?) to read, for a total of about 1500 individual ranges to fetch! I'm surprised that opening a file 1500 times in 10 seconds did not cause the server to block my client 😁.
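For reference, one way to capture those ranges (a hypothetical monkeypatch sketch, not necessarily how it was collected here) is to wrap FSSpecSource.chunks; this also defines the all_ranges used in the snippet below:

import uproot.source.fsspec

all_ranges = []  # one list of (start, stop) pairs per chunks() call

_original_chunks = uproot.source.fsspec.FSSpecSource.chunks

def _logging_chunks(self, ranges, notifications):
    all_ranges.append(list(ranges))
    return _original_chunks(self, ranges, notifications)

uproot.source.fsspec.FSSpecSource.chunks = _logging_chunks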
What I notice is that each separate branch read, even in the case of a vector read, is fetching ranges close to each other in the file but in separate read calls. I tried concatenating all 1500 calls together and sorting them, but it did not noticeably change the read time. However, applying a merge algorithm:

# flatten all requested ranges across calls and sort them by start offset
ordered = sorted((start, stop) for ranges in all_ranges for start, stop in ranges)
better = ordered[:1]
for start, stop in ordered[1:]:
    # merge into the previous range if the gap is small:
    # at most 2% of this range's size, or 100 bytes, whichever is larger
    if start - better[-1][1] <= max((stop - start) // 50, 100):
        better[-1] = better[-1][0], stop
    else:
        better.append((start, stop))

I get a list of ranges, better, that is only 129 entries long, and it fetches all the data in 1.4s even without a vector read. So I think the lesson here is to pre-fetch everything if we know what we need, as is the case in uproot.dask/coffea.
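As a rough sketch of that pre-fetch (hypothetical; fs and file_path stand in for the fsspec filesystem and path used above), the coalesced list can be requested in one go through fsspec's cat_ranges, which with #1162 maps onto an XRootD vector read:

# fetch every coalesced range at once instead of ~1500 separate reads
starts = [start for start, stop in better]
stops = [stop for start, stop in better]
blobs = fs.cat_ranges([file_path] * len(better), starts, stops)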

The test code, omitting the cluster creation using lpcjobqueue, is as follows:

import dask
import awkward as ak
import dask_awkward as dak

from coffea import dataset_tools

def analysis(events):
    dataset = events.metadata["dataset"]
    photonSelect = (
        (events.Photon.pt > 18)
        & (abs(events.Photon.eta) < 1.5)
        & (events.Photon.isScEtaEE | events.Photon.isScEtaEB)
        & (events.Photon.cutBased >= 1)
    )
    events = events[
        ak.any(photonSelect, axis=1)
    ]
    skim = ak.zip(
        {
            "Jets": events.Jet,
            "MET": events.MET,
            "Photon": events.Photon,
        },
        depth_limit=1,
    )
    
    skim_task = dak.to_parquet(
        skim,
        f"root://cmseos.fnal.gov//store/user/ncsmith/skimtest/{dataset}",
        compute=False,
    )
    return skim_task

out, report = dataset_tools.apply_to_fileset(
    analysis,
    dataset_tools.max_files(
        dataset_tools.max_chunks(samples_ready, 1),
        200,
    ),
    uproot_options={
        "allow_read_errors_with_report": True,
        # "handler": uproot.source.xrootd.XRootDSource,
    },
)

where samples_ready is a list of a few thousand files from 5 datasets.
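For completeness, a hedged sketch of how the returned graphs would then be executed (the exact invocation may differ from what was actually run):

# materialize the skims and the per-file read report
computed_out, computed_report = dask.compute(out, report)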

Thanks for the analysis @nsmith-! 🍰

All of your conclusions match what I found, with the bonus of numbers to back them up.

As for caching open files, I realised it's analogous to the connection pool used by each aiohttp session, so I think it's definitely the right way to go.
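To make the analogy concrete, here is a toy sketch of the idea (not the actual CoffeaTeam/fsspec-xrootd#54 implementation, and ignoring locking and eviction): keep one opened handle per URL and reuse it for every stateless read.

from XRootD import client
from XRootD.client.flags import OpenFlags

_handles: dict[str, client.File] = {}

def get_handle(url: str) -> client.File:
    # reuse an already-open handle; XRootD reads are stateless, so sharing is safe
    f = _handles.get(url)
    if f is None:
        f = client.File()
        f.open(url, OpenFlags.READ)
        _handles[url] = f
    return f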

> I think the best course of action there is to set self._fs.open(self._file_path, block_size=BETTER_DEFAULT)

Looking at the XRootD logs (#1157 (comment)), it seems like BETTER_DEFAULT should be zero, as block_size=1 results in the reads being one byte longer (maybe this is a bug on the fsspec-xrootd side).

It was at first my understanding that block_size is supposed to set the granularity at which we fetch data from any position in the file, e.g. with it set to 1024 a read of bytes 100-1200 would fetch two blocks (0-1024, 1024-2048) and then cache them locally. But perhaps I'm wrong about that; from your logs it seems not to be doing that, only adjusting how much is initially read from the head of the file.

The fsspec docs aren't very enlightening: https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=block_size#fsspec.spec.AbstractFileSystem.open says "some indication of buffering". In some implementations it is called the "read-ahead" size, which may make sense for sequential file access (read large chunks into a local buffer a few times rather than small chunks many times). Perhaps that is the case here (unfortunately, despite working with Scott on the initial implementation of fsspec-xrootd, I can't recall what it is for), in which case a reasonable value for us may be 403, per

"begin_chunk_size": 403, # the smallest a ROOT file can be

(or better yet, round it up to 1024; this is a small number these days)

OK, I implemented a file handle cache in CoffeaTeam/fsspec-xrootd#54, and the sites that previously had a lot of trouble with the frequent file opens are now doing better.
[image: per-site rates after the file handle cache in CoffeaTeam/fsspec-xrootd#54]

On the uproot side, we can save one additional open if we default to the stateless form:

if self._fh:
    self._fh.seek(start)
    data = self._fh.read(stop - start)
else:
    data = self._fs.cat_file(self._file_path, start, stop)

@lobis is there a reason you prefer a file handle here, maybe for reading local files?

Here's a summary of what I've attempted:
[image: summary of per-site read rates for each set of patches]

The legend entries correspond to progressively more patches to improve this situation.

Note that the performance steadily improves for all sites, though in some cases the improvement may come from just one of the patches. I don't think any of these patches are harmful in other scenarios.

In my opinion, once all of these are merged we can close this issue.

All the updates are merged. I'll close this for now.