Performance degradation reading files via root://
chrisburr opened this issue
Since the move to using fsspec-xrootd, the performance of reading files via XRootD has dropped. For example, here is a script which pulls a few bytes of metadata from an LHCb file:
uproot-perf.py
```python
#!/usr/bin/env python
import argparse
import time

import uproot
import uproot.source.xrootd


def run_test(pfn, handler):
    f = uproot.open(pfn, handler=handler)
    start = time.perf_counter()
    f["Refs"]["Params"].array()
    print("Took", time.perf_counter() - start, "seconds")


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("pfn")
    parser.add_argument("handler", choices=["default", "xrootd"])
    args = parser.parse_args()
    handler = {
        "default": None,
        "xrootd": uproot.source.xrootd.XRootDSource,
    }[args.handler]
    run_test(args.pfn, handler)


if __name__ == "__main__":
    parse_args()
```
```console
# Use some LHCb open data as an example
$ export pfn='root://eospublic.cern.ch//eos/opendata/lhcb/Collision12/CHARM/LHCb_2012_Beam4000GeV_VeloClosed_MagDown_RealData_Reco14_Stripping21_CHARM_MDST/00041836/0000/00041836_00009718_1.charm.mdst'

# Using the default
$ ./uproot-perf.py $pfn default
Took 0.17654120799999995 seconds

# Using the old XRootDSource
$ ./uproot-perf.py $pfn xrootd
Took 0.07812191700000004 seconds
```
The effect is less obvious for me with eospublic as I'm only a few milliseconds away from it. If I use data stored further away from me, I get 0.69s vs 2.7s.
I can also see the overhead when pulling larger amounts of data, though it's a smaller effect.
I suspect this is because "vector reads" are no longer being used, so you pay the round-trip latency and overhead for each individual read.
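A back-of-envelope latency model for this suspicion (the numbers are illustrative, not measured): N serial single-range reads pay N round trips, while one kXR_readv vector read carries all ranges in roughly one round trip.

```python
# Illustrative latency model: serial single-range reads vs one vector read.
def serial_cost(n_ranges, rtt_seconds):
    return n_ranges * rtt_seconds  # one full round trip per range

def vector_cost(n_ranges, rtt_seconds):
    return rtt_seconds  # all ranges travel in a single request/response pair

# e.g. 20 ranges at 135 ms RTT: ~2.7 s serial vs ~0.135 s vectored
print(serial_cost(20, 0.135), vector_cost(20, 0.135))
```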
From a bit more debugging, it looks like it's reading far more data than it needs to; see the `size:` parts of the logs below:
$ XRD_LOGLEVEL=Debug python /tmp/uproot-perf.py $pfn default 2>&1 | rg '(read command|Took )' -A 1
[2024-03-03 14:56:23.136216 +0100][Debug ][File ] [0x25f98828@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=59BD3460-5F19-483C-ADC9-0CC973D9CB0C] Sending a read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:23.136351 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x6e06a20 (message: kXR_read (handle: 0x00000000, offset: 0, size: 5243283) ).
--
[2024-03-03 14:56:29.381762 +0100][Debug ][File ] [0x25f98828@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=59BD3460-5F19-483C-ADC9-0CC973D9CB0C] Sending a read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:29.381919 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x4e12b50 (message: kXR_read (handle: 0x00000000, offset: 3872048242, size: 413) ).
--
[2024-03-03 14:56:29.673917 +0100][Debug ][File ] [0x25f98828@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=59BD3460-5F19-483C-ADC9-0CC973D9CB0C] Sending a read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:29.674115 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x6d04340 (message: kXR_read (handle: 0x00000000, offset: 3867679172, size: 4369483) ).
--
[2024-03-03 14:56:43.445820 +0100][Debug ][File ] [0x14d5d98@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=8DF1FEF1-9E06-419B-BE92-F5EF459EDA32] Sending a read command for handle 0x1 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:43.445890 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x6e06810 (message: kXR_read (handle: 0x01000000, offset: 3867678939, size: 168) ).
--
Took 14.426180082999053 seconds
[2024-03-03 14:56:44.102932 +0100][Debug ][File ] [0x25f98828@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=59BD3460-5F19-483C-ADC9-0CC973D9CB0C] Sending a close command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
$ XRD_LOGLEVEL=Debug python /tmp/uproot-perf.py $pfn xrootd 2>&1 | rg '(read command|Took )' -A 1
[2024-03-03 14:56:59.208793 +0100][Debug ][File ] [0x6a2cb98@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=4BD91220-0CC2-473F-A6C9-838F342FC691] Sending a read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:59.208916 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x75054d0 (message: kXR_read (handle: 0x00000000, offset: 0, size: 403) ).
--
[2024-03-03 14:56:59.499091 +0100][Debug ][File ] [0x6a2cb98@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=4BD91220-0CC2-473F-A6C9-838F342FC691] Sending a read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:59.499166 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x75054d0 (message: kXR_read (handle: 0x00000000, offset: 3872048242, size: 313) ).
--
[2024-03-03 14:56:59.789551 +0100][Debug ][File ] [0x6a2cb98@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=4BD91220-0CC2-473F-A6C9-838F342FC691] Sending a read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:56:59.789725 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x14806d00 (message: kXR_read (handle: 0x00000000, offset: 3867679172, size: 681) ).
--
[2024-03-03 14:57:00.085071 +0100][Debug ][File ] [0x6a2cb98@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=4BD91220-0CC2-473F-A6C9-838F342FC691] Sending a vector read command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
[2024-03-03 14:57:00.085096 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x4f066c0 (message: kXR_readv (handle: 0x00000000, chunks: [(offset: 3867678939, size: 168); ], total size: 168) ).
--
Took 0.6477262080006767 seconds
[2024-03-03 14:57:00.440007 +0100][Debug ][File ] [0x6a2cb98@root://ccsrm.ihep.ac.cn:1094//dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst?xrdcl.requuid=4BD91220-0CC2-473F-A6C9-838F342FC691] Sending a close command for handle 0x0 to dpmlhcb01.ihep.ac.cn:1095
That's right, vector reads are no longer the default as a result of a study that @lobis did in his fellowship last fall. The change in behavior went in at the boundary between 5.2.x and 5.3.0, when FSSpecSource became the new default backend.
You can still get the other backends by explicitly setting the `handler` option (see `uproot.open` and similar). The original `Source` subclasses haven't been removed.
In @lobis's tests, sending and receiving independent requests and responses for each TBasket and handling them with Python async outperformed a single request with a streamed response. I can imagine that it depends on the server and the network, and that neither method would be the winner in all scenarios.
0.69s and 2.7s are both rather short times and likely don't reveal the asymptotic behavior... Hey, wait a minute! I remember our conversation back in Australia: you were motivated to introduce vector reads because you were working interactively with students who had small files, for which latency was the important metric. I can see how a 2.7s wait is considerably more noticeable than a 0.7s wait.
How would it be to pass `handler=uproot.HTTPSource` to every `uproot.open` call? Do we need a mechanism for global open-options? I just remembered that you can set

```python
uproot.reading.open.defaults["handler"] = uproot.HTTPSource
```

to set this globally, once at the top of a script. Would that work?
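A minimal sketch of how such a global-defaults mechanism works, using a hypothetical `open_file` stand-in rather than uproot's actual internals: a module-level dict is consulted whenever no per-call option is given.

```python
# Hypothetical stand-in for uproot's global open-options mechanism.
defaults = {"handler": None}

def open_file(path, handler=None):
    # The per-call option wins; otherwise fall back to the global default.
    return handler if handler is not None else defaults["handler"]

defaults["handler"] = "XRootDSource"  # set once at the top of a script
print(open_file("root://example.org//file.root"))  # uses the global default
```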
I've done some very crude testing with a high-latency connection (Australia <-> CERN) and it seems like the new fsspec implementation does well in comparison to the old XRootD source for larger amounts of data. It might even be faster, if not for my next question.
Is it expected that the new fsspec based source reads more data than the old XRootD source in some cases? In #1157 (comment) it's reading over 6000x more data (9.16 MiB vs 1.53 KiB).
No, I expect it to read exactly the same amount of data. 6000× sounds like reading all TBranches when only a few are desired.
Does Uproot agree with `XRD_LOGLEVEL=Debug` on how many chunks (roughly, TBaskets) and bytes are requested?
uproot5/src/uproot/source/fsspec.py
Lines 137 to 139 in accc1ca
They can be accessed through `uproot.Source.num_requested_chunks` and `uproot.Source.num_requested_bytes`.
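A sketch of the kind of bookkeeping those counters imply (`CountingSource` here is illustrative, not uproot's actual class): every batch of requested ranges bumps both counters, so the totals can be compared against what the server-side debug log reports.

```python
# Hypothetical sketch of per-source request accounting.
class CountingSource:
    def __init__(self):
        self.num_requested_chunks = 0
        self.num_requested_bytes = 0

    def chunks(self, ranges):
        # ranges is a list of (start, stop) byte offsets
        self.num_requested_chunks += len(ranges)
        self.num_requested_bytes += sum(stop - start for start, stop in ranges)

src = CountingSource()
src.chunks([(0, 403), (3867678939, 3867679107)])  # ranges seen in the log above
print(src.num_requested_chunks, src.num_requested_bytes)  # 2 571
```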
fsspec-xrootd is converting the requested ranges into bigger blocks: https://github.com/CoffeaTeam/fsspec-xrootd/blob/551692a472fcff2cdba3799ebc75670f0199ddf2/src/fsspec_xrootd/xrootd.py#L709-L742
If I force `self.blocksize = 0`, 9.6 seconds becomes 1.6 seconds. This is still twice as long as `XRootDSource`, but I haven't looked further yet.

`self.blocksize` also seems fishy: it appears to be added to the reads rather than chunking them (i.e. `self.blocksize = 1` results in one extra byte being requested).
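A hypothetical illustration of the two behaviors (the function names are mine, not fsspec-xrootd's): block-aligned chunking would round a request out to block boundaries, whereas the observed behavior appears to simply add the block size to the requested length, so a block size of 1 fetches exactly one extra byte.

```python
# What block-aligned buffering would look like.
def aligned_range(start, stop, blocksize):
    if blocksize == 0:
        return (start, stop)  # no buffering: read exactly what was asked
    lo = (start // blocksize) * blocksize
    hi = -(-stop // blocksize) * blocksize  # round up to the next boundary
    return (lo, hi)

# The behavior described above: blocksize added instead of chunking.
def observed_range(start, stop, blocksize):
    return (start, stop + blocksize)

print(aligned_range(100, 1200, 1024))  # (0, 2048)
print(observed_range(100, 1200, 1))    # (100, 1201)
```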
The rest of the difference is coming from the fsspec source opening the file twice:
[2024-03-05 23:00:45.735325 +0100][Debug ][ExDbgMsg ] [ccsrm.ihep.ac.cn:1094] MsgHandler created: 0x15e5ffa0 (message: kXR_open (file: /dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst, mode: 00, flags: kXR_open_read kXR_async kXR_retstat ) ).
[2024-03-05 23:00:50.783995 +0100][Debug ][ExDbgMsg ] [ccsrm.ihep.ac.cn:1094] MsgHandler created: 0x15f25490 (message: kXR_stat (path: /dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst, flags: none) ).
FSSpecSource.chunk(start=0, stop=403)
[2024-03-05 23:00:51.071420 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x1781f720 (message: kXR_read (handle: 0x00000000, offset: 0, size: 403) ).
FSSpecSource.chunk(start=3872048242, stop=3872048555)
[2024-03-05 23:00:51.376154 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x1781f510 (message: kXR_read (handle: 0x00000000, offset: 3872048242, size: 313) ).
FSSpecSource.chunk(start=3867679172, stop=3867679853)
[2024-03-05 23:00:51.682223 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x15f25740 (message: kXR_read (handle: 0x00000000, offset: 3867679172, size: 681) ).
FSSpecSource.chunks(ranges=[(3867678939, 3867679107)])
[2024-03-05 23:00:51.995422 +0100][Debug ][ExDbgMsg ] [ccsrm.ihep.ac.cn:1094] MsgHandler created: 0x23f04350 (message: kXR_open (file: /dpm/ihep.ac.cn/home/lhcb/LHCb/Collision18/LEPTONIC.MDST/00092248/0000/00092248_00002347_1.leptonic.mdst, mode: 074, flags: kXR_open_read kXR_async kXR_retstat ) ).
[2024-03-05 23:00:52.588776 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x15f25490 (message: kXR_read (handle: 0x01000000, offset: 3867678939, size: 168) ).
[2024-03-05 23:00:52.890174 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x17826850 (message: kXR_close (handle: 0x01000000) ).
Took 7.651638916009688 seconds
[2024-03-05 23:00:53.264742 +0100][Debug ][ExDbgMsg ] [dpmlhcb01.ihep.ac.cn:1095] MsgHandler created: 0x27b2c1c0 (message: kXR_close (handle: 0x00000000) ).
The second one is caused by `FSSpecSource.chunks` calling `_cat_file`:

uproot5/src/uproot/source/fsspec.py
Line 164 in e47934f

which opens a new file behind the scenes for every call.
When we were evaluating single-request streamed reads versus many requests in flight in parallel, it was mostly with HTTP as the protocol (easier to set up instances for testing). The issues we were concerned about (overloading the server with too many requests versus funneling all of a multi-node server's output through a single serializer) are unrelated to this additional cost of XRootD `File` object overhead.
That's connected to the reason the file is opened twice, right? Does a `File` need to access the beginning of a file before it can extract a fixed byte range from the middle?
> That's connected to the reason the file is opened twice, right? Does a `File` need to access the beginning of a file before it can extract a fixed byte range from the middle?
No, it looks like this is just a limitation of the implementation in fsspec-xrootd.
Unlike POSIX-style file reading, XRootD has no concept of a current position within a file. Instead, every `read()` includes both the offset and size [1]. This effectively makes the file handles stateless once opened, so there could probably be a cache at the "fsspec filesystem" level so that a single file handle can be used by any number of requests.
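For comparison, POSIX offers the same stateless idiom in `pread`: the offset travels with each call, so the file descriptor keeps no current position that concurrent readers would have to coordinate on. A minimal sketch using only the standard library:

```python
import os
import tempfile

# Write a small file to read back positionally.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"0123456789abcdef")
    path = f.name

fd = os.open(path, os.O_RDONLY)
chunk = os.pread(fd, 4, 10)  # read 4 bytes starting at absolute offset 10
os.close(fd)
os.remove(path)
print(chunk)  # b'abcd'
```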
I think the overhead of opening the file comes from the combination of:
- authenticating
- redirecting to the server which actually has the file
What a coincidence, I've also been looking at this in the last few days.
Regarding reading more bytes than necessary, I think the best course of action is to set `self._fs.open(self._file_path, block_size=BETTER_DEFAULT)` in

uproot5/src/uproot/source/fsspec.py
Line 55 in e47934f

because in fsspec the default is 5 MB:
https://github.com/fsspec/filesystem_spec/blob/8a542c8c555614ab91fe07ce27a2d3dd3fd6700f/fsspec/spec.py#L1577
We usually read through a File handle a few times (file header, then the TTree header, usually at the end) and the bulk of the reads (TBranch get all baskets) go through the FileSystem stateless interface (I'll say more on this in a bit), so that handle might deserve a smaller size.
So I've been running a skim benchmark where I access the first 50k events out of a few hundred files hosted throughout the US, comparing the new and old xrootd handlers, and write out the data for 96 columns from the input into local parquet files. I find a pretty sizable degradation, here in the `report.performance_counters.num_requested_bytes / report.duration` from the `from_uproot` dask report (c/o #1156):
As is apparent, there is a lot of variation by site in how well servers handle the large number (about 1000 per chunk) of requests. I added back the xrootd vector read support through the `cat_ranges` fsspec interface in #1162, and this improved the performance slightly:
On a single file hosted at FNAL I can more reproducibly see a difference, with a summary as follows:
| Setup | `from_uproot` report duration |
|---|---|
| Original xrootd handler | 5s |
| FSSpec handler as of v5.3.2 | 13-15s (varies) |
| FSSpec handler with #1162 | 7.5s |
| FSSpec handler with #1162 and CoffeaTeam/fsspec-xrootd#54 (prevent multiple file opens) | 5s |
So we can eventually recover the old performance with known solutions. CoffeaTeam/fsspec-xrootd#54 will take some more thought on how best to cache file handles in a way that doesn't cause trouble. That said, all this is rather terrible performance. So I dumped the ranges passed to each `source.chunks()` call, and found there are 128 calls (96 branches plus some other metadata?) each with about 10-15 ranges (baskets?) to read, for a total of about 1500 individual ranges to fetch! I'm surprised that opening a file 1500 times in 10 seconds did not cause the server to block my client 😁.
What I notice is that each separate branch read, even in the case of a vector read, is fetching ranges close to each other in the file but in separate read calls. I tried concatenating all 1500 calls together and sorting them, but it did not noticeably change the read time. However, applying a merge algorithm:
```python
ordered = sorted((start, stop) for ranges in all_ranges for start, stop in ranges)
better = ordered[:1]
for start, stop in ordered[1:]:
    if start - better[-1][1] <= max((stop - start) // 50, 100):
        better[-1] = (better[-1][0], stop)
    else:
        better.append((start, stop))
```
I get a list of ranges `better` that is only 129 entries long, and it fetches all the data in 1.4s even without a vector read. So I think the lesson here is to pre-fetch everything if we know what we need, as is the case in uproot.dask/coffea.
The test code, omitting the cluster creation using lpcjobqueue, is as follows:
```python
import dask
import awkward as ak
import dask_awkward as dak
from coffea import dataset_tools


def analysis(events):
    dataset = events.metadata["dataset"]
    photonSelect = (
        (events.Photon.pt > 18)
        & (abs(events.Photon.eta) < 1.5)
        & (events.Photon.isScEtaEE | events.Photon.isScEtaEB)
        & (events.Photon.cutBased >= 1)
    )
    events = events[ak.any(photonSelect, axis=1)]
    skim = ak.zip(
        {
            "Jets": events.Jet,
            "MET": events.MET,
            "Photon": events.Photon,
        },
        depth_limit=1,
    )
    skim_task = dak.to_parquet(
        skim,
        f"root://cmseos.fnal.gov//store/user/ncsmith/skimtest/{dataset}",
        compute=False,
    )
    return skim_task


out, report = dataset_tools.apply_to_fileset(
    analysis,
    dataset_tools.max_files(
        dataset_tools.max_chunks(samples_ready, 1),
        200,
    ),
    uproot_options={
        "allow_read_errors_with_report": True,
        # "handler": uproot.source.xrootd.XRootDSource,
    },
)
```
where `samples_ready` is a list of a few thousand files from 5 datasets.
Thanks for the analysis @nsmith-! 🍰
All of your conclusions match with what I found, with the bonus of numbers to back them up.
For caching open files, I realised it's analogous to the connection pool used by each aiohttp session, so I think it's definitely the right way to go.
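A minimal sketch of what such a handle cache could look like (all names here are hypothetical), in the spirit of a connection pool: open each path once and hand the same handle to every subsequent reader instead of re-opening (and re-authenticating, re-redirecting) per request.

```python
import threading

# Hypothetical per-path cache of open file handles.
class HandleCache:
    def __init__(self, opener):
        self._opener = opener       # callable: path -> open file handle
        self._handles = {}
        self._lock = threading.Lock()

    def get(self, path):
        with self._lock:
            if path not in self._handles:
                self._handles[path] = self._opener(path)
            return self._handles[path]

opens = []
cache = HandleCache(lambda p: opens.append(p) or object())
h1 = cache.get("root://example.org//file.root")
h2 = cache.get("root://example.org//file.root")
print(h1 is h2, len(opens))  # True 1
```

A real implementation would also need eviction and invalidation of stale handles, which is presumably the part of CoffeaTeam/fsspec-xrootd#54 that "will take some more thought."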
> I think the best course of action is to set `self._fs.open(self._file_path, block_size=BETTER_DEFAULT)`

Looking at the XRootD logs (#1157 (comment)), it seems like `BETTER_DEFAULT` would be zero, as `block_size=1` results in the reads being one byte longer (maybe this is a bug on the fsspec-xrootd side).
My understanding at first was that `block_size` sets the granularity at which we fetch data from any position in the file: with it set to e.g. 1024, a read of bytes 100-1200 would fetch two blocks (0-1024, 1024-2048) and then cache them locally. But perhaps I'm wrong about that. From your logs it seems not to be doing that, only adjusting how much is initially read from the head of the file.
The fsspec docs aren't very enlightening: https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=block_size#fsspec.spec.AbstractFileSystem.open says "some indication of buffering". In some implementations it is called the "read-ahead" size, which makes sense for sequential file access (read large chunks into a local buffer a few times rather than reading small chunks many times). Perhaps that is the case here (unfortunately, despite working with Scott on the initial implementation of fsspec-xrootd, I cannot recall what it is for), in which case a reasonable value for us may be 403 per

Line 165 in e47934f

(or better yet, round up to 1024; this is a small number these days).
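The presumed block-aligned caching semantics can be sketched as follows (this is the behavior hypothesized above, not necessarily what fsspec-xrootd does): a read of bytes 100-1200 with `block_size=1024` touches two blocks, which would then be cached for later reads.

```python
# Which aligned blocks does a byte-range read touch?
def blocks_for(start, stop, block_size=1024):
    first = start // block_size
    last = (stop - 1) // block_size
    return [(b * block_size, (b + 1) * block_size) for b in range(first, last + 1)]

print(blocks_for(100, 1200))  # [(0, 1024), (1024, 2048)]
```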
Ok, I implemented a file handle cache in CoffeaTeam/fsspec-xrootd#54, and now the sites that previously had a lot of trouble with the frequent file opens are doing better.
On the uproot side, we can save one additional open if we default to the stateless form:

uproot5/src/uproot/source/fsspec.py
Lines 99 to 103 in e47934f

@lobis, is there a reason you prefer a file handle here, maybe for reading local files?
Here's a summary of what I've attempted:
The legend entries correspond to progressively more patches to improve this situation:
- `+fsspec`: introducing the FSSpec source, as of v5.3.0
- `+cat_ranges`: #1162
- `+hcache`: CoffeaTeam/fsspec-xrootd#54
- `+readall`: CoffeaTeam/coffea#1076
- `+coalesce`: #1198
Note that for all sites the performance steadily improves, but it may in some cases have been improved by just one of the patches. I don't think any of these patches are harmful in alternative cases.
In my opinion, once all of these are merged we can close this issue.
All the updates are merged. I'll close this for now.