scikit-hep / uproot5

ROOT I/O in pure Python and NumPy.

Home Page: https://uproot.readthedocs.io

Cryptic error message when filling histogram in the wrong way with Dask

alexander-held opened this issue · comments

While experimenting with ATLAS PHYSLITE files (we might be able to find a way to share something for debugging purposes, details to be figured out privately) I ran into a setup that does not make sense on paper: filling a histogram with a jagged array without flattening it. I did not realize at first that this was the case, and the error message is somewhat misleading. Full details are available in https://gist.github.com/alexander-held/56e203690c0d7f67218a4c67d46c586f. For uproot, this uses the current version of main (4112159).

I am opening this issue to inquire if the error message for such a case can be improved.

Two errors are raised in my trace (top of the notebook):

ValueError: When changing to a larger dtype, its size must be a divisor of the total size in bytes of the last axis of the array.

The above exception was the direct cause of the following exception:

and

ValueError: basket 323 in tree/branch /CollectionTree;1:METAssoc_AnalysisMETAux./METAssoc_AnalysisMETAux.jetLink has the wrong number of bytes (3318) for interpretation AsStridedObjects(Model_ElementLink_3c_DataVector_3c_xAOD_3a3a_Jet_5f_v1_3e3e__v1)
in file DAOD_PHYSLITE.34857549._000351.pool.root.1

Neither of these sounds obviously related to the actual issue to me, and the second one in particular is confusing, as I do not need to access this object at all.

In a setup without Dask (further down in the notebook), the error is more understandable:

ValueError: cannot convert to RegularArray because subarray lengths are not regular (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-26/awkward-cpp/src/cpu-kernels/awkward_ListOffsetArray_toRegularArray.cpp#L22)

This error occurred while calling

    numpy.asarray(
        <Array [[], [], [], ..., [], [9.61e+03]] type='163363 * var * float32'>
        dtype = None
    )

In particular, in this case the trace also includes boost_histogram to suggest that the histogramming is a problem here, while in the Dask case the trace fully stays within uproot + Dask.

In both cases (Dask or no Dask), flattening the array (commented out in the notebook) makes things work as expected.

I agree that the second error message makes sense: you can't fill a histogram with an unflattened ragged array.

The first error message text is the text you would get with a DeserializationError, but the type of the exception is ValueError. It's being swapped somewhere in dask-awkward.

But also, they're not the same error. With the DeserializationError (TBasket seems to have the wrong number of bytes), you never get to the stage where you have an Awkward Array. With the histogram-filling error, the error is in filling the histogram with an existing Awkward Array.

What we're seeing here is that when dask-awkward asks Uproot to interpret the file, Uproot can't read it, but when calling Uproot directly to read the file, it can. The histogram-filling problem afterward is not related to that because it comes much later. It doesn't make sense to me that dask-awkward would encounter a DeserializationError on the same file that Uproot can read, unless it's an intermittent DeserializationError (server sends good and bad versions of the file randomly?) that you just happened to see on a dask-awkward run and not on the only-Uproot run.

There's a problem here to solve, but it doesn't have to do with histograms. We might need more tests to narrow in on what's actually going wrong.

(When we have a better idea of what's going on, we'll need to change the title of the Issue.)

I think this issue has resolved itself: I just tried it again with updated dependencies and now obtain a reasonable error message:

    self._hist.fill(*args_ars, weight=weight_ars, sample=sample_ars)  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: All arrays must be 1D

This is using the following:

awkward            2.5.2
awkward-cpp        28
boost-histogram    1.4.0
coffea             2024.1.2
dask               2024.1.0
dask-awkward       2024.1.2
dask-histogram     2023.10.0
hist               2.7.2
uproot             5.2.1

Given that it now seems fine, from my side we can close this one.

> unless it's an intermittent DeserializationError (server sends good and bad versions of the file randomly?) that you just happened to see on a dask-awkward run and not on the only-Uproot run

I was running locally with a local file, so this was presumably not the case. Some updates in dask-awkward or elsewhere must have fixed this instead I assume.

> Some updates in dask-awkward or elsewhere must have fixed this instead I assume.

That's entirely likely, and I'm also willing to be optimistic, so I'll close this issue now. Thanks!