scikit-hep / uproot5

When I convert TProfile objects to hist that I've read from a ROOT file with uproot, I sometimes observe inconsistent values. For some input files, they're consistent, but for others, they're inconsistent - It depends on the input file. Unfortunately, I haven't figured out how to reproduce this in a simple way, so I've just attached two files which are produced in the same way and exhibit this behavior: (failing.root.txt (11MB), passing.root.txt (< 1 MB)). Both appear to be valid output files beyond this issue. The code below illustrates the behavior:

def test_uproot_profile_consistency_with_hist() -> None:
    import hist
    import numpy as np
    import uproot

    input_file = Path("failing.root")

    # Open and extract with uproot
    with uproot.open(input_file) as f:
        # Cast to a base class. Actual class is here: https://github.com/alisw/AliPhysics/blob/master/PWG/EMCAL/EMCALbase/AliEmcalList.h
        output_list = f["AliAnalysisTaskTrackSkim_pythia"].bases[0]
        x_sec_uproot = next(h for h in output_list if h.name == "fHistXsection")
        x_sec_uproot_hist = x_sec_uproot.to_hist()

    # Cross check with ROOT
    ROOT = pytest.importorskip("ROOT")
    with ROOT.TFile.Open(str(input_file), "READ") as f_ROOT:
        output_list = f_ROOT.Get("AliAnalysisTaskTrackSkim_pythia")
        # Cast to a base class. Actual class is here: https://github.com/alisw/AliPhysics/blob/master/PWG/EMCAL/EMCALbase/AliEmcalList.h
        output_list = ROOT.bind_object(ROOT.addressof(output_list), "TList")

        x_sec_temp = output_list.FindObject("fHistXsection")
        x_sec_temp.SetDirectory(0)
    x_sec_ROOT_values = np.array([x_sec_temp.GetBinContent(i) for i in range(1, x_sec_temp.GetNbinsX() + 1)], dtype=np.float64)

    print(f"Uproot: {uproot.__version__}, hist: {hist.__version__}")

    # The standard uproot values before conversion appear correct
    np.testing.assert_allclose(x_sec_uproot.values(), x_sec_ROOT_values)
    # Fails here for `failing.root`, but fine for `passing.root`
    np.testing.assert_allclose(x_sec_uproot.values(), x_sec_uproot_hist.values())

It strikes as odd that it depends on the particular file when they were generated the same way (with different inputs). I wonder if it's due to some issue with accessing the base class/the cast that I do. It's required because the profile is stored in a TList derived class that's part of our experimental software stack (note that I don't care about any of the additional info stored in the derived class. It's available here if helpful). I'm concerned this is perhaps corrupting an expected memory layout or otherwise causing an issue. Since this class is unfortunately driven by an experiment software stack constraint, it's difficult for me to avoid. If there's a better way for me to workaround this, I'd be happy to use that instead.

Bottom line: I suspect there might be an uproot issue here, but I'm concerned that it's instead due to an experimental software stack quirk. Help here is greatly appreciated - thanks!

>>> import uproot, hist
... print(f"Uproot: {uproot.__version__}, hist: {hist.__version__}")
Uproot: 5.3.1, hist: 2.7.2

You were involved in #908, so you know about #1000 (and #1148 is someone else running into it): I'm not convinced that ROOT's TProfile and boost-histogram's WeightedMean Storage are the same thing.

It might be more helpful—you might see a difference between failing.root and passing.root—if you look at the raw quantities on that are stored in the file, the members of the Model_TProfile_v*.

The TH1D base class, which has values (direct bin contents):

uproot5/src/uproot/models/TH.py

Lines 3881 to 3891 in 21735bf

    
           self._bases.append( 
        
               file.class_named("TH1D", 3).read( 
        
                   chunk, 
        
                   cursor, 
        
                   context, 
        
                   file, 
        
                   self._file, 
        
                   self._parent, 
        
                   concrete=self.concrete, 
        
               ) 
        
           )

fBinEntries, which is another per-bin quantity:

uproot5/src/uproot/models/TH.py

Lines 3892 to 3894 in 21735bf

    
           self._members["fBinEntries"] = file.class_named("TArrayD").read( 
        
               chunk, cursor, context, file, self._file, self.concrete 
        
           )

and fBinSumw2:

uproot5/src/uproot/models/TH.py

Lines 3902 to 3904 in 21735bf

    
           self._members["fBinSumw2"] = file.class_named("TArrayD").read( 
        
               chunk, cursor, context, file, self._file, self.concrete 
        
           )

The way that the values and errors of the profile are computed from these raw quantities is not at all trivial:

uproot5/src/uproot/behaviors/TProfile.py

Lines 63 to 129 in 21735bf

    
           # closely follows the ROOT function, using the same names (with 'root_' prepended) 
        
           # https://github.com/root-project/root/blob/ffc7c588ac91aca30e75d356ea971129ee6a836a/hist/hist/src/TProfileHelper.h#L660-L721 
        
           def _values_errors_1d(error_mode, fBinEntries, root_cont, fSumw2, fNcells, fBinSumw2): 
        
               if error_mode is None or error_mode == _kERRORMEAN or error_mode == "": 
        
                   error_mode = _kERRORMEAN 
        
               elif error_mode == _kERRORSPREAD or error_mode == "s": 
        
                   error_mode = _kERRORSPREAD 
        
               elif error_mode == _kERRORSPREADI or error_mode == "i": 
        
                   error_mode = _kERRORSPREADI 
        
               elif error_mode == _kERRORSPREADG or error_mode == "g": 
        
                   error_mode = _kERRORSPREADG 
        
               else: 
        
                   raise ValueError( 
        
                       "error_mode must be None/0/'' for error-on-mean,\n" 
        
                       "                        1/'s' for spread (variance),\n" 
        
                       "                        2/'i' for integer spread (using sqrt(12)),\n" 
        
                       "                     or 3/'g' for Gaussian spread\n" 
        
                       "                    not " 
        
                       + repr(error_mode) 
        
                       + "see https://root.cern.ch/doc/master/classTProfile.html" 
        
                   ) 
        
               root_sum = fBinEntries 
        
               root_sum = numpy.asarray(root_sum, dtype=numpy.float64) 
        
               nonzero = root_sum != 0 
        
               root_contsum = numpy.zeros(len(root_cont), dtype=numpy.float64) 
        
               root_contsum[nonzero] = root_cont[nonzero] / root_sum[nonzero] 
        
               if error_mode == _kERRORSPREADG: 
        
                   out = numpy.zeros(len(root_cont), dtype=numpy.float64) 
        
                   out[nonzero] = 1.0 / numpy.sqrt(root_sum[nonzero]) 
        
                   return root_contsum, out 
        
               root_err2 = fSumw2 
        
               if root_err2 is None or len(root_err2) != fNcells: 
        
                   root_err2 = numpy.zeros(len(root_cont), dtype=numpy.float64) 
        
               else: 
        
                   root_err2 = numpy.asarray(root_err2, dtype=numpy.float64) 
        
               root_neff = _effective_counts_1d(fBinEntries, fBinSumw2, fNcells) 
        
               root_eprim2 = numpy.zeros(len(root_cont), dtype=numpy.float64) 
        
               root_eprim2[nonzero] = abs( 
        
                   root_err2[nonzero] / root_sum[nonzero] - root_contsum[nonzero] ** 2 
        
               ) 
        
               root_eprim = numpy.sqrt(root_eprim2) 
        
               if error_mode == _kERRORSPREADI: 
        
                   numer = numpy.ones(len(root_cont), dtype=numpy.float64) 
        
                   denom = numpy.full(len(root_cont), numpy.sqrt(12 * root_neff)) 
        
                   eprim_nonzero = root_eprim != 0 
        
                   numer[eprim_nonzero] = root_eprim[eprim_nonzero] 
        
                   denom[eprim_nonzero] = numpy.sqrt(root_neff[eprim_nonzero]) 
        
                   out = numpy.zeros(len(root_cont), dtype=numpy.float64) 
        
                   out[nonzero] = numer[nonzero] / denom[nonzero] 
        
                   return root_contsum, out 
        
               if error_mode == _kERRORSPREAD: 
        
                   root_eprim[~nonzero] = 0.0 
        
                   return root_contsum, root_eprim 
        
               out = numpy.zeros(len(root_cont), dtype=numpy.float64) 
        
               out[nonzero] = root_eprim[nonzero] / numpy.sqrt(root_neff[nonzero]) 
        
               return root_contsum, out

Thanks for the pointers Jim! I had missed #1148 - apologies.

After #1000, I did some validation with multiple files and it seemed to be working on individual TProfiles, so I was a bit surprised that some files work while others don't. However, I now see your point that the calculation is quite involved, which may explain the differences. I'll plan to dig into this further when I have some free moments

	self._bases.append(
	file.class_named("TH1D", 3).read(
	chunk,
	cursor,
	context,
	file,
	self._file,
	self._parent,
	concrete=self.concrete,
	)
	)

	self._members["fBinEntries"] = file.class_named("TArrayD").read(
	chunk, cursor, context, file, self._file, self.concrete
	)

	self._members["fBinSumw2"] = file.class_named("TArrayD").read(
	chunk, cursor, context, file, self._file, self.concrete
	)

	# closely follows the ROOT function, using the same names (with 'root_' prepended)
	# https://github.com/root-project/root/blob/ffc7c588ac91aca30e75d356ea971129ee6a836a/hist/hist/src/TProfileHelper.h#L660-L721
	def _values_errors_1d(error_mode, fBinEntries, root_cont, fSumw2, fNcells, fBinSumw2):
	if error_mode is None or error_mode == _kERRORMEAN or error_mode == "":
	error_mode = _kERRORMEAN
	elif error_mode == _kERRORSPREAD or error_mode == "s":
	error_mode = _kERRORSPREAD
	elif error_mode == _kERRORSPREADI or error_mode == "i":
	error_mode = _kERRORSPREADI
	elif error_mode == _kERRORSPREADG or error_mode == "g":
	error_mode = _kERRORSPREADG
	else:
	raise ValueError(
	"error_mode must be None/0/'' for error-on-mean,\n"
	" 1/'s' for spread (variance),\n"
	" 2/'i' for integer spread (using sqrt(12)),\n"
	" or 3/'g' for Gaussian spread\n"
	" not "
	+ repr(error_mode)
	+ "see https://root.cern.ch/doc/master/classTProfile.html"
	)

	root_sum = fBinEntries
	root_sum = numpy.asarray(root_sum, dtype=numpy.float64)
	nonzero = root_sum != 0

	root_contsum = numpy.zeros(len(root_cont), dtype=numpy.float64)
	root_contsum[nonzero] = root_cont[nonzero] / root_sum[nonzero]

	if error_mode == _kERRORSPREADG:
	out = numpy.zeros(len(root_cont), dtype=numpy.float64)
	out[nonzero] = 1.0 / numpy.sqrt(root_sum[nonzero])
	return root_contsum, out

	root_err2 = fSumw2
	if root_err2 is None or len(root_err2) != fNcells:
	root_err2 = numpy.zeros(len(root_cont), dtype=numpy.float64)
	else:
	root_err2 = numpy.asarray(root_err2, dtype=numpy.float64)

	root_neff = _effective_counts_1d(fBinEntries, fBinSumw2, fNcells)

	root_eprim2 = numpy.zeros(len(root_cont), dtype=numpy.float64)
	root_eprim2[nonzero] = abs(
	root_err2[nonzero] / root_sum[nonzero] - root_contsum[nonzero] ** 2
	)
	root_eprim = numpy.sqrt(root_eprim2)

	if error_mode == _kERRORSPREADI:
	numer = numpy.ones(len(root_cont), dtype=numpy.float64)
	denom = numpy.full(len(root_cont), numpy.sqrt(12 * root_neff))

	eprim_nonzero = root_eprim != 0
	numer[eprim_nonzero] = root_eprim[eprim_nonzero]
	denom[eprim_nonzero] = numpy.sqrt(root_neff[eprim_nonzero])

	out = numpy.zeros(len(root_cont), dtype=numpy.float64)
	out[nonzero] = numer[nonzero] / denom[nonzero]
	return root_contsum, out

	if error_mode == _kERRORSPREAD:
	root_eprim[~nonzero] = 0.0
	return root_contsum, root_eprim

	out = numpy.zeros(len(root_cont), dtype=numpy.float64)
	out[nonzero] = root_eprim[nonzero] / numpy.sqrt(root_neff[nonzero])
	return root_contsum, out

Sometimes inconsistent values in hist conversion from TProfile read via uproot