scikit-hep / root_numpy

The interface between ROOT and NumPy

Home Page: http://scikit-hep.org/root_numpy


Integrating a selection from a TTree more efficiently than TTree::Draw()

gbesjes opened this issue · comments

I'm looking for a fast way to select certain events from a TTree, including one or more weights, and then integrating them.

A typical use-case for this is for example a cutflow. I now achieve that by selecting events into a histogram in a utility function:

selection = "({0}) * ({1})".format(selection, weight)
tree.Draw(var,
          selection=selection,
          hist=hist)

where this is created and filled from another function that does the following:

hist = Hist(1, -1, 2)
branch = tree.GetListOfBranches()[0].GetName()
loadHistogramFromTree(tree, hist, '{0}=={0}'.format(branch), cut, weight) 
return hist.Integral()

In other words, I just select a weighted set of entries from a TTree and am interested in what the total number of events passing a selection is.

I've looked into tree2array, but notice several downsides:

  • weight_name is just a single weight, while I'm looking to multiply several. Perhaps a useful improvement would be the possibility to specify weight_names as a list of names?
  • the weights are branches that exist in the TTree. What if I'd like to scale to the luminosity by doing something like 12.345/0.001 * normWeight (where this column is a weight derived from the cross-section of the sample and 12.345 is an example luminosity)?

Solving these issues would also enable a more efficient plotter, for when one wants to select events along the lines of something like

12.345 / 0.001 * join("*", weights) * (cut)

into histograms and stack them all.

Perhaps I misunderstand how to combine tree2array and array2hist to get a weighted histogram, but right now I'm stuck with the old-fashioned draw methods.

Would anybody have a suggestion? Anything that relies on something faster than TTree::Draw() would be great - that would allow me to benefit from the nice benchmark figures advertised :)

@gbesjes thanks a lot for this feedback. I'll try to clarify a bit how you can accomplish what you need.

Firstly, tree2array's weight_name just assigns a name to the field in the output array that will hold the value of tree.GetWeight() (the same value for every entry of a given tree in a chain). We just wanted a convenient way to extract that info from the tree so it can be treated like any other branch that represents a weight. weight_name is configurable to give the user the ability to avoid clashing with existing branch names in the tree.

To fill a histogram with weighted entries where the weights are products of weight branches (and any other factors) then try something like this:

import numpy as np
from functools import reduce  # reduce moved to functools in Python 3

weight_branches = ['your', 'weights']
arr = tree2array(tree)
# entry-by-entry product of all the weight columns
weights = reduce(np.multiply, [arr[br] for br in weight_branches])
fill_hist(hist, arr['branch'], weights)
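The product-of-weights step is pure NumPy and can be tried in isolation; here is a minimal sketch with a synthetic structured array standing in for the tree2array() output (the field names are invented):

```python
import numpy as np
from functools import reduce  # reduce moved to functools in Python 3

# Synthetic stand-in for the structured array tree2array() returns
arr = np.array([(1.0, 0.5, 2.0), (2.0, 0.25, 4.0)],
               dtype=[('branch', 'f8'), ('w1', 'f8'), ('w2', 'f8')])

weight_branches = ['w1', 'w2']
# entry-by-entry product of all the weight columns
weights = reduce(np.multiply, [arr[br] for br in weight_branches])
print(weights)  # [1. 1.]
```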

Possibly even better is to just give tree2array the complete expression (at least the factors that are branches in the tree) that produces the entry weights:

weights = tree2array(tree, branches='branch1 * branch2 * 12.345 / 0.001')

> To fill a histogram with weighted entries where the weights are products of weight branches (and any other factors) then try something like this:

Actually, it would be much more efficient to add an alias to the tree and retrieve the branch by that alias. At least that way, you're relying directly on ROOT to do this.

tree.SetAlias('alias','formula')

and root_numpy can access this normally without a problem. We might want to think of incorporating something like this to make this less involved.

But root_numpy anyway uses TTreeFormula for any expression that isn't a branch name. So an alias doesn't really change anything.

(oops clicked wrong button 😄 )

> But root_numpy anyway uses TTreeFormula for any expression that isn't a branch name. So an alias doesn't really change anything.

Ahh, I didn't see the second point in your post. Yeah, if you're using TTreeFormula, then it should just work as expected!

@ndawe thanks for the quick answer! That solves what I want to achieve. Would there be an obvious improvement in tree2array() if I specify a branch like "1==1" in this case? For a cutflow I'm not really interested in a variable, just a count is enough :)

And of course the same thing is true for a distribution: I assume that if I want to plot variables X, Y and Z the code will be a lot more performant if only those branches are thrown into numpy arrays. Is that indeed the case?

Yeah, "branches" can be a list of branch names and/or expressions. It should be able to handle anything that you can throw at TTree.Draw(). We named the argument "branches" before expressions were supported. In hindsight something like "fields" might have been more appropriate.

Excellent! I thought this wasn't possible because of the name "branches". I'll try it out after a few meetings. Perhaps the docs can clarify that any valid expression also works, in case other people run into the same issue.

For a cutflow you could also just sum up the array lengths. Yes, if you only want a subset of the branches (possibly mixed with expressions) then specifying them with the branches argument will lead to the conversion only reading in and including those particular branches and expressions. If you only need O(10) branches in an ntuple containing thousands of branches, this can be a huge speedup.

I'll improve the docs on branches. No problem.

I believe there's also a special $Count(branch) formula you can use as well.

For a raw cutflow I agree, but not if I'd like to multiply these events with something like their mcWeight and scale factor. Then the data are not [1, 1, 1, 1, ...] but they're weighted e.g. to [0.7, 0.8, 0.7, 0.9, ...]. Unless I'm overlooking something extremely obvious here? 😄

Ah, yes indeed. Just use arr['weight_expression'].sum() for a weighted cutflow.
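In plain NumPy terms a weighted cutflow entry is just the sum of the per-entry weights after the cut; a self-contained sketch with invented column names and values:

```python
import numpy as np

# Stand-in for a converted tree: one variable and one weight column
arr = np.array([(5.0, 0.7), (25.0, 0.8), (15.0, 0.9)],
               dtype=[('A', 'f8'), ('mcWeight', 'f8')])

cut = arr['A'] > 10                          # boolean selection mask
raw_count = cut.sum()                        # unweighted cutflow entry
weighted_count = arr['mcWeight'][cut].sum()  # weighted cutflow entry
print(raw_count, weighted_count)
```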

The selection actually doesn't deal too nicely with vector indices. When a branch element like electrons_pt[0] is asked for, the data is structured as follows:

[array([ 85.83097839]) array([ 174.27775574]) array([ 87.52495575]) ...,
 array([ 711.8416748]) array([ 734.52056885]) array([ 107.2477417])]

That strikes me as wrong: since I asked for a specific index, shouldn't each of these individual arrays be a float instead? Of course in a post-processing step this is pretty easy to achieve, but it's not a structure that I had expected to get back.

Can you call foo.flatten() on that structure instead? Should probably work nicely.

This last issue is coincidentally something I've been thinking about recently. The issue is that ROOT specifies a "multiplicity" for an expression, telling us the possible number of values to expect for each entry in the tree. In this case you expect either 1 or 0 values per entry, since in some entries your electrons_pt array might be empty. But at the moment root_numpy doesn't have a mechanism to specify default values for when the expression's single-element array is empty for a particular tree entry. I agree we need something like this, so that we instead produce single elements with default values rather than nested arrays, which are awkward to deal with.

For now, you can use root_numpy's stretch function: http://rootpy.github.io/root_numpy/reference/generated/root_numpy.stretch.html#root_numpy.stretch

electrons_pt = stretch(arr, fields=['electrons_pt[0]'])['electrons_pt[0]']
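For this particular case, the extraction-with-default behaviour can also be emulated in plain NumPy; a sketch using the values from the printout above (the -999.0 fill value is an arbitrary choice, not a root_numpy default):

```python
import numpy as np

# Stand-in for the dtype=object column returned for 'electrons_pt[0]':
# each entry holds a subarray that may be empty
col = np.empty(3, dtype=object)
col[0] = np.array([85.83097839])
col[1] = np.array([])            # an event with no electrons
col[2] = np.array([107.2477417])

default = -999.0  # arbitrary fill value for empty entries
flat = np.array([sub[0] if len(sub) else default for sub in col])
print(flat)
```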

@kratsg flatten won't work on dtype=object nested arrays, I believe.

@kratsg, @ndawe : nope, it indeed won't:

print(arr["electrons_pt[0]"].flatten())
[array([ 85.83097839]) array([ 174.27775574]) array([ 87.52495575]) ...,
 array([ 711.8416748]) array([ 734.52056885]) array([ 107.2477417])]

I suspected it indeed had something to do with ROOT's internals. Perhaps, as an intermediate solution, there could be a way for the user to specify that they're requesting a single entry from a vector, so that stretch() may be called automagically?

yup, not on nested dtype=object. Use root_numpy.stretch.

If you use the latest master of root_numpy you can also use the shorter:

electrons_pt = stretch(arr, 'electrons_pt[0]')

@kratsg if an array is dtype=object then it is already "flattened", since each element is just a PyObject pointer to whatever. numpy knows nothing about the shape/type of what is in each array element.

I wasn't aware that the array objects were the C-type array objects and not the np.array objects. The latter case flattens correctly, the former doesn't.

root_numpy uses dtype=object since in general the nested subarrays are variable-length, and can be doubly nested too.

stretch is a very handy function when dealing with nested subarrays for objects within events 😄

This works like a charm! Let's say that I've converted an entire tree into a structured numpy array. How do I best slice that with an additional cut? Take the cutflow example again: I can do a selection on "A > 5". If that is to be followed by a cut "B < 20" and then "C > 100", these can of course be added and I can re-run tree2array() - but then I'm doing that N times. Is there a cleverer way in pure numpy to achieve that, if the cuts are strings?

I've used https://github.com/pydata/numexpr for things like that, or even just python's builtin eval function with an appropriate setting of globals/locals.

http://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html

So if your array is foo, then you can do something like

foo[np.where(foo['njets'] > 2)]

where np.where(foo['njets'] > 2) selects the passing entries. You can also just do

np.where(np.logical_and(x >= 10, x <= 25)) as well.
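Chaining boolean masks like this gives a full cutflow without re-running tree2array; a self-contained sketch with invented branch names and values:

```python
import numpy as np

# Stand-in for a converted tree with three cut variables
arr = np.array([(6.0, 10.0, 150.0), (4.0, 10.0, 150.0), (6.0, 30.0, 150.0)],
               dtype=[('A', 'f8'), ('B', 'f8'), ('C', 'f8')])

mask = np.ones(len(arr), dtype=bool)
cutflow = []
for name, cut in [('A > 5', arr['A'] > 5),
                  ('B < 20', arr['B'] < 20),
                  ('C > 100', arr['C'] > 100)]:
    mask &= cut                      # apply cuts cumulatively
    cutflow.append((name, mask.sum()))
print(cutflow)
```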

If the cuts are strings, then numexpr.evaluate() or eval() are probably your best bet. This is interesting and provides some example code: https://mail.scipy.org/pipermail/scipy-user/2010-November/027276.html

Basically:

passes = numexpr.evaluate('some_very_complex_expression_string',
                          local_dict={name: array[name] for name in array.dtype.names})
passed_events = array[passes]
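The builtin eval route mentioned above works the same way when numexpr isn't available: build a namespace from the array's columns and evaluate the cut string against it. A self-contained sketch (column names and the cut string are invented):

```python
import numpy as np

arr = np.array([(6.0, 10.0), (4.0, 30.0)],
               dtype=[('A', 'f8'), ('B', 'f8')])

# Namespace for the expression: the array's columns plus any constants
namespace = {name: arr[name] for name in arr.dtype.names}
namespace['pi'] = np.pi

cut_string = '(A > 5) & (B < 20)'  # numexpr would accept the same string
passes = eval(cut_string, {'__builtins__': {}}, namespace)
passed = arr[passes]
print(len(passed))  # 1
```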

@kratsg: that's easy when writing it out - but it's a bit harder if the user specifies a configuration file: that would involve some nasty cut parsing. And accessing the vectorial branches would be more hackish.

Thanks for the numexpr suggestion! That works perfectly. Now I just gotta figure out how TMath::Phi_mpi_pi() can be implemented there :)

@gbesjes -- I use numexpr for https://github.com/kratsg/Optimization which uses cutstrings that are configurable. The easiest way to incorporate Pi is pretty straightforward:

numexpr.evaluate('2*pi', {'pi': np.pi})

works just as well. Think of the dictionary you're adding in not as your data, but the namespace for your numerical expression.

@kratsg : how do you deal with vectorial branches? numexpr doesn't appear to support indexing of them.

@gbesjes you might need to first massage your array into another array with some things stretched flat or cropped to fixed length before passing to numexpr. I have a few functions to make these operations easier in personal code that I've been considering placing in root_numpy eventually. Hopefully soon!

@gbesjes - what do you mean "vectorial" branches? Vector of vectors? Numpy doesn't handle these very well unfortunately (which means numexpr can't as easily). My workaround is just to make sure everything is "flattened" when I make it, so I end up having branches like jet_pt_0, jet_pt_1, ...

@kratsg: a std::vector branch, like electrons_pt[]. I'll do the same thing and stretch them. Is there an automated way to achieve this? My numpy knowledge is obviously lacking too much for what I want to do currently.

@kratsg, @ndawe: would either of you have an example of how to achieve this unpacking of the vectors? I've tried to cook up a solution but can't figure out how to do it efficiently. One of the issues in using stretch() is that it's not guaranteed that each event has the same number of electrons.

@gbesjes here is an example function from some personal code:

def subfixedlength(rec, length, fill_value=None, return_indices=False):
    """
    Truncate variable-length object fields to fixed length.
    A Cythonized version of this function will be introduced in root_numpy.
    If length==1 then the subarray will become a scalar.
    """
    if not rec.shape[0]:
        raise ValueError("cannot truncate empty structured array")
    first_rec = rec[0]
    if length == 1:
        # make this a scalar
        dtype = [(rec.dtype.names[i], first_rec[i].dtype)
                 for i in range(len(first_rec))]
    else:
        dtype = [(rec.dtype.names[i], first_rec[i].dtype, (length,))
                 for i in range(len(first_rec))]
    out = np.empty(rec.shape[0], dtype=dtype)
    if fill_value is not None:
        if isinstance(fill_value, dict):
            # iterate over (name, value) pairs: items(), not the dict itself
            for name, value in fill_value.items():
                out[name].fill(value)
        else:
            out.fill(fill_value)
    indices = np.ones(rec.shape[0], dtype=bool)
    idx = 0
    if length == 1:
        for record in rec:
            if record[0].shape[0] == 0:
                indices[idx] = False
            else:
                for ifield, field in enumerate(record):
                    out[idx][ifield] = field[0]
            idx += 1
    else:
        for record in rec:
            if record[0].shape[0] < length:
                indices[idx] = False
            for ifield, field in enumerate(record):
                out[idx][ifield][:min(field.shape[0], length)] = field[:length]
            idx += 1
    if return_indices:
        return out, indices
    return out

Clearly a bit slow since it's looping in python, but I want to Cythonize this and put something like it in root_numpy soon.

root_numpy's stretch also has a return_indices argument (default False); when True, it will return the original indices of the elements in the subarrays, which indirectly allows you to associate the object elements with the original event indices.

For example, given [[2, 3, 1], [0, 2]] as an object field in a structured array, stretch(a, return_indices=True) would give [2, 3, 1, 0, 2] as the values and [0, 1, 2, 0, 1] as the indices (the index restarts at 0 for the first element of each subarray).
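That behaviour can be checked in plain NumPy for the quoted example:

```python
import numpy as np

# The object field quoted above: [[2, 3, 1], [0, 2]]
subarrays = [np.array([2, 3, 1]), np.array([0, 2])]

values = np.concatenate(subarrays)
indices = np.concatenate([np.arange(len(sub)) for sub in subarrays])
print(values.tolist())   # [2, 3, 1, 0, 2]
print(indices.tolist())  # [0, 1, 2, 0, 1]
```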

Hi @ndawe,
I'm afraid that functionality is currently broken for me (it crashes when accessing record[0].shape). And it flattens all the vector branches to the same maximum length, which is not exactly desirable. It would be great to have a function in root_numpy that ensures each column - let's say taus_eta - is truncated to whatever the maximum is for that column (or a user-specified column-dependent maximum) and that allows the user to set the type.

That is, for a certain array, something like

transform_array(array, maxima={"taus_eta" : 3, "electrons_eta" : 11}, types={"taus_eta": np.float32, "electrons_eta" : np.float32})

as that would give the user full control over the data, while massaging it into a format that numpy can take.
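No such transform_array exists in root_numpy; purely as an illustration of the requested behaviour, here is a rough pure-NumPy sketch (the name, the zero padding, and the per-column maxima are all assumptions):

```python
import numpy as np

def transform_array(arr, maxima, types):
    """Hypothetical sketch: truncate/pad each object column to its own
    fixed length and cast it to the requested dtype."""
    out = {}
    for name, n in maxima.items():
        col = np.zeros((len(arr), n), dtype=types[name])
        for i, sub in enumerate(arr[name]):
            k = min(len(sub), n)
            col[i, :k] = sub[:k]      # truncate to the per-column maximum
        out[name] = col
    return out

# Tiny demo with an object-dtype column of variable length
a = np.empty(2, dtype=[('taus_eta', object)])
a['taus_eta'][0] = np.array([0.1, 0.2, 0.3, 0.4])
a['taus_eta'][1] = np.array([0.5])
out = transform_array(a, maxima={'taus_eta': 3}, types={'taus_eta': np.float32})
print(out['taus_eta'].shape)  # (2, 3)
```

A real implementation would also need a configurable fill value and the index bookkeeping discussed above, since padding short events with zeros is rarely what one wants physically.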

I'll fix that crash due to record[0].shape. Regarding the truncation, this is ongoing work around #266 (see option 3 there)

Can you also just paste the stacktrace you saw here?

See #295. That will add support for truncation requested in #266