An efficient way to get formula expressions with multiplicity -1 as flat arrays

Question

ibab opened this issue 8 years ago · comments

In scikit-hep/root_pandas#30, we've realized that we need a way to flatten the lists of at most one element that root_numpy returns when a formula expression has ROOT multiplicity -1 (meaning 0 or 1 elements per entry).

One way a flat array could be returned would be by using NaN whenever the number of elements is 0, and the value of the single element otherwise.
Unfortunately, that would only work for floating point values (although returning 0 for integers might also be an option).
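As a rough sketch of the NaN idea, the snippet below flattens an object array of 0-or-1-element arrays using plain NumPy. The helper name `flatten_optional` and the sample data are made up for illustration; this is not root_numpy code, and the object array only stands in for what a multiplicity -1 expression yields.

```python
import numpy as np


def flatten_optional(column, fill=np.nan):
    """Flatten an object array of 0- or 1-element sequences into a 1-D array.

    Empty entries become `fill`. The output is always float64 so that NaN
    can be represented (hypothetical helper, not part of root_numpy).
    """
    out = np.full(len(column), fill, dtype=np.float64)
    for i, values in enumerate(column):
        if len(values) > 1:
            raise ValueError("entry %d has more than one element" % i)
        if len(values) == 1:
            out[i] = values[0]
    return out


# Stand-in for the per-entry arrays: one small array per tree entry.
column = np.empty(3, dtype=object)
column[0] = np.array([1.5])
column[1] = np.array([])      # entry with 0 elements
column[2] = np.array([2.5])

flat = flatten_optional(column)
```

Integer expressions are promoted to float here, which is the limitation mentioned above. A sentinel such as 0 would avoid the promotion, but it could not be told apart from a genuine 0.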

Instead of changing the behaviour of root_numpy, it might also be possible to provide a helper function written in Cython that can index into the underlying per-entry arrays.

Noel Dawe · Answer 1 · Sat Aug 20 2016 15:07:41 GMT+0800 (China Standard Time)

I've got a cythonized function to crop variable-length subarrays to arrays of fixed length (filling with user-specified defaults as required) and with the option to remove the nestedness of arrays of single elements. This should provide a way to solve the issue, as you suggest.
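The Cython function itself is not shown in this thread. The following is only a plain NumPy sketch of the behaviour described (crop or pad to a fixed length, fill with a default, optionally drop the nesting for single elements). The name `crop_to_fixed` and its parameters are assumptions for illustration.

```python
import numpy as np


def crop_to_fixed(column, length, default, dtype=np.float64, squeeze=False):
    """Crop or pad variable-length entries to `length` values each.

    Entries shorter than `length` are padded with `default`; longer ones
    are truncated. With squeeze=True and length == 1 the result is 1-D.
    Hypothetical helper, not the cythonized function mentioned above.
    """
    out = np.full((len(column), length), default, dtype=dtype)
    for i, values in enumerate(column):
        n = min(len(values), length)
        if n:
            out[i, :n] = values[:n]
    if squeeze and length == 1:
        return out[:, 0]
    return out


# Variable-length integer entries cropped to two values, padded with -1.
fixed = crop_to_fixed([[1, 2, 3], [4], []], 2, -1, dtype=np.int64)

# Entries with 0 or 1 elements turned into a flat array, 0.0 where empty.
single = crop_to_fixed([[7.0], [], [9.0]], 1, 0.0, squeeze=True)
```

Because the output is a new array, this approach copies the data, in contrast to filling the final array directly during conversion.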

I'm just wondering what the behaviour was in previous root_numpy versions. For multiplicity -1 and where entries had 0 elements did we have whatever happened to be in memory at that location being written to the array?

Nathanael Farley · Answer 2 · Wed Oct 12 2016 23:58:20 GMT+0800 (China Standard Time)

Finally had some time to investigate this.

Tried installing different root_numpy versions, but ran into a lot of problems, so decided to look at the source code.

As far as I can tell, in earlier versions (v4.4.0) the class FormulaColumn (used for converting ROOT branches to numpy array columns) always held a double, so presumably it unconditionally filled entries with 0 elements with NaN.

Constructor for FormulaColumn in v4.4.0:

    FormulaColumn(std::string _name, TTreeFormula* _formula)
    {
        name = _name;
        formula = _formula;
        type = "Double_t";
        value = new double[1];
    }

Now, FormulaColumn is typed via the template mechanism in C++, so the new code allows more than just double-typed columns. The code is too scattered to paste here, but see https://github.com/rootpy/root_numpy/blob/4.5.2/root_numpy/src/Column.h. This is what makes it possible to create object columns containing lists.

The new code, which creates lists of ints or doubles instead of just flat doubles, can be found at https://github.com/rootpy/root_numpy/blob/d31d9727e1d63ac7f9979d561065980d62d975c9/root_numpy/src/tree.pyx#L397

Nathanael Farley · Answer 3 · Mon Nov 14 2016 18:55:15 GMT+0800 (China Standard Time)

@ndawe do you possibly still have that Cython function around? It would be useful to have an upstream way to fix this problem (rather than the hackier way I've done in scikit-hep/root_pandas#30).

Noel Dawe · Answer 4 · Tue Nov 15 2016 00:02:16 GMT+0800 (China Standard Time)

I've been on the fence for a while, but had some time now to collect my thoughts. What do you think of the following options?

1. Use a masked record array (with masked fields) to handle multiplicity -1 (1 or 0 elements)
   http://stackoverflow.com/questions/7217606/how-can-i-mask-elements-of-a-record-array-in-numpy
   Pros: potentially easy to implement. Fits naturally into numpy.
   Cons: need to manage two arrays. The mask is an entirely separate array in memory and would be mostly wasteful if only a small fraction of fields need masking.
2. Add utility function to crop to fixed-length or impute with default values
   Pros: maintain existing behaviour of root_numpy. I have a prototype that would need some clean up before placing in root_numpy.
   Cons: will be creating a copy of the array with fields modified (can't be done in-place)
3. Specify cropped length and default value in tree2array's branches list with tuples:
   - multiplicity -1: 1 or 0 elements.
     Impute (always yield single value) with (expr_string, default_value) in branches list
   - multiplicity >=1: fixed and variable-length arrays.
     Crop with (expr_string, default_value, length) in branches list. default_value is used when array is shorter than requested cropped length.
   Pros: maintain existing behaviour of root_numpy. Just providing new functionality through the branches= argument. Performs operations in-place using no additional memory. Not copying the array... Explicitly specify default values for each case separately (can of course be different types).
   Cons: any?
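For reference, the masked-array option might look roughly like this in NumPy. It is a simplified sketch that uses a plain masked array for a single column rather than a full masked record array, and the sample entries are invented.

```python
import numpy as np

# Invented per-entry values for a multiplicity -1 expression.
entries = [[1.5], [], [2.5]]

data = np.zeros(len(entries), dtype=np.float64)
mask = np.ones(len(entries), dtype=bool)  # True marks a missing value

for i, values in enumerate(entries):
    if len(values):
        data[i] = values[0]
        mask[i] = False

# The mask is stored as a second array next to the data.
masked = np.ma.masked_array(data, mask=mask)

# Reductions skip the masked entries.
mean_of_present = masked.mean()
```

Unlike the NaN approach, this works for integer data too, since missing entries are marked by the mask and not by a special value. The cost is the extra boolean array.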

Personally I'm leaning toward option 3 since it is explicit, doesn't involve implementing our own string parsing on top of ROOT's TFormula, and doesn't involve unnecessary copying.
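To make the tuple forms concrete, here is a small sketch of how such a branches list could be written and interpreted. `BranchSpec` and `parse_branch_spec` are hypothetical names used only for illustration, the branch names are invented, and the real parsing inside root_numpy may differ.

```python
from collections import namedtuple

# Hypothetical normalized form of one item in the branches list.
BranchSpec = namedtuple("BranchSpec", ["expression", "fill_value", "length"])


def parse_branch_spec(spec):
    """Normalize a branches item into a BranchSpec (illustrative only).

    "expr"                   -> existing behaviour, nothing imputed or cropped
    ("expr", default)        -> always a single value per entry
    ("expr", default, length) -> cropped or padded to `length` values
    """
    if isinstance(spec, str):
        return BranchSpec(spec, None, None)
    if len(spec) == 2:
        expression, fill_value = spec
        return BranchSpec(expression, fill_value, 1)
    if len(spec) == 3:
        expression, fill_value, length = spec
        if length < 1:
            raise ValueError("length must be at least 1")
        return BranchSpec(expression, fill_value, length)
    raise ValueError("expected a string, a 2-tuple or a 3-tuple")


# Invented branch names; default values can differ in type per branch.
branches = [
    "n_tracks",
    ("lead_pt", float("nan")),
    ("jet_pt", 0.0, 3),
]
specs = [parse_branch_spec(b) for b in branches]

# Under option 3 the same list would be passed as
# tree2array(tree, branches=branches); that call needs ROOT and
# root_numpy and is not run here.
```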

Nathanael Farley · Answer 5 · Wed Nov 16 2016 00:34:43 GMT+0800 (China Standard Time)

Apologies if this is an obvious question, but I'm not too familiar with the root_numpy code and the answer isn't clear to me. Would changes in tree2array propagate to root2array, or would it make sense to implement them there as well? Since root2array is what's used in root_pandas, it would be helpful to have that functionality in root2array.

Personally, I'm a fan of option 3 (involves the least amount of copying and/or holding things in memory).

As an aside, it would mean some changes in root_pandas but I don't think it would be too bad (although @ibab is much more qualified to comment on this!)

Noel Dawe · Answer 6 · Wed Nov 16 2016 03:22:51 GMT+0800 (China Standard Time)

No problem. To clarify, root2array internally calls the same function as tree2array when converting to an array, so both would change the same way.

Noel Dawe · Answer 7 · Sun Dec 18 2016 14:05:49 GMT+0800 (China Standard Time)

This is now implemented in my branch_spec branch:

https://github.com/ndawe/root_numpy/tree/branch_spec

Take root_numpy from there if you want early access to this. I'll open a PR now and update the docs before making a new release (will be 4.7.0).

🎉

Noel Dawe · Answer 8 · Wed Dec 21 2016 09:31:53 GMT+0800 (China Standard Time)

#295 is merged! This is now in.

Geert-Jan Besjes · Answer 9 · Thu Dec 22 2016 20:31:16 GMT+0800 (China Standard Time)

Hi @ndawe, thanks a lot for this very useful feature, which also fixes some of the things discussed in #270! When do you plan to release 4.7.0?

It would be great to have this functionality in root2hdf5 as well, i.e. to specify a truncation length per vector branch. (A per-branch selection is probably a better fit for real-world use than a global setting.) Do you think that will be possible?

Noel Dawe · Answer 10 · Tue Jan 03 2017 08:38:11 GMT+0800 (China Standard Time)

Back from holiday! I'll release 4.7.0 today or tomorrow.

Good point about root2hdf5. This will definitely be possible.

Geert-Jan Besjes · Answer 11 · Tue Jan 10 2017 22:28:56 GMT+0800 (China Standard Time)

Hi @ndawe,

Happy new year and thanks a lot for the release! Since we wanted to play around with PyTables to interface things, I wondered whether the root2hdf5 function (whether called from the command line or in user code) supports this already. It looks like the userfunc argument can't be used for this, as root2hdf5() expects a tree to come out of this callback and not a numpy array.

Noel Dawe · Answer 12 · Wed Jan 11 2017 08:30:04 GMT+0800 (China Standard Time)

Hi @gbesjes Happy new year to you too! I'm going to update the root2hdf5 functions to accept extra keyword arguments that can be passed to the underlying tree2array function calls.

Noel Dawe · Answer 13 · Wed Jan 11 2017 09:24:13 GMT+0800 (China Standard Time)

Can you try rootpy 0.9.0 and pass your custom branches list as a branches= keyword argument to tree2hdf5 or root2hdf5?