scikit-hep / root_numpy

The interface between ROOT and NumPy

Home Page: http://scikit-hep.org/root_numpy

An efficient way to get formula expressions with multiplicity -1 as flat arrays

ibab opened this issue · comments

In scikit-hep/root_pandas#30, we've realized that we need a way to unpack, into flat arrays, the single-element lists that root_numpy returns when a formula expression has ROOT multiplicity -1 (meaning 0 or 1 elements per entry).

One way a flat array could be returned would be by using NaN whenever the number of elements is 0, and the value of the single element otherwise.
Unfortunately, that would only work for floating point values (although returning 0 for integers might also be an option).
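To make the NaN idea concrete, here is a minimal sketch in plain NumPy. It does not call root_numpy; the object array below only emulates the per-entry arrays described above, and `flatten_single` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def flatten_single(column, fill=np.nan, dtype=float):
    """Flatten an object array whose entries each hold 0 or 1 values.

    Hypothetical helper: entries with 0 elements become `fill`.
    """
    out = np.full(len(column), fill, dtype=dtype)
    for i, entry in enumerate(column):
        if len(entry) == 1:
            out[i] = entry[0]
        elif len(entry) > 1:
            raise ValueError("entry %d has %d elements" % (i, len(entry)))
    return out

# Emulated multiplicity -1 output: one small array per entry.
column = np.empty(3, dtype=object)
column[0] = np.array([1.5])
column[1] = np.array([])       # no element for this entry
column[2] = np.array([-2.0])

flat = flatten_single(column)  # floats: missing entries become NaN

# Integers have no NaN, so a sentinel such as 0 has to be chosen instead.
int_column = np.empty(2, dtype=object)
int_column[0] = np.array([7])
int_column[1] = np.array([], dtype=np.int64)
int_flat = flatten_single(int_column, fill=0, dtype=np.int64)
```

Note that with an integer sentinel a genuine 0 can no longer be told apart from a missing value, which is the limitation mentioned above.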

Instead of changing the behaviour of root_numpy, it might also be possible to provide a helper function written in Cython that can index into the underlying per-entry arrays.

I've got a cythonized function to crop variable-length subarrays to arrays of fixed length (filling with user-specified defaults as required) and with the option to remove the nestedness of arrays of single elements. This should provide a way to solve the issue, as you suggest.
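The cythonized function itself is not shown in this thread. As a rough pure-Python illustration of the behaviour described (crop to a fixed length, fill with a user-specified default, optionally un-nest single elements), something like the following could serve. The name `crop_or_pad` and its parameters are assumptions made for this sketch.

```python
import numpy as np

def crop_or_pad(column, length, fill, dtype=float, squeeze=False):
    """Crop or pad variable-length subarrays to a fixed length.

    Hypothetical helper: short entries are padded with `fill`, long
    entries are truncated. With squeeze=True and length == 1 the
    result is a flat 1-D array instead of shape (n, 1).
    """
    out = np.full((len(column), length), fill, dtype=dtype)
    for i, entry in enumerate(column):
        n = min(len(entry), length)
        out[i, :n] = entry[:n]
    if squeeze and length == 1:
        return out[:, 0]
    return out

# Emulated variable-length column: 3, 0 and 1 elements per entry.
column = np.empty(3, dtype=object)
column[0] = np.array([1.0, 2.0, 3.0])
column[1] = np.array([])
column[2] = np.array([4.0])

fixed = crop_or_pad(column, length=2, fill=-1.0)
flat = crop_or_pad(column, length=1, fill=-1.0, squeeze=True)
```

This version loops in Python and returns a new array, so it is only meant to show the semantics, not the performance of a Cython implementation.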

I'm just wondering what the behaviour was in previous root_numpy versions. For multiplicity -1, where entries had 0 elements, was whatever happened to be in memory at that location written to the array?

Finally had some time to investigate this.

Tried installing different root_numpy versions, but ran into a lot of problems, so decided to look at the source code.

As far as I can tell, previously (v4.4.0) the class FormulaColumn (used to convert ROOT branches into NumPy array columns) always had type double, so presumably it simply filled entries that had 0 elements with NaN.

Constructor for FormulaColumn in v4.4.0.

    FormulaColumn(std::string _name, TTreeFormula* _formula)
    {
        name = _name;
        formula = _formula;
        type = "Double_t";
        value = new double[1];
    }

Now, FormulaColumn is typed via C++ templates, so the new code allows columns of types other than double. (The code is too scattered to paste here, but see https://github.com/rootpy/root_numpy/blob/4.5.2/root_numpy/src/Column.h.) This is what makes it possible to create object columns containing lists.

The new code, which creates lists of ints or doubles instead of just flat doubles, can be found at https://github.com/rootpy/root_numpy/blob/d31d9727e1d63ac7f9979d561065980d62d975c9/root_numpy/src/tree.pyx#L397

@ndawe do you possibly still have that Cython function around? It would be useful to have an upstream way to fix this problem (rather than the hackier way I've done in scikit-hep/root_pandas#30).

I've been on the fence for a while, but had some time now to collect my thoughts. What do you think of the following options?

  1. Use a masked record array (with masked fields) to handle multiplicity -1 (1 or 0 elements)
    http://stackoverflow.com/questions/7217606/how-can-i-mask-elements-of-a-record-array-in-numpy
    Pros: potentially easy to implement. Fits naturally into numpy
    Cons: need to manage two arrays. The mask is an entirely separate array in memory
    and would be mostly wasteful if only a small fraction of fields need masking.

  2. Add utility function to crop to fixed-length or impute with default values
    Pros: maintain existing behaviour of root_numpy. I have a prototype that would need some clean up before placing in root_numpy.
    Cons: will be creating a copy of the array with fields modified (can't be done in-place)

  3. Specify cropped length and default value in tree2array's branches list with tuples:

    • multiplicity -1: 1 or 0 elements.
      Impute (always yield single value) with (expr_string, default_value) in branches list
    • multiplicity >=1: fixed and variable-length arrays.
      Crop with (expr_string, default_value, length) in branches list. default_value is used when array is shorter than requested cropped length.

    Pros: maintain existing behaviour of root_numpy. Just providing new functionality through the branches= argument. Performs operations in-place using no additional memory. Not copying the array... Explicitly specify default values for each case separately (can of course be different types).
    Cons: any?
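For comparison, option 1 can be sketched with NumPy's masked arrays on a small hand-built record array. The field names `formula` and `n` and the values are made up for illustration; nothing here calls root_numpy.

```python
import numpy as np
import numpy.ma as ma

# Hand-built record array standing in for converted tree data.
rec = np.array(
    [(1.5, 7), (0.0, 8), (-2.0, 9)],
    dtype=[("formula", "f8"), ("n", "i4")],
)

# One boolean per field per record: the second entry's "formula"
# value is masked because the expression yielded 0 elements there.
marr = ma.masked_array(
    rec,
    mask=[(False, False), (True, False), (False, False)],
)

formula = marr["formula"]
mean = formula.mean()                  # masked entries are ignored
as_nan = formula.filled(np.nan)        # convert to a plain array on demand
mask_fields = marr.mask.dtype.names    # the mask mirrors every field
```

The mask is a second structured array carrying a boolean for every field of every record, including fields that never need masking, which is the memory overhead listed under the cons of option 1.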

Personally I'm leaning toward option 3 since it is explicit, doesn't involve implementing our own string parsing on top of ROOT's TFormula, and doesn't involve unnecessary copying.

Apologies if this is an obvious question; I'm not too familiar with the root_numpy code. Would changes in tree2array propagate to root2array, or would it make sense to implement them there as well? Since root2array is what's used in root_pandas, it would be helpful to have that functionality in root2array.

Personally, I'm a fan of option 3 (involves the least amount of copying and/or holding things in memory).

As an aside, it would mean some changes in root_pandas but I don't think it would be too bad (although @ibab is much more qualified to comment on this!)

No problem. To clarify, root2array internally calls the same function as tree2array when converting to an array, so both would change the same way.

This is now implemented in my branch_spec branch:

https://github.com/ndawe/root_numpy/tree/branch_spec

Take root_numpy from there if you want early access to this. I'll open a PR now and update the docs before making a new release (will be 4.7.0).

🎉

#295 is merged! This is now in.

Hi @ndawe, thanks a lot for this very useful feature, which also fixes some of the things discussed in #270! When do you plan to release 4.7.0?

It would be great to have this functionality in root2hdf5 as well, i.e. to specify a truncation length per vector branch. (A per-branch selection is probably a better fit for real-world use than a global setting.) Do you think that will be possible?

Back from holiday! I'll release 4.7.0 today or tomorrow.

Good point about root2hdf5. This will definitely be possible.

Hi @ndawe,

Happy new year and thanks a lot for the release! Since we wanted to play around with PyTables to interface things, I wondered if the root2hdf5 function (whether called from the command line or in user code) supports this already. It looks like the userfunc argument can't be used for this, as root2hdf5() expects a tree to come out of this callback and not a numpy array.

Hi @gbesjes, happy new year to you too! I'm going to update the root2hdf5 functions to accept extra keyword arguments that can be passed to the underlying tree2array function calls.

Can you try rootpy 0.9.0 and pass your custom branches list as a branches= keyword argument to tree2hdf5 or root2hdf5?