pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem

Home Page: https://sparse.pydata.org

NumPy ufuncs can change the compression of a GCXS array

jamestwebber opened this issue · comments

I didn't see a duplicate issue for this, but I may have missed it.

Describe the bug

I have a GCXS array with compressed rows (i.e. a CSR array). I want to transform the data with e.g. `sqrt` or `log1p`. NumPy ufuncs work great for this, but surprisingly I end up with a CSC array afterwards. I'm not sure why this would happen; it should basically just apply the ufunc to `data` and return the same indices.

I've been trying to write sparse-friendly code, which means I need to use the array internals directly, but if the compression changes I get the wrong results (and/or do something inefficient).

To Reproduce

```python
import numpy as np
import sparse

# build a row-compressed (CSR-like) GCXS array
m = sparse.random((10, 8), density=0.1, format="gcxs", compressed_axes=(0,))
assert m.compressed_axes == (0,)

# applying a ufunc flips the compression to columns (CSC-like)
assert np.sqrt(m).compressed_axes == (1,)
```

Expected behavior
I didn't expect the compression to change, and my downstream methods are specialized for compressed rows.

System

  • OS and version: Debian 11 VM on GCP
  • sparse version: 0.14.0
  • NumPy version: 1.24.3
  • Numba version: 0.57.0

Just to add on a bit: I am happy to look into this... later 😂 I'm on a deadline for now.

Looking into this a little bit, it looks like `_Elemwise` doesn't preserve any format info about its input, so when it returns the output it uses the default `asformat` conversion. For a GCXS array this means the conversion gets to choose which direction to compress, and it might choose differently from the input.

Maybe this isn't considered a bug, but it's a little unexpected here. It is simple enough to fix this particular behavior by storing `compressed_axes` and passing it to the re-formatting later*, but there might be a more elegant general solution.

* with the caveat that I don't know what to do with multiple `sparse_args`.
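To illustrate what "passing it to the re-formatting" would mean, here's a small demo using the public `GCXS.from_coo` conversion (the actual call site inside `_Elemwise` may differ; that part is an assumption):

```python
import sparse

m = sparse.random((10, 8), density=0.1, format="gcxs", compressed_axes=(0,))
coo = m.tocoo()

# default conversion: GCXS is free to pick either compression direction
default = sparse.GCXS.from_coo(coo)

# threading the input's compressed_axes through pins the direction
pinned = sparse.GCXS.from_coo(coo, compressed_axes=m.compressed_axes)
assert pinned.compressed_axes == (0,)
```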

I'd compress everything I can, but on mixed types, default to COO.

> I'd compress everything I can, but on mixed types, default to COO.

Not sure what you mean by this? I think I need to read up on the `__array_ufunc__` call; I'm not clear on what is possible in the arguments.

I was hoping to be able to write some fast paths for unary operations like `sqrt`, but after reading more about the potential arguments to a ufunc I'm wary of trying it. There should be a way to skip the conversion to COO for single-argument ufuncs, but the possibility of kwargs makes that trickier. I will have to keep thinking about it, because I do think there's potential there; in a basic test it sped up operations on GCXS arrays considerably.
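For the record, here's roughly the kind of fast path I mean, as a standalone sketch (it assumes the GCXS `(data, indices, indptr)` constructor and only handles the no-kwargs case where the ufunc maps the fill value to itself):

```python
import numpy as np
import sparse

def unary_fastpath(func, x):
    # only valid when the fill value is a fixed point of func
    # (e.g. sqrt(0) == 0, log1p(0) == 0), since the stored
    # indices and indptr are reused unchanged
    assert func(x.fill_value) == x.fill_value
    return sparse.GCXS(
        (func(x.data), x.indices, x.indptr),
        shape=x.shape,
        compressed_axes=x.compressed_axes,
        fill_value=x.fill_value,
    )

m = sparse.random((10, 8), density=0.1, format="gcxs", compressed_axes=(0,))
assert unary_fastpath(np.sqrt, m).compressed_axes == (0,)
```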

In the meantime, a simple enhancement here is to preserve the `compressed_axes` for GCXS if there's only one distinct tuple present in the args, e.g. in the unary case, or when both inputs have compressed rows. Does that make sense?
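Concretely, something along these lines (a sketch; `_find_compressed_axes` is a made-up helper name, not anything in the codebase):

```python
def _find_compressed_axes(sparse_args):
    # if every GCXS input agrees on compressed_axes, return that
    # tuple so the output can be built with the same compression;
    # otherwise return None and keep the current default behavior
    axes = {
        arg.compressed_axes
        for arg in sparse_args
        if hasattr(arg, "compressed_axes")
    }
    return axes.pop() if len(axes) == 1 else None
```

In the unary case the set has one element by construction, and the mixed case falls back to the default, matching the suggestion above.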

Feel free to open a PR. 😉

I was working on a more specialized version and I was surprised to find `__array_ufunc__` being used in other places, e.g. `astype`, `round`, `clip`. This meant I couldn't assume I had the ufunc interface when dispatching, which is a little inconvenient.

I'm not sure whether it makes sense to rewrite these functions without calling `__array_ufunc__`. The upside is that the ufunc code could be simpler and perhaps more efficient, while the downside is that these functions might get more complex (although I think they would follow a consistent pattern).
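For example, `clip` could be written directly against the internals instead of going through `__array_ufunc__` (a sketch under the same assumptions as the fast path above, not the library's actual method):

```python
import numpy as np
import sparse

def clip_direct(x, a_min=None, a_max=None):
    # transform the stored values and the fill value consistently;
    # indices and indptr are untouched, so the format is preserved
    return sparse.GCXS(
        (np.clip(x.data, a_min, a_max), x.indices, x.indptr),
        shape=x.shape,
        compressed_axes=x.compressed_axes,
        fill_value=np.clip(x.fill_value, a_min, a_max),
    )
```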

Another question is whether these functions should be ufuncs themselves. But that's a question for NumPy.

I'll leave the decision up to you. 😉

Yeah if I have the time I will see what it looks like to modify them, perhaps using scipy.sparse as a template there.

Last I checked, scipy.sparse didn't do `__array_ufunc__` or fill values, which limits its choice of functions as well.

Yeah, I meant their code for the non-ufunc functions, which are only a handful of things. But I will have to see how complex it gets, and not right now because I'm procrastinating on real work. 😅

> Yeah if I have the time I will see what it looks like to modify them, perhaps using scipy.sparse as a template there.

scipy.sparse has a couple of layers of class inheritance to make this "simple", so it'd be a bit heavy to do it their way.

Basically it comes down to DOK vs. everyone else. If a format has a `data` attribute then it's simple to transform all the values; scipy uses an abstract class to represent all the data-having formats. It's possible to emulate that, just a bigger refactor than I wanted.
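Roughly, the scipy pattern looks like this (names like `_DataMixin` and `_with_data` mirror scipy's internal `_data_matrix`, but this is just an illustration, not sparse's actual class layout):

```python
import numpy as np

class _DataMixin:
    # shared by formats that keep their values in a flat .data array
    # (GCXS, COO); DOK has no .data and would implement these itself
    def _with_data(self, data):
        # each concrete format rebuilds itself around a transformed
        # data array, reusing its existing index structure
        raise NotImplementedError

    def astype(self, dtype):
        return self._with_data(self.data.astype(dtype))

    def round(self, decimals=0):
        return self._with_data(np.round(self.data, decimals))
```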

There might be something better that could be done for this package, but it's more than I can take on right now.