pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem

Home Page: https://sparse.pydata.org

NumPy ufuncs can change the compression of a GCXS array

jamestwebber opened this issue · comments

I didn't see a duplicate issue for this, but I may have missed it.

Describe the bug

I have a GCXS array with compressed rows (i.e. a CSR array). I want to transform the data with e.g. `sqrt` or `log1p`. NumPy ufuncs work great for this, but surprisingly I end up with a CSC array afterwards. I'm not sure why this would happen; it should basically just apply the ufunc to `data` and return the same indices.

I've been trying to write sparse-friendly code, which means I need to use the array internals directly, but if the compression changes I get the wrong results (and/or do something inefficient).

To Reproduce

```python
import numpy as np
import sparse

# build a row-compressed (CSR-like) GCXS array
m = sparse.random((10, 8), density=0.1, format="gcxs", compressed_axes=(0,))
assert m.compressed_axes == (0,)

# applying a ufunc flips the compression to columns (CSC-like)
assert np.sqrt(m).compressed_axes == (1,)
```

Expected behavior
I didn't expect the compression to change, and my downstream methods are specialized for compressed rows.

System

  • OS and version: Debian 11 VM on GCP
  • sparse version: 0.14.0
  • NumPy version: 1.24.3
  • Numba version: 0.57.0

Just to add on a bit: I am happy to look into this... later 😂 I'm on a deadline for now.

Looking into this a little bit, it looks like `_Elemwise` doesn't preserve any format info about its input, so when it returns the output it uses the default `asformat` conversion. For a GCXS array this means the conversion gets to choose which direction to compress, and it might choose differently from the input.

Maybe this isn't considered a bug, but it's a little unexpected here. It is simple enough to fix this particular behavior by storing `compressed_axes` and passing it to the re-formatting later*, but there might be a more elegant general solution.

* with the caveat that I don't know what to do with multiple `sparse_args`.
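To illustrate what "passing it to the re-formatting" would mean, here's a small demo using the public `GCXS.from_coo` conversion (the actual call site inside `_Elemwise` may differ; that part is an assumption):

```python
import sparse

m = sparse.random((10, 8), density=0.1, format="gcxs", compressed_axes=(0,))
coo = m.tocoo()

# default conversion: GCXS is free to pick either compression direction
default = sparse.GCXS.from_coo(coo)

# threading the input's compressed_axes through pins the direction
pinned = sparse.GCXS.from_coo(coo, compressed_axes=m.compressed_axes)
assert pinned.compressed_axes == (0,)
```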

I'd compress everything I can, but on mixed types, default to COO.

> I'd compress everything I can, but on mixed types, default to COO.

Not sure what you mean by this? I think I need to read up on the `__array_ufunc__` call; I'm not clear on what is possible in the arguments.

I was hoping to be able to write some fast paths for unary operations like `sqrt`, but after reading more about the potential arguments to a ufunc I'm wary of trying it. There should be a way to skip the conversion to COO for single-argument ufuncs, but the possibility of kwargs makes that trickier. I will have to keep thinking about it, because I do think there's potential there; in a basic test it sped up operations on GCXS arrays considerably.
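For the record, here's roughly the kind of fast path I mean, as a standalone sketch (it assumes the GCXS `(data, indices, indptr)` constructor and only handles the no-kwargs case where the ufunc maps the fill value to itself):

```python
import numpy as np
import sparse

def unary_fastpath(func, x):
    # only valid when the fill value is a fixed point of func
    # (e.g. sqrt(0) == 0, log1p(0) == 0), since the stored
    # indices and indptr are reused unchanged
    assert func(x.fill_value) == x.fill_value
    return sparse.GCXS(
        (func(x.data), x.indices, x.indptr),
        shape=x.shape,
        compressed_axes=x.compressed_axes,
        fill_value=x.fill_value,
    )

m = sparse.random((10, 8), density=0.1, format="gcxs", compressed_axes=(0,))
assert unary_fastpath(np.sqrt, m).compressed_axes == (0,)
```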

In the meantime, a simple enhancement here is to preserve the `compressed_axes` for GCXS if there's only one distinct tuple present in the args, e.g. in the unary case, or when both inputs have compressed rows. Does that make sense?
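Concretely, something along these lines (a sketch; `_find_compressed_axes` is a made-up helper name, not anything in the codebase):

```python
def _find_compressed_axes(sparse_args):
    # if every GCXS input agrees on compressed_axes, return that
    # tuple so the output can be built with the same compression;
    # otherwise return None and keep the current default behavior
    axes = {
        arg.compressed_axes
        for arg in sparse_args
        if hasattr(arg, "compressed_axes")
    }
    return axes.pop() if len(axes) == 1 else None
```

In the unary case the set has one element by construction, and the mixed case falls back to the default, matching the suggestion above.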

Feel free to open a PR. 😉

I was working on a more specialized version and I was surprised to find `__array_ufunc__` being used in other places, e.g. `astype`, `round`, `clip`. This meant I couldn't assume I had the ufunc interface when dispatching, which is a little inconvenient.

I'm not sure whether it makes sense to rewrite these functions without calling `__array_ufunc__`. The upside is that the ufunc code could be simpler and perhaps more efficient, while the downside is that these functions might get more complex (although I think they would follow a consistent pattern).
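For example, `clip` could be written directly against the internals instead of going through `__array_ufunc__` (a sketch under the same assumptions as the fast path above, not the library's actual method):

```python
import numpy as np
import sparse

def clip_direct(x, a_min=None, a_max=None):
    # transform the stored values and the fill value consistently;
    # indices and indptr are untouched, so the format is preserved
    return sparse.GCXS(
        (np.clip(x.data, a_min, a_max), x.indices, x.indptr),
        shape=x.shape,
        compressed_axes=x.compressed_axes,
        fill_value=np.clip(x.fill_value, a_min, a_max),
    )
```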

Another question is whether these functions should be ufuncs themselves. But that's a question for NumPy.

I'll leave the decision up to you. 😉

Yeah if I have the time I will see what it looks like to modify them, perhaps using scipy.sparse as a template there.

Last I checked, scipy.sparse didn't do `__array_ufunc__` or fill values, which limits its choice of functions as well.

Yeah, I meant their code for the non-ufunc functions, which are only a handful of things. But I will have to see how complex it gets, and not right now because I'm procrastinating on real work. 😅

> Yeah if I have the time I will see what it looks like to modify them, perhaps using scipy.sparse as a template there.

scipy.sparse has a couple of layers of class inheritance to make this "simple", so it'd be a bit heavy to do it their way.

Basically it comes down to DOK vs. everyone else. If a format has a `data` attribute then it's simple to transform all the values; scipy uses an abstract class to represent all the data-having formats. It's possible to emulate that, just a bigger refactor than I wanted.
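Roughly, the scipy pattern looks like this (names like `_DataMixin` and `_with_data` mirror scipy's internal `_data_matrix`, but this is just an illustration, not sparse's actual class layout):

```python
import numpy as np

class _DataMixin:
    # shared by formats that keep their values in a flat .data array
    # (GCXS, COO); DOK has no .data and would implement these itself
    def _with_data(self, data):
        # each concrete format rebuilds itself around a transformed
        # data array, reusing its existing index structure
        raise NotImplementedError

    def astype(self, dtype):
        return self._with_data(self.data.astype(dtype))

    def round(self, decimals=0):
        return self._with_data(np.round(self.data, decimals))
```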

There might be something better that could be done for this package, but it's more than I can take on right now.