pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem

Home Page: https://sparse.pydata.org

Overflow in _umath._cartesian_product when working with big sizes

mayalinetsky-kryon opened this issue · comments

Describe the bug
When working with arrays whose size exceeds the 32-bit integer limit (i.e. more than 2**31 - 1 elements), the function _cartesian_product overflows and computes a negative value for rows:

def _cartesian_product(*arrays):
    """
    Get the cartesian product of a number of arrays.

    Parameters
    ----------
    *arrays : Tuple[np.ndarray]
        The arrays to get a cartesian product of. Always sorted with respect
        to the original array.

    Returns
    -------
    out : np.ndarray
        The overall cartesian product of all the input arrays.
    """
    broadcastable = np.ix_(*arrays)
    broadcasted = np.broadcast_arrays(*broadcastable)
    rows, cols = np.prod(broadcasted[0].shape), len(broadcasted)
    dtype = np.result_type(*arrays)
    out = np.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    return out.reshape(cols, rows)

When broadcasted[0] has a shape (a, b) where a * b needs more than 32 bits, rows comes out negative because np.prod(broadcasted[0].shape) overflows: no dtype is passed to np.prod, and the system default integer is 32-bit.
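
Here is a minimal standalone sketch of the overflow (the shape is made up for illustration; the wrap-around only happens on builds whose default NumPy integer is 32-bit, such as the Windows setup listed below):

import numpy as np

shape = (3, 2**30)  # both entries fit in 32 bits, but their product does not

# With no dtype argument, np.prod accumulates in the platform default integer,
# so on a 32-bit-default build this wraps around to -1073741824. A negative
# value here is what later makes np.empty(rows * cols) raise a ValueError.
print(np.prod(shape))

# Forcing a 64-bit accumulator gives the correct 3221225472 on every platform.
print(np.prod(shape, dtype=np.int64))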

To Reproduce
I had two really big sparse matrices and ran:

bitarray1 = sparse.random((2**11, 2**18), nnz=2**20, format='coo')
bitarray2 = sparse.random((2**11, 2**18), nnz=2**20, format='coo')
# Broadcasting bitarray1[:, None, :] against bitarray2[None, :, :] produces
# shape (2**11, 2**11, 2**18), i.e. 2**40 elements, far more than 2**31 - 1.
np.sum(np.logical_or(bitarray1[:, None, :], bitarray2[None, :, :]), axis=2)

Idea for a solution
I am pretty new to contributing to open-source code, so I'll just write my suggestion here:
replace the line rows, cols = np.prod(broadcasted[0].shape), len(broadcasted) with

rows, cols = np.prod(broadcasted[0].shape, dtype=np.int64), len(broadcasted)
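
As a quick sanity check that the change only widens the accumulator and does not alter the result, here is the same sequence of steps run on tiny made-up inputs:

import numpy as np

a = np.array([0, 1])
b = np.array([10, 20, 30])

broadcastable = np.ix_(a, b)
broadcasted = np.broadcast_arrays(*broadcastable)
rows, cols = np.prod(broadcasted[0].shape, dtype=np.int64), len(broadcasted)
out = np.empty(rows * cols, dtype=np.result_type(a, b))
start, end = 0, rows
for arr in broadcasted:
    out[start:end] = arr.reshape(-1)
    start, end = end, end + rows

print(out.reshape(cols, rows))
# [[ 0  0  0  1  1  1]
#  [10 20 30 10 20 30]]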

System

  • OS and version: Windows 10 Pro
  • sparse version: 0.13.0
  • NumPy version: 1.21.5
  • Numba version: 0.55.1

@mayalinetsky-kryon I did not follow the whole post, but if you know that you are dealing with indexing operations in NumPy, you should always use dtype=np.intp.
This is a bit tricky in the sense that Windows defaults to 32-bit integers even on 64-bit systems (so np.intp takes a bit more memory there, and some functions will give you 32-bit integers rather than 64-bit ones).

But generally np.intp is both faster (where it differs) and always large enough to index any array.
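
For example (illustrative shape; assuming a 64-bit build, where np.intp is 64 bits wide even when the default integer is not):

import numpy as np

shape = (3, 2**30)

# np.intp matches the build's pointer size, so on a 64-bit build it can hold
# the size of any array that fits in memory, including on Windows.
print(np.prod(shape, dtype=np.intp))   # 3221225472 on a 64-bit build
print(np.iinfo(np.intp).max)           # 9223372036854775807 on a 64-bit build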

I think we internally have a "default dtype" being used somewhere. I'll try to identify the cause.

This no longer raises an error for me but goes out of memory (which is the correct behaviour).