pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem

Home Page: https://sparse.pydata.org

Overflow in _umath._cartesian_product when working with big sizes

mayalinetsky-kryon opened this issue · comments

Describe the bug
When working with arrays whose size exceeds the 32-bit integer limit (i.e. more than 2**31 - 1 elements), the function _cartesian_product overflows and computes a negative value for rows:

def _cartesian_product(*arrays):
    """
    Get the cartesian product of a number of arrays.

    Parameters
    ----------
    *arrays : Tuple[np.ndarray]
        The arrays to get a cartesian product of. Always sorted with respect
        to the original array.

    Returns
    -------
    out : np.ndarray
        The overall cartesian product of all the input arrays.
    """
    broadcastable = np.ix_(*arrays)
    broadcasted = np.broadcast_arrays(*broadcastable)
    rows, cols = np.prod(broadcasted[0].shape), len(broadcasted)
    dtype = np.result_type(*arrays)
    out = np.empty(rows * cols, dtype=dtype)
    start, end = 0, rows
    for a in broadcasted:
        out[start:end] = a.reshape(-1)
        start, end = end, end + rows
    return out.reshape(cols, rows)

When broadcasted[0] has a shape (a, b) where a * b needs more than 32 bits, rows comes out negative because np.prod(broadcasted[0].shape) overflows: no dtype is passed to np.prod, and the system default integer is 32-bit.
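
Here is a minimal standalone sketch of the overflow (the shape is made up for illustration; the wrap-around only happens on builds whose default NumPy integer is 32-bit, such as the Windows setup listed below):

import numpy as np

shape = (3, 2**30)  # both entries fit in 32 bits, but their product does not

# With no dtype argument, np.prod accumulates in the platform default integer,
# so on a 32-bit-default build this wraps around to -1073741824. A negative
# value here is what later makes np.empty(rows * cols) raise a ValueError.
print(np.prod(shape))

# Forcing a 64-bit accumulator gives the correct 3221225472 on every platform.
print(np.prod(shape, dtype=np.int64))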

To Reproduce
I had two really big sparse matrices and ran:

bitarray1 = sparse.random((2**11, 2**18), nnz=2**20, format='coo')
bitarray2 = sparse.random((2**11, 2**18), nnz=2**20, format='coo')
# Broadcasting bitarray1[:, None, :] against bitarray2[None, :, :] produces
# shape (2**11, 2**11, 2**18), i.e. 2**40 elements, far more than 2**31 - 1.
np.sum(np.logical_or(bitarray1[:, None, :], bitarray2[None, :, :]), axis=2)

Idea for a solution
I am pretty new to contributing to open-source code, so I'll just write my suggestion here:
replace the line rows, cols = np.prod(broadcasted[0].shape), len(broadcasted) with

rows, cols = np.prod(broadcasted[0].shape, dtype=np.int64), len(broadcasted)
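
As a quick sanity check that the change only widens the accumulator and does not alter the result, here is the same sequence of steps run on tiny made-up inputs:

import numpy as np

a = np.array([0, 1])
b = np.array([10, 20, 30])

broadcastable = np.ix_(a, b)
broadcasted = np.broadcast_arrays(*broadcastable)
rows, cols = np.prod(broadcasted[0].shape, dtype=np.int64), len(broadcasted)
out = np.empty(rows * cols, dtype=np.result_type(a, b))
start, end = 0, rows
for arr in broadcasted:
    out[start:end] = arr.reshape(-1)
    start, end = end, end + rows

print(out.reshape(cols, rows))
# [[ 0  0  0  1  1  1]
#  [10 20 30 10 20 30]]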

System

  • OS and version: Windows 10 Pro
  • sparse version: 0.13.0
  • NumPy version: 1.21.5
  • Numba version: 0.55.1

@mayalinetsky-kryon I did not follow the whole post, but if you know that you are dealing with indexing operations in NumPy, you should always use dtype=np.intp.
This is a bit tricky in the sense that Windows defaults to 32-bit integers even on 64-bit systems (so np.intp takes a bit more memory there, and some functions will give you 32-bit integers rather than 64-bit ones).

But generally np.intp is both faster (where it differs) and always large enough to index any array.
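
For example (illustrative shape; assuming a 64-bit build, where np.intp is 64 bits wide even when the default integer is not):

import numpy as np

shape = (3, 2**30)

# np.intp matches the build's pointer size, so on a 64-bit build it can hold
# the size of any array that fits in memory, including on Windows.
print(np.prod(shape, dtype=np.intp))   # 3221225472 on a 64-bit build
print(np.iinfo(np.intp).max)           # 9223372036854775807 on a 64-bit build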

I think we internally have a "default dtype" being used somewhere. I'll try to identify the cause.

This no longer raises an error for me but goes out of memory (which is the correct behaviour).