sparse.COO.broadcast_to ValueError for COO with dense size bigger than 2**32-1

Question

mayalinetsky-kryon opened this issue 2 years ago

Bug Description
When broadcasting a sparse matrix <COO: shape=(6318, 35, 1, 1, 17806), dtype=uint8, nnz=5698564, fill_value=0> to shape (6318,35,36,21,17806) I get:
ValueError: could not broadcast input array from shape (4308114384,) into shape (13147088,)

To Reproduce

import random
import sparse

# Create a 5-dimensional sparse array with about `nnz` non-zero values
nnz = 5698564
shape = (6318, 35, 1, 1, 17806)

# One list of random indices per dimension, each index in [0, dim - 1]
coords = [[random.randint(0, dim - 1) for _ in range(nnz)] for dim in shape]
data = [1] * nnz

matrix = sparse.COO(coords, data, shape=shape, fill_value=0)

# Broadcast to shape (6318, 35, 36, 21, 17806)
matrix_broadcasted = matrix.broadcast_to((6318, 35, 36, 21, 17806))

Expected behavior
I expected matrix_broadcasted to have shape (6318,35,36,21,17806), with
matrix_broadcasted[:,:,k,m,:] equal to matrix[:,:,0,0,:] for every pair of indices k,m.
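
The expected behavior can be illustrated at a much smaller scale. The sketch below uses a dense NumPy array as a stand-in for the sparse one; the small shapes and the variable names are chosen only for illustration and are not part of the original report.

```python
import numpy as np

# Small dense stand-in for the sparse array: axes 2 and 3 have length 1.
small = np.arange(2 * 3 * 4).reshape(2, 3, 1, 1, 4)

# Stretch the two length-1 axes to lengths 5 and 6.
stretched = np.broadcast_to(small, (2, 3, 5, 6, 4))

# Every (k, m) slice should equal the single original slice.
all_match = all(
    np.array_equal(stretched[:, :, k, m, :], small[:, :, 0, 0, :])
    for k in range(5)
    for m in range(6)
)
print(stretched.shape, all_match)
```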

System

OS and version: Windows 10 Pro Version 21H1
sparse version '0.13.0'
NumPy version '1.21.5'
Numba version '0.55.1'

Maya Linetsky · Answer 1 · Tue Apr 26 2022 15:30:13 GMT+0800 (China Standard Time)

I traced the error to the function _cartesian_product in the sparse._umath module.
There is a line:
rows, cols = np.prod(broadcasted[0].shape), len(broadcasted)
This calculates the number of indices that need to be added to other.coords and stores it in the variable rows.

In this case, rows needs to be 5698564*36*21 = 4,308,114,384 which is a little more than 2**32+2**23. Let's call this "real_rows".
But, the returned value from np.prod is 13147088 which is exactly real_rows-2**32.
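
The arithmetic above can be checked with plain Python integers, which never overflow. This is only a sanity check of the numbers quoted in this comment; the name wrapped is introduced here for illustration.

```python
nnz = 5698564

# Exact value, computed with Python's arbitrary-precision integers.
real_rows = nnz * 36 * 21

# Keep only the low 32 bits, as a 32-bit integer product would.
# The result is below 2**31, so a signed 32-bit integer holds the same value.
wrapped = real_rows % 2**32

print(real_rows)                     # 4308114384
print(wrapped)                       # 13147088
print(real_rows - wrapped == 2**32)  # True
```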

I looked into np.prod: by default the returned dtype is the dtype of the input array, which here is the dtype that broadcasted[0].shape is converted to, namely "int".

According to the linked reference, "int" is the platform default integer, which is int32 on this platform, and this is why np.prod overflows.

How can I make the default dtype of np.ndarray.shape be int64?

Hameer Abbasi · Answer 2 · Tue Apr 26 2022 16:37:37 GMT+0800 (China Standard Time)

Try dtype=np.int64 when creating the sparse array.

Maya Linetsky · Answer 3 · Wed Apr 27 2022 14:43:21 GMT+0800 (China Standard Time)

I have tried to replace this line in sparse._umath._cartesian_product:
rows, cols = np.prod(broadcasted[0].shape), len(broadcasted)
with:
rows, cols = np.prod(broadcasted[0].shape, dtype=np.int64), len(broadcasted)
And the value returned to rows is the correct value.
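
The effect of the dtype argument can be checked in isolation. In the sketch below, the tuple dims is a hypothetical stand-in for broadcasted[0].shape (its real contents are not shown in this thread), and dtype=np.int32 is passed explicitly so that the overflow appears on any platform, not only where the default integer is 32-bit.

```python
import numpy as np

# Hypothetical stand-in for broadcasted[0].shape; only the product matters here.
dims = np.array((5698564, 36, 21), dtype=np.int32)

# A 32-bit accumulator wraps around silently.
rows_32 = np.prod(dims, dtype=np.int32)

# A 64-bit accumulator holds the full product.
rows_64 = np.prod(dims, dtype=np.int64)

print(int(rows_32))  # 13147088
print(int(rows_64))  # 4308114384
```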
The problem this time was the next line of code:
out = np.empty(rows * cols, dtype=dtype)
This needed 96 GiB for the allocation and raised a memory error.
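
The 96 GiB figure can be reproduced with a rough estimate. The values cols = 3 and an 8-byte element size are assumptions, picked because they match the reported figure; the actual values of cols and dtype are not stated in this thread.

```python
rows = 5698564 * 36 * 21  # corrected row count from above
cols = 3                  # assumption: number of broadcasted arrays
itemsize = 8              # assumption: 8 bytes per element (e.g. int64)

needed_bytes = rows * cols * itemsize
needed_gib = needed_bytes / 2**30

print(f"{needed_gib:.1f} GiB")
```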

Bottom line: I created a data structure so big that its indices needed more than 32 bits.
The sparse package doesn't really need to support indices this big, because the array itself would take too much space.
I'll try to implement my code with a smaller data structure.

Thanks for your help.