sparse.COO.broadcast_to ValueError for COO with dense size bigger than 2**32-1
mayalinetsky-kryon opened this issue
Bug Description
When broadcasting a sparse matrix <COO: shape=(6318, 35, 1, 1, 17806), dtype=uint8, nnz=5698564, fill_value=0> to shape (6318,35,36,21,17806) I get:
ValueError: could not broadcast input array from shape (4308114384,) into shape (13147088,)
To Reproduce
import random
import sparse

# Create a 5-dimensional sparse matrix with about [nnz] non-zero values
nnz = 5698564
shape = (6318, 35, 1, 1, 17806)
# One list of random indices per dimension, each index within that dimension's bounds
coords = [[random.randint(0, dim - 1) for _ in range(nnz)] for dim in shape]
data = [1] * nnz
matrix = sparse.COO(coords, data, shape=shape, fill_value=0)
# Broadcast to shape (6318,35,36,21,17806)
matrix_broadcasted = matrix.broadcast_to((6318, 35, 36, 21, 17806))
Expected behavior
I expected matrix_broadcasted to have shape (6318,35,36,21,17806), with matrix_broadcasted[:,:,k,m,:] equal to matrix[:,:,0,0,:] for every index pair (k, m).
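To make the expected semantics concrete, here is a small sketch using a dense NumPy array as a stand-in for the sparse one. The tiny shapes and the use of np.broadcast_to are illustrative assumptions; the real case uses sparse.COO.broadcast_to on a much larger array.

```python
import numpy as np

# Small dense stand-in for the sparse array: shape (2, 3, 1, 1, 4)
small = np.arange(24, dtype=np.uint8).reshape(2, 3, 1, 1, 4)

# Broadcast the two length-1 axes to lengths 5 and 6
expanded = np.broadcast_to(small, (2, 3, 5, 6, 4))

# Every (k, m) slice should equal the original [:, :, 0, 0, :] slice
all_equal = all(
    np.array_equal(expanded[:, :, k, m, :], small[:, :, 0, 0, :])
    for k in range(5)
    for m in range(6)
)
print(expanded.shape, all_equal)
```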
System
- OS and version: Windows 10 Pro Version 21H1
- sparse version '0.13.0'
- NumPy version '1.21.5'
- Numba version '0.55.1'
I followed the error into the function _cartesian_product in the sparse._umath module.
There is a line:
rows, cols = np.prod(broadcasted[0].shape), len(broadcasted)
This computes the number of indices that need to be added to other.coords and stores it in the variable rows.
Here is a screenshot from the debugger:
In this case, rows needs to be 5698564*36*21 = 4,308,114,384, which is a little more than 2**32+2**23. Let's call this "real_rows". But the value returned from np.prod is 13147088, which is exactly real_rows-2**32.
I looked into np.prod, and the default returned dtype is the dtype of the input array, which is the dtype of broadcasted[0].shape, which is "int".
This "int" is the default platform integer, which is int32 on this Windows setup, and this is why np.prod overflows.
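The wraparound can be reproduced on any platform by forcing a 32-bit accumulator. This is a minimal sketch: the three factors are taken from the numbers in this thread rather than from the actual contents of broadcasted[0].shape, and the explicit dtype=np.int32 stands in for the Windows default.

```python
import numpy as np

nnz, k, m = 5698564, 36, 21

# Exact product using Python's arbitrary-precision integers
real_rows = nnz * k * m

# Force a 32-bit accumulator to mimic the platform default on Windows
factors = np.array([nnz, k, m], dtype=np.int32)
wrapped = int(np.prod(factors, dtype=np.int32))

# A 64-bit accumulator holds the product without wrapping
exact = int(np.prod(factors, dtype=np.int64))

print(real_rows, wrapped, exact)
```

The 32-bit result differs from the exact product by 2**32, matching the two sizes in the ValueError message.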
How can I make the default dtype of np.ndarray.shape be int64?
Try dtype=np.int64 when creating the sparse array.
I tried replacing this line in sparse._umath._cartesian_product:
rows, cols = np.prod(broadcasted[0].shape), len(broadcasted)
with:
rows, cols = np.prod(broadcasted[0].shape, dtype=np.int64), len(broadcasted)
and the value returned to rows is now correct.
The problem this time was the next line of code:
out = np.empty(rows * cols, dtype=dtype)
This needed 96 GiB of space for the allocation and raised a memory error.
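The size of such an allocation can be estimated up front without touching memory. In this sketch, estimated_gib is a hypothetical helper, and cols = 3 with an 8-byte dtype are assumptions not stated in the thread, chosen because they land close to the reported 96 GiB.

```python
import numpy as np

def estimated_gib(n_elements, dtype):
    """Size in GiB of a 1-D array, computed without allocating it."""
    return n_elements * np.dtype(dtype).itemsize / 2**30

rows = 5698564 * 36 * 21  # Python int, so no overflow
cols = 3                  # assumption: not stated in the thread
size_gib = estimated_gib(rows * cols, np.int64)  # assumption: 8-byte dtype
print(f"{size_gib:.1f} GiB")
```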
Bottom line: I created a data structure so big that its indices needed more than 32 bits.
The sparse package doesn't really need to support indices this big because the array itself takes too much space.
I'll try to implement my code with a smaller data structure.
Thanks for your help.