pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem

Home Page: https://sparse.pydata.org

sparse.COO.broadcast_to ValueError for COO with dense size bigger than 2**32-1

mayalinetsky-kryon opened this issue

Bug Description
When broadcasting a sparse matrix <COO: shape=(6318, 35, 1, 1, 17806), dtype=uint8, nnz=5698564, fill_value=0> to shape (6318,35,36,21,17806) I get:
ValueError: could not broadcast input array from shape (4308114384,) into shape (13147088,)

To Reproduce

import random
import sparse

# Create a 5-dimensional sparse array with about `nnz` non-zero values
nnz = 5698564
shape = (6318, 35, 1, 1, 17806)
# One list of random indices per dimension, each index in [0, dim - 1]
coords = [[random.randint(0, dim - 1) for _ in range(nnz)] for dim in shape]
data = [1] * nnz

matrix = sparse.COO(coords, data, shape=shape, fill_value=0)

# Broadcast to shape (6318, 35, 36, 21, 17806)
matrix_broadcasted = matrix.broadcast_to((6318, 35, 36, 21, 17806))

Expected behavior
I expected matrix_broadcasted to have shape (6318,35,36,21,17806), with
matrix_broadcasted[:,:,k,m,:] equal to matrix[:,:,0,0,:] for every pair of indices k,m.
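
The expected semantics can be checked on a small dense analogue. The sketch below uses NumPy's np.broadcast_to on a tiny array with the same pattern of singleton axes. The shapes are made up for illustration, and it assumes that sparse.COO.broadcast_to is meant to follow the same broadcasting rules as NumPy.

```python
import numpy as np

# Tiny stand-in for the real array: axes 2 and 3 have length 1.
small = np.arange(2 * 3 * 4).reshape(2, 3, 1, 1, 4)

# Broadcast the two singleton axes to lengths 5 and 6 (illustrative sizes).
expanded = np.broadcast_to(small, (2, 3, 5, 6, 4))

# Every (k, m) slice should equal the original slice at (0, 0).
all_equal = all(
    np.array_equal(expanded[:, :, k, m, :], small[:, :, 0, 0, :])
    for k in range(5)
    for m in range(6)
)
print(expanded.shape, all_equal)
```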

System

  • OS and version: Windows 10 Pro Version 21H1
  • sparse version '0.13.0'
  • NumPy version '1.21.5'
  • Numba version '0.55.1'

I traced the error to the function _cartesian_product in the sparse._umath module.
It contains this line:
rows, cols = np.prod(broadcasted[0].shape), len(broadcasted)
This calculates the number of indices that need to be added to other.coords and stores it in the variable rows.

[image: screenshot from the debugger]

In this case, rows should be 5698564*36*21 = 4,308,114,384, which is a little more than 2**32+2**23. Let's call this "real_rows".
But the value returned from np.prod is 13147088, which is exactly real_rows-2**32.

I looked into np.prod. By default, the returned dtype is the dtype of the input array, which here is the dtype derived from broadcasted[0].shape, namely "int".

This "int" is the default platform integer, which is int32 on this system, and this is why np.prod overflows.
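
The wraparound can be reproduced on any platform by forcing a 32-bit accumulator. This is a minimal sketch: the three-element array is a hypothetical stand-in for the factors behind broadcasted[0].shape, not the actual shape from the debugger, and the explicit dtype=np.int32 imitates what is assumed to happen by default on the reported Windows setup.

```python
import numpy as np

nnz, k, m = 5698564, 36, 21

# Force a 32-bit accumulator, as happens by default on platforms where
# NumPy's default integer is int32 (assumed to match the reported setup).
wrapped = np.array([nnz, k, m], dtype=np.int32).prod(dtype=np.int32)

# The same product with a 64-bit accumulator does not overflow.
exact = np.array([nnz, k, m], dtype=np.int64).prod(dtype=np.int64)

print(int(wrapped))               # 13147088
print(int(exact))                 # 4308114384
print(int(exact) - int(wrapped))  # 4294967296 == 2**32
```

The overflow is silent: the reduction raises no error or warning, which is why the failure only surfaces later as a shape mismatch.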

How can I make the default dtype of np.ndarray.shape be int64?

Try dtype=np.int64 when creating the sparse array.

I tried replacing this line in sparse._umath._cartesian_product:
rows, cols = np.prod(broadcasted[0].shape), len(broadcasted)
with:
rows, cols = np.prod(broadcasted[0].shape, dtype=np.int64), len(broadcasted)
With that change, rows gets the correct value.
This time the problem was the next line of code:
out = np.empty(rows * cols, dtype=dtype)
This needed 96 GiB for the allocation and raised a MemoryError.
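
As a rough check of that figure, the allocation size is rows * cols * itemsize. The values of cols and itemsize below are assumptions, not taken from the debugger: three broadcast arrays and an 8-byte integer dtype are simply one combination that lands at about 96 GiB.

```python
rows = 5698564 * 36 * 21   # 4,308,114,384, the corrected row count
cols = 3                   # assumption: len(broadcasted) was 3
itemsize = 8               # assumption: 8-byte integer dtype such as int64

n_bytes = rows * cols * itemsize
gib = n_bytes / 2**30
print(f"{gib:.1f} GiB")
```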

Bottom line: I created a data structure so big that its indices need more than 32 bits.
The sparse package doesn't really need to support indices this big, because the index array itself takes too much space.
I'll try to implement my code with a smaller data structure.

Thanks for your help.