pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem

Home Page: https://sparse.pydata.org

Can't take max of arrays at least as large as 2 ** 32

wecassidy opened this issue · comments

Describe the bug
Calling sparse.COO.max on an array of size 2 ** 32 or larger fails with a TypeError:

>>> a.shape
(4294967296,)
>>> a.max()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\<path_redacted>\sparse\_sparse_array.py", line 444, in max
    return np.maximum.reduce(self, out=out, axis=axis, keepdims=keepdims)
  File "C:\<path_redacted>\sparse\_sparse_array.py", line 307, in __array_ufunc__
    result = SparseArray._reduce(ufunc, *inputs, **kwargs)
  File "C:\<path_redacted>\sparse\_sparse_array.py", line 278, in _reduce
    return self.reduce(method, **kwargs)
  File "C:\<path_redacted>\sparse\_sparse_array.py", line 360, in reduce
    out = self._reduce_calc(method, axis, keepdims, **kwargs)
  File "C:\<path_redacted>\sparse\_coo\core.py", line 692, in _reduce_calc
    data, inv_idx, counts = _grouped_reduce(a.data, a.coords[0], method, **kwargs)
  File "C:\<path_redacted>\sparse\_coo\core.py", line 1566, in _grouped_reduce
    result = method.reduceat(x, inv_idx, **kwargs)
TypeError: Cannot cast array data from dtype('uint64') to dtype('int64') according to the rule 'safe'

To Reproduce
Create an array a at least as large as 2 ** 32 with at least one nonzero element, then call a.max(). For example:

>>> b = sparse.DOK((2 ** 32,))
>>> b[0] = 1
>>> a = sparse.COO(b)
>>> a.nnz
1
>>> a.max() # TypeError

Expected behavior
Return the maximum value of the array (1 in the example above).

System

  • OS and version: Windows 10
  • sparse version: 0.12.0+44.g765e297 (bug is also present in 0.12.0, installed from pip)
  • NumPy version: 1.18.5
  • Numba version: 0.53.1

Additional context
sparse.COO.max works on an array of size 2 ** 32 if it is empty (i.e. a.nnz == 0).

Are you on 32-bit Windows by any chance?

I'm on 64-bit Windows.

I just checked and this bug is not present on Manjaro 21.0.7 with Linux 5.12.9-1-MANJARO (x86_64).

Mentoring instructions: Replace all uses of np.[as]array(list) with np.[as]array(list, dtype=np.int64).
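
A minimal sketch of why the explicit dtype matters, assuming the index arrays are built from plain Python lists (the `coords` list here is made up for illustration). Without a dtype, NumPy picks its platform default integer, which is 32-bit on Windows under NumPy 1.x and 64-bit on Linux; that would be consistent with the bug showing up on one system and not the other.

```python
import numpy as np

# Illustrative coordinates; not taken from the sparse codebase.
coords = [0, 1, 2]

# Default integer dtype depends on the platform and NumPy version
# (int32 on Windows with NumPy 1.x, int64 on 64-bit Linux).
default_dtype = np.asarray(coords).dtype

# An explicit dtype gives the same index width everywhere.
explicit = np.asarray(coords, dtype=np.int64)

print("default:", default_dtype)
print("explicit:", explicit.dtype)
```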

Hello, I ran into the same problem. Was there any solution to this?

A quick update, since I'm now digging into the library. The COO constructor has an idx_dtype parameter that, I believe, should force COO to use a specific dtype for its indices. However, if data is None in the constructor call, the array is converted via as_coo, which in turn relies on DOK's as_format, which calls COO.from_iter. from_iter doesn't take idx_dtype, so it never forwards it to the final call to COO's constructor.

The result is, effectively, that idx_dtype gets ignored.

A proposal for improving this would be:

  • as_coo should take idx_dtype (and possibly more parameters of the constructor, maybe directly **kwargs?) and forward them down as appropriate.
  • as_format should take **kwargs and should forward them to whichever constructor/factory it uses internally
  • from_iter should take **kwargs and forward them to the COO constructor.
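
The forwarding idea in the list above could look roughly like the sketch below. `ToyCOO` and `ToyDOK` are hypothetical stand-ins written for this example, not sparse's actual classes, and the signatures are assumptions; the point is only that each layer passes `**kwargs` down until the constructor consumes `idx_dtype`.

```python
import numpy as np


class ToyCOO:
    """Hypothetical COO-like container (not sparse.COO)."""

    def __init__(self, coords, data, shape, idx_dtype=None):
        coords = np.asarray(coords)
        if idx_dtype is not None:
            # Honour the caller's requested index dtype.
            coords = coords.astype(idx_dtype)
        self.coords = coords
        self.data = np.asarray(data)
        self.shape = shape

    @classmethod
    def from_iter(cls, items, shape, **kwargs):
        # items maps index tuples to values; build an (ndim, nnz) array.
        coords = np.array(list(items.keys())).reshape(-1, len(shape)).T
        data = list(items.values())
        # Forward everything the constructor understands, e.g. idx_dtype.
        return cls(coords, data, shape, **kwargs)


class ToyDOK:
    """Hypothetical DOK-like container (not sparse.DOK)."""

    def __init__(self, shape):
        self.shape = shape
        self.data = {}

    def __setitem__(self, key, value):
        if not isinstance(key, tuple):
            key = (key,)
        self.data[key] = value

    def as_format(self, fmt, **kwargs):
        if fmt != "coo":
            raise NotImplementedError(fmt)
        # Pass keyword arguments through instead of dropping them.
        return ToyCOO.from_iter(self.data, shape=self.shape, **kwargs)


d = ToyDOK((2 ** 32,))
d[0] = 1
c = d.as_format("coo", idx_dtype=np.int64)
print(c.coords.dtype)
```

Which keyword arguments are safe to forward in the real library is a separate question that this sketch does not answer.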

I don't know which, if any, parameter combinations should be forbidden to ensure there is no infinite recursion in the constructor, but I believe someone with more knowledge of the codebase might know what and where to check so this doesn't happen.

I traced the issue to its source and came up with a hack to make this work, should anyone else run into this problem.
Basically, when reshape is called, because idx_dtype is ignored (as mentioned in the comment above), it uses the default int32 idx_type. Since int32 can't store the new shape, the test there triggers and idx_type is replaced by the result of np.min_scalar_type(max(shape)), which is np.uint64. uint64 cannot be safely cast to the int64 that reduceat expects, and that's what causes the TypeError.
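
The dtype arithmetic described above can be checked with NumPy alone. This is only an illustration of the three facts involved, not the library's code path:

```python
import numpy as np

size = 2 ** 32

# int32 tops out at 2**31 - 1, so it cannot hold this dimension.
fits_int32 = size <= np.iinfo(np.int32).max

# For a non-negative value NumPy picks the smallest *unsigned* dtype.
chosen = np.min_scalar_type(size)

# The cast rejected in the traceback: uint64 -> int64 under the 'safe' rule.
safe = np.can_cast(chosen, np.int64, casting="safe")

print(fits_int32, chosen, safe)  # False uint64 False
```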

My hack to solve this is to hardcode np.int64 instead of letting numpy choose:

idx_type = np.int64

This solves the problem when calling max().
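
A slightly less blunt variant of the same idea, as a sketch: `signed_index_dtype` is a hypothetical helper (not part of sparse) that keeps the current dtype when it is wide enough and otherwise falls back to np.int64 instead of an unsigned type. The example assumes a 64-bit platform, where reduceat accepts int64 indices.

```python
import numpy as np


def signed_index_dtype(current, shape):
    """Hypothetical helper: keep `current` if it fits, else use int64."""
    current = np.dtype(current)
    largest = max(shape) if len(shape) else 0
    if largest <= np.iinfo(current).max:
        return current
    return np.dtype(np.int64)


idx_dtype = signed_index_dtype(np.int32, (2 ** 32,))

# Signed 64-bit indices are accepted by reduceat on a 64-bit platform.
values = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
starts = np.array([0, 2], dtype=idx_dtype)
result = np.maximum.reduceat(values, starts)

print(idx_dtype, result)
```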

Thanks @GPhilo for digging into this, I'll try to set some time aside this weekend to fix it and cut a release.

It has been more than 2 years and this issue still seems to exist. Any update on this?

This doesn't happen anymore on sparse 0.15.1, which is the latest release. Closing.