ValueError: array size defined by dims is larger than the maximum possible size
khaeru opened this issue
Describe the bug
Creating `sparse.COO` fails with a `ValueError` in `np.ravel_multi_index()`.
To Reproduce
Steps to reproduce the behavior.
```python
from itertools import zip_longest

import numpy as np
import pandas as pd
import xarray as xr

# Dimensions and their lengths (Fibonacci numbers)
N_dims = 7  # largest dimension has length 75025
dims = "abcdefghi"[: N_dims + 1]
sizes = [2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393][: N_dims + 1]

# Names like f_0000 ... f_1596 along each dimension
coords = []
for d, N in zip(dims, sizes):
    coords.append([f"{d}_{i:04d}" for i in range(N)])

# Random values
values = list(zip_longest(*coords, np.random.rand(max(sizes))))
data = (
    pd.DataFrame(values, columns=list(dims) + ["value"])
    .ffill()
    .set_index(list(dims))["value"]
)

xr.DataArray.from_series(data, sparse=True)
```
Expected behavior
The final line completes successfully, returning an `xr.DataArray` backed by `sparse.COO` with `nnz=75025`.
Observed
```
$ python bug.py
75025
Traceback (most recent call last):
  File "bug.py", line 28, in <module>
    xr.DataArray.from_series(data, sparse=True)
  File "/home/khaeru/.local/lib/python3.8/site-packages/xarray/core/dataarray.py", line 2765, in from_series
    ds = Dataset.from_dataframe(df, sparse=sparse)
  File "/home/khaeru/.local/lib/python3.8/site-packages/xarray/core/dataset.py", line 4947, in from_dataframe
    obj._set_sparse_data_from_dataframe(idx, arrays, dims)
  File "/home/khaeru/.local/lib/python3.8/site-packages/xarray/core/dataset.py", line 4839, in _set_sparse_data_from_dataframe
    data = COO(
  File "/home/khaeru/.local/lib/python3.8/site-packages/sparse/_coo/core.py", line 275, in __init__
    self._sort_indices()
  File "/home/khaeru/.local/lib/python3.8/site-packages/sparse/_coo/core.py", line 1556, in _sort_indices
    linear = self.linear_loc()
  File "/home/khaeru/.local/lib/python3.8/site-packages/sparse/_coo/core.py", line 1260, in linear_loc
    return linear_loc(self.coords, self.shape)
  File "/home/khaeru/.local/lib/python3.8/site-packages/sparse/_coo/common.py", line 63, in linear_loc
    return np.ravel_multi_index(coords, shape)
  File "<__array_function__ internals>", line 5, in ravel_multi_index
ValueError: invalid dims: array size defined by dims is larger than the maximum possible size.
```
System
- OS and version: Ubuntu 20.10.
- `sparse`, NumPy, Numba versions:

```
$ pip list | grep "sparse\|numba\|numpy"
numba 0.52.0
numpy 1.20.0
numpydoc 1.1.0
sparse 0.11.2
```
Hello. It is an internal restriction that `np.prod(shape)` has to fit in an `np.int64`. Unfortunately, your array doesn't fit the bill, and there is no reasonable way to fix it either.
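This restriction can be reproduced without xarray or sparse. A minimal sketch, assuming a 64-bit build where `np.intp` is 64 bits:

```python
import numpy as np

# Two dimensions whose dense product (2**64) exceeds np.intp max (2**63 - 1),
# so no valid linear index space exists for the flattened array.
too_big = (2**32, 2**32)

try:
    np.ravel_multi_index(([0], [0]), dims=too_big)
except ValueError as exc:
    print(exc)  # "invalid dims: array size defined by dims is larger than ..."

# A shape whose product fits in np.intp works fine:
print(np.ravel_multi_index(([1], [2]), dims=(10, 10)))  # [12]
```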
Thanks for that information. If I add a line like:
```python
print(
    np.prod(sizes),
    np.iinfo(np.int64),
    np.prod(sizes) < np.iinfo(np.int64).max,
    sep="\n",
)
```
I get output like:
```
4620137613745277440
Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------
True
```
Is `shape` in your comment something other than the lengths of each dimension in my array?
There is an overflow happening; it doesn't really "fit". If we use plain old Python, whose integers are arbitrary-precision:
```pycon
>>> sizes = [2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393]
>>> size = 1
>>> for s in sizes:
...     size *= s
...
>>> print(size)
171469839194386170211374144879246912000
```
This is clearly larger than the max for `np.int64` (to be pedantic, it's an `np.intp`, aliased to `np.int64` on 64-bit architectures).
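That wraparound is also why the check above printed `True`: `np.prod` accumulates in `int64` and silently overflows, while `math.prod` uses exact Python integers. A small sketch of the difference:

```python
import math
import numpy as np

sizes = [2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393]

exact = math.prod(sizes)  # exact Python bigint, never overflows
wrapped = np.prod(sizes)  # accumulated in a fixed-width integer, wraps on overflow

print(exact)                           # 171469839194386170211374144879246912000
print(wrapped)                         # an overflowed value such as 4620137613745277440
print(exact > np.iinfo(np.int64).max)  # True: the real product does not fit in int64
```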
Great! Thanks for clarifying. I can use this in my code to throw more user-friendly exceptions if a user tries to do something like this.
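For such a guard, exact Python integers sidestep the overflow entirely. A minimal sketch (the helper name `check_dense_size` is hypothetical, not part of sparse's API):

```python
import math

import numpy as np


def check_dense_size(shape):
    """Raise a readable error if the dense size of `shape` cannot fit in np.intp."""
    dense_size = math.prod(shape)  # exact: Python ints do not overflow
    limit = int(np.iinfo(np.intp).max)
    if dense_size > limit:
        raise ValueError(
            f"dense size {dense_size} of shape {tuple(shape)} exceeds "
            f"np.intp max ({limit}); sparse.COO cannot linearize this array"
        )
    return dense_size


check_dense_size([2584, 4181])  # small enough, returns the dense size
```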
Do we now have any idea how to solve this? I have a super-sparse, super-high-dimensional tensor, and this limitation seems hard to work around.
Hello! At the moment, no. We are working on xsparse, which might partly solve this by treating dimensions separately and adding BigInt support, but there isn't an easy fix in pydata/sparse.