pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem

Home Page: https://sparse.pydata.org

ValueError: array size defined by dims is larger than the maximum possible size

khaeru opened this issue

Describe the bug
Creating a sparse.COO array fails with a ValueError raised from np.ravel_multi_index().

To Reproduce
Steps to reproduce the behavior.

from itertools import zip_longest

import numpy as np
import pandas as pd
import xarray as xr

# Dimensions and their lengths (Fibonacci numbers)
N_dims = 7  # largest dimension has length 75025
dims = "abcdefghi"[: N_dims + 1]
sizes = [2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393][: N_dims + 1]

# Labels like a_0000, a_0001, ... along each dimension
coords = []
for d, N in zip(dims, sizes):
    coords.append([f"{d}_{i:04d}" for i in range(N)])

# One row per index position: labels for each dimension (None-padded) plus a random value
values = list(zip_longest(*coords, np.random.rand(max(sizes))))

data = (
    pd.DataFrame(values, columns=list(dims) + ["value"])
    .ffill()
    .set_index(list(dims))["value"]
)

xr.DataArray.from_series(data, sparse=True)

Expected behavior
The final line completes successfully, returning an xr.DataArray backed by sparse.COO with nnz=75025.

Observed

$ python bug.py 
75025
Traceback (most recent call last):
  File "bug.py", line 28, in <module>
    xr.DataArray.from_series(data, sparse=True)
  File "/home/khaeru/.local/lib/python3.8/site-packages/xarray/core/dataarray.py", line 2765, in from_series
    ds = Dataset.from_dataframe(df, sparse=sparse)
  File "/home/khaeru/.local/lib/python3.8/site-packages/xarray/core/dataset.py", line 4947, in from_dataframe
    obj._set_sparse_data_from_dataframe(idx, arrays, dims)
  File "/home/khaeru/.local/lib/python3.8/site-packages/xarray/core/dataset.py", line 4839, in _set_sparse_data_from_dataframe
    data = COO(
  File "/home/khaeru/.local/lib/python3.8/site-packages/sparse/_coo/core.py", line 275, in __init__
    self._sort_indices()
  File "/home/khaeru/.local/lib/python3.8/site-packages/sparse/_coo/core.py", line 1556, in _sort_indices
    linear = self.linear_loc()
  File "/home/khaeru/.local/lib/python3.8/site-packages/sparse/_coo/core.py", line 1260, in linear_loc
    return linear_loc(self.coords, self.shape)
  File "/home/khaeru/.local/lib/python3.8/site-packages/sparse/_coo/common.py", line 63, in linear_loc
    return np.ravel_multi_index(coords, shape)
  File "<__array_function__ internals>", line 5, in ravel_multi_index
ValueError: invalid dims: array size defined by dims is larger than the maximum possible size.

System

  • OS and version: Ubuntu 20.10.
  • sparse, NumPy, Numba versions:
$ pip list | grep "sparse\|numba\|numpy"
numba                         0.52.0
numpy                         1.20.0
numpydoc                      1.1.0
sparse                        0.11.2

Hello. There is an internal restriction that np.prod(shape) must fit in an np.int64. Unfortunately, your array doesn't fit the bill, and there is no reasonable way to fix that either.
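
For context, the traceback shows that COO._sort_indices() computes one flat index per nonzero via np.ravel_multi_index(coords, shape), and the largest possible flat index is prod(shape) - 1, so the whole product has to be representable. A minimal illustration (the coords and shape here are made up):

import numpy as np

# Three nonzeros in a 10 x 20 array; each gets a single flat (row-major) index.
coords = np.array([[0, 1, 2],   # row indices
                   [3, 4, 5]])  # column indices
shape = (10, 20)
print(np.ravel_multi_index(coords, shape))  # [ 3 24 45], i.e. row * 20 + col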

Thanks for that information. If I add a check like:

print(
    np.prod(sizes),
    np.iinfo(np.int64),
    np.prod(sizes) < np.iinfo(np.int64).max,
    sep="\n",
)

I get output like:

4620137613745277440
Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------

True

Is shape in your comment something other than the lengths of each dimension in my array?

There is an overflow happening; it doesn't really "fit". If we use plain old Python, which uses arbitrary-precision integers (BigInts):

>>> sizes = [2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393]
>>> size = 1
>>> for s in sizes:
...     size *= s
...
>>> print(size)
171469839194386170211374144879246912000
>>>

That is clearly larger than the maximum for np.int64 (to be pedantic, the limit is np.intp, which is aliased to np.int64 on 64-bit architectures).
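
This is also why your check printed True: np.prod reduces with a fixed-width integer dtype and wraps around silently on overflow, whereas math.prod uses Python's arbitrary-precision integers. A small sketch (assuming Python 3.8+ for math.prod):

import math

import numpy as np

sizes = [2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393]

# np.prod accumulates in a fixed-width integer and wraps silently on overflow,
# producing a bogus value that happens to be positive and in range for int64.
print(np.prod(sizes))

# math.prod uses Python ints, so it returns the true product ...
print(math.prod(sizes))

# ... and the comparison against np.intp then comes out the right way around.
print(math.prod(sizes) > np.iinfo(np.intp).max)  # True: the shape is too large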

Great, thanks for clarifying! I can use this in my code to raise more user-friendly exceptions if a user tries to do something like this.
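
For instance, something along these lines (a sketch; check_shape_fits and its message are only illustrative, not part of any API):

import math

import numpy as np

def check_shape_fits(shape):
    """Raise a friendlier error when a shape cannot be linearised into np.intp."""
    n_elements = math.prod(shape)  # Python ints, so no silent overflow
    if n_elements > np.iinfo(np.intp).max:
        raise ValueError(
            f"array of shape {tuple(shape)} has {n_elements} elements, "
            f"more than the np.intp maximum of {np.iinfo(np.intp).max}; "
            "it cannot be stored as sparse.COO"
        )

# Raises the friendlier ValueError for the shape from this issue.
check_shape_fits([2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393])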

Do we have any idea how to solve this now? I have a super-sparse, super-high-dimensional tensor, and this limitation seems hard to work around.

Hello! At the moment, no. We are working on xsparse, which might solve this partly by treating dimensions separately and adding BigInt support, but there isn't an easy fix in pydata/sparse.