pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem

Home Page: https://sparse.pydata.org

ValueError: array size defined by dims is larger than the maximum possible size

khaeru opened this issue

Describe the bug
Creating a sparse.COO array fails with a ValueError raised from np.ravel_multi_index().

To Reproduce
Steps to reproduce the behavior.

from itertools import zip_longest

import numpy as np
import pandas as pd
import xarray as xr

# Dimensions and their lengths (Fibonacci numbers)
N_dims = 7  # largest dimension has length 75025
dims = "abcdefghi"[: N_dims + 1]
sizes = [2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393][: N_dims + 1]

# Labels like a_0000, a_0001, ... along each dimension
coords = []
for d, N in zip(dims, sizes):
    coords.append([f"{d}_{i:04d}" for i in range(N)])

# One row per index position: labels for each dimension (None-padded) plus a random value
values = list(zip_longest(*coords, np.random.rand(max(sizes))))

data = (
    pd.DataFrame(values, columns=list(dims) + ["value"])
    .ffill()
    .set_index(list(dims))["value"]
)

xr.DataArray.from_series(data, sparse=True)

Expected behavior
The final line completes successfully, returning an xr.DataArray backed by sparse.COO with nnz=75025.

Observed

$ python bug.py 
75025
Traceback (most recent call last):
  File "bug.py", line 28, in <module>
    xr.DataArray.from_series(data, sparse=True)
  File "/home/khaeru/.local/lib/python3.8/site-packages/xarray/core/dataarray.py", line 2765, in from_series
    ds = Dataset.from_dataframe(df, sparse=sparse)
  File "/home/khaeru/.local/lib/python3.8/site-packages/xarray/core/dataset.py", line 4947, in from_dataframe
    obj._set_sparse_data_from_dataframe(idx, arrays, dims)
  File "/home/khaeru/.local/lib/python3.8/site-packages/xarray/core/dataset.py", line 4839, in _set_sparse_data_from_dataframe
    data = COO(
  File "/home/khaeru/.local/lib/python3.8/site-packages/sparse/_coo/core.py", line 275, in __init__
    self._sort_indices()
  File "/home/khaeru/.local/lib/python3.8/site-packages/sparse/_coo/core.py", line 1556, in _sort_indices
    linear = self.linear_loc()
  File "/home/khaeru/.local/lib/python3.8/site-packages/sparse/_coo/core.py", line 1260, in linear_loc
    return linear_loc(self.coords, self.shape)
  File "/home/khaeru/.local/lib/python3.8/site-packages/sparse/_coo/common.py", line 63, in linear_loc
    return np.ravel_multi_index(coords, shape)
  File "<__array_function__ internals>", line 5, in ravel_multi_index
ValueError: invalid dims: array size defined by dims is larger than the maximum possible size.

System

  • OS and version: Ubuntu 20.10.
  • sparse, NumPy, Numba versions:
$ pip list | grep "sparse\|numba\|numpy"
numba                         0.52.0
numpy                         1.20.0
numpydoc                      1.1.0
sparse                        0.11.2

Hello. There is an internal restriction that np.prod(shape) must fit in an np.int64. Unfortunately, your array doesn't fit the bill, and there is no reasonable way to fix that either.
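
For context, the traceback shows that COO._sort_indices() computes one flat index per nonzero via np.ravel_multi_index(coords, shape), and the largest possible flat index is prod(shape) - 1, so the whole product has to be representable. A minimal illustration (the coords and shape here are made up):

import numpy as np

# Three nonzeros in a 10 x 20 array; each gets a single flat (row-major) index.
coords = np.array([[0, 1, 2],   # row indices
                   [3, 4, 5]])  # column indices
shape = (10, 20)
print(np.ravel_multi_index(coords, shape))  # [ 3 24 45], i.e. row * 20 + col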

Thanks for that information. If I add a check like:

print(
    np.prod(sizes),
    np.iinfo(np.int64),
    np.prod(sizes) < np.iinfo(np.int64).max,
    sep="\n",
)

I get output like:

4620137613745277440
Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------

True

Is shape in your comment something other than the lengths of each dimension in my array?

There is an overflow happening; it doesn't really "fit". If we use plain old Python, which uses arbitrary-precision integers (BigInts):

>>> sizes = [2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393]
>>> size = 1
>>> for s in sizes:
...     size *= s
...
>>> print(size)
171469839194386170211374144879246912000
>>>

That is clearly larger than the maximum for np.int64 (to be pedantic, the limit is np.intp, which is aliased to np.int64 on 64-bit architectures).
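
This is also why your check printed True: np.prod reduces with a fixed-width integer dtype and wraps around silently on overflow, whereas math.prod uses Python's arbitrary-precision integers. A small sketch (assuming Python 3.8+ for math.prod):

import math

import numpy as np

sizes = [2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393]

# np.prod accumulates in a fixed-width integer and wraps silently on overflow,
# producing a bogus value that happens to be positive and in range for int64.
print(np.prod(sizes))

# math.prod uses Python ints, so it returns the true product ...
print(math.prod(sizes))

# ... and the comparison against np.intp then comes out the right way around.
print(math.prod(sizes) > np.iinfo(np.intp).max)  # True: the shape is too large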

Great, thanks for clarifying! I can use this in my code to raise more user-friendly exceptions if a user tries to do something like this.
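
For instance, something along these lines (a sketch; check_shape_fits and its message are only illustrative, not part of any API):

import math

import numpy as np

def check_shape_fits(shape):
    """Raise a friendlier error when a shape cannot be linearised into np.intp."""
    n_elements = math.prod(shape)  # Python ints, so no silent overflow
    if n_elements > np.iinfo(np.intp).max:
        raise ValueError(
            f"array of shape {tuple(shape)} has {n_elements} elements, "
            f"more than the np.intp maximum of {np.iinfo(np.intp).max}; "
            "it cannot be stored as sparse.COO"
        )

# Raises the friendlier ValueError for the shape from this issue.
check_shape_fits([2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393])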

Do we have any idea how to solve this now? I have a super-sparse, super-high-dimensional tensor, and this limitation seems hard to work around.

Hello! At the moment, no. We are working on xsparse, which might solve this partly by treating dimensions separately and adding BigInt support, but there isn't an easy fix in pydata/sparse.