pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem

Home Page:https://sparse.pydata.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

`GCXS` matmul => slice leads to incorrect results

matbryan52 opened this issue · comments

Describe the bug
Performing matmul @ between two GCXS arrays then slicing the output can lead to incorrect results, most commonly an array of all zeros.

To Reproduce

import numpy as np
from sparse import GCXS
# Two matmul compatible arrays with no zeros
left = GCXS.from_numpy(np.ones((4, 16)))
right = GCXS.from_numpy(np.ones((16, 2)))
dotprod = left @ right
assert dotprod.nnz == dotprod.size  # result all nonzeros
dotprod_dense = dotprod.todense()
# Slice to get the first row of the result
print(dotprod_dense[0])
# > [16. 16.]  # correct result, reduced dimension was 16
print(dotprod[0].todense())
# > [0. 0.]  # sliced result is all zeros
assert all(dotprod[i].nnz == 0 for i in range(dotprod.shape[0]))  # Every row is empty !

Expected behavior
Slicing the sparse result should be consistent with slicing the dense result.

System

  • OS and version: Ubuntu 20
  • sparse version 0.14.0
  • NumPy version 1.24.3
  • Numba version 0.57.0

Additional context
I wonder if there is a link to #602 which also refers to slicing? The result array has an unsorted indices (array([1, 0, 1, 0, 1, 0, 1, 0]) with indptr array([0, 2, 4, 6, 8])).

Hey @matbryan52, I went through the observation you shared. I believe the problem might be related to left @ right.

Updating the indices after the dot product resulted in the following responses which might be inline with our expected result:

Output - 01

import numpy as np
from sparse import GCXS
import sparse as sp

left = GCXS.from_numpy(np.ones((4, 16)))
right = GCXS.from_numpy(np.ones((16, 2)))
tmp1 = GCXS.from_numpy(16*np.ones((4, 2)), fill_value=0, compressed_axes=(0,))
tmp1.fill_value = 0

dotprod = left @ right

dotprod.indices = tmp1.indices

print('-'*90)
print(tmp1)
print(dotprod)
print('-'*90)

print(tmp1.data)
print(dotprod.data)
print('-'*90)

x = tmp1[0]
y = dotprod[0]

print(x.data)
print(y.data)
print('-'*90)

print(x.data.shape)
print(y.data.shape)
print('-'*90)

print(x)
print(y)
print('-'*90)

print(x.data)
print(y.data)
print('-'*90)

print(vars(tmp1))
print(vars(dotprod))
print('-'*90)

Output - 01

------------------------------------------------------------------------------------------
<GCXS: shape=(4, 2), dtype=float64, nnz=8, fill_value=0, compressed_axes=(0,)>
<GCXS: shape=(4, 2), dtype=float64, nnz=8, fill_value=0, compressed_axes=(0,)>
------------------------------------------------------------------------------------------
[16. 16. 16. 16. 16. 16. 16. 16.]
[16. 16. 16. 16. 16. 16. 16. 16.]
------------------------------------------------------------------------------------------
[16. 16.]
[16. 16.]
------------------------------------------------------------------------------------------
(2,)
(2,)
------------------------------------------------------------------------------------------
<GCXS: shape=(2,), dtype=float64, nnz=2, fill_value=0, compressed_axes=None>
<GCXS: shape=(2,), dtype=float64, nnz=2, fill_value=0, compressed_axes=None>
------------------------------------------------------------------------------------------
[16. 16.]
[16. 16.]
------------------------------------------------------------------------------------------
{'data': array([16., 16., 16., 16., 16., 16., 16., 16.]), 'indices': array([0, 1, 0, 1, 0, 1, 0, 1]), 'indptr': array([0, 2, 4, 6, 8]), 'shape': (4, 2), '_compressed_axes': (0,), 'fill_value': 0}
{'data': array([16., 16., 16., 16., 16., 16., 16., 16.]), 'indices': array([0, 1, 0, 1, 0, 1, 0, 1]), 'indptr': array([0, 2, 4, 6, 8]), 'shape': (4, 2), '_compressed_axes': (0,), 'fill_value': 0}
------------------------------------------------------------------------------------------

while without replacing the indices I got this response .

Output - 02

------------------------------------------------------------------------------------------
<GCXS: shape=(4, 2), dtype=float64, nnz=8, fill_value=0, compressed_axes=(0,)>
<GCXS: shape=(4, 2), dtype=float64, nnz=8, fill_value=0, compressed_axes=(0,)>
------------------------------------------------------------------------------------------
[16. 16. 16. 16. 16. 16. 16. 16.]
[16. 16. 16. 16. 16. 16. 16. 16.]
------------------------------------------------------------------------------------------
[16. 16.]
[]
------------------------------------------------------------------------------------------
(2,)
(0,)
------------------------------------------------------------------------------------------
<GCXS: shape=(2,), dtype=float64, nnz=2, fill_value=0, compressed_axes=None>
<GCXS: shape=(2,), dtype=float64, nnz=0, fill_value=0, compressed_axes=None>
------------------------------------------------------------------------------------------
[16. 16.]
[]
------------------------------------------------------------------------------------------
{'data': array([16., 16., 16., 16., 16., 16., 16., 16.]), 'indices': array([0, 1, 0, 1, 0, 1, 0, 1]), 'indptr': array([0, 2, 4, 6, 8]), 'shape': (4, 2), '_compressed_axes': (0,), 'fill_value': 0}
{'data': array([16., 16., 16., 16., 16., 16., 16., 16.]), 'indices': array([1, 0, 1, 0, 1, 0, 1, 0]), 'indptr': array([0, 2, 4, 6, 8]), 'shape': (4, 2), '_compressed_axes': (0,), 'fill_value': 0}

would like to hear your take on this, can I pick this issue up ?

Anyone is welcome to pick up the issue. 😉

Hi everyone, as @SudhanshuJoshi09 suggested the problem is related to left @ right.

Specifically, the method _dot_csr_csr in _common.py does not return ordered arrays of, neither indices nor data (This is explicitly stated in https://github.com/scipy/scipy/blob/master/scipy/sparse/sparsetools/csr.h, where the code for _dot_csr_csr has been borrowed from).

It seems that the algorithm returns indices and data in reversed order. Inverting only indices leads to incorrect results once todense() is called. If also data are inverted, everything seems correct.

Code

import numpy as np
from sparse import GCXS

left = GCXS.from_numpy(np.array([[1,2], [1,2]]))
right = GCXS.from_numpy(np.array(([[1,2,3], [4,5,6]])))
dotprod = left @ right

print("----------- dotprod before before inverting indices--------------")
print(f"Data: {dotprod.data}")
print(f"Indices: {dotprod.indices}")
print(f"Dense Matrix:\n{dotprod.todense()}")

print("----------- sliced vector before inverting indices-----------")
print(f"Data: {dotprod[0].data}")
print(f"Indices: {dotprod[0].indices}")

# inverting indices
ord_indices = dotprod.indices
dotprod.indices = ord_indices[::-1]

print("----------- sliced vector after inverting indices-----------")
print(f"Data: {dotprod[0].data}")
print(f"Indices: {dotprod[0].indices}")

print("----------- wrong dense matrix-----------")
print(f"Dense matrix of dotprod:\n{dotprod.todense()}")

print("----------- wrong dense vector-----------")
print(f"Dense matrix of dotprod[0]:\n{dotprod[0].todense()}")

#inverting also data
ord_data = dotprod.data
dotprod.data = ord_data[::-1]


print("----------- dense matrix after inverting also data--------")
print(f"Dense matrix of dotprod:\n{dotprod.todense()}")

print("----------- dense sliced vector after inverting also data--------")
print(f"Dense matrix of dotprod[0]:\n{dotprod[0].todense()}")

Output

----------- dotprod before before inverting indices--------------
Data: [15 12  9 15 12  9]
Indices: [2 1 0 2 1 0]
Dense Matrix:
[[ 9 12 15]
 [ 9 12 15]]
----------- sliced vector before inverting indices-----------
Data: []
Indices: []
----------- sliced vector after inverting indices-----------
Data: [15 12  9]
Indices: [0 1 2]
----------- wrong dense matrix-----------
Dense matrix of dotprod:
[[15 12  9]
 [15 12  9]]
----------- wrong dense vector-----------
Dense matrix of dotprod[0]:
[15 12  9]
----------- dense matrix after inverting also data--------
Dense matrix of dotprod:
[[ 9 12 15]
 [ 9 12 15]]
----------- dense sliced vector after inverting also data--------
Dense matrix of dotprod[0]:
[ 9 12 15]

So, the easiest solution would be to reverse indices and data in _dot_csr_csr before returning them. I don't know If this approach is the best regarding performance. What are your thoughts about that?

Thanks @EuGig for the detailed analysis. I'd be willing to accept a PR if you make one.