pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem

Home Page:https://sparse.pydata.org

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Constructing GCXS from non-canonical `scipy.sparse.csr_matrix` results in wrong results

uellue opened this issue · comments

Describe the bug

GCXS seems to require a canonical CSR data structure, i.e. column indices sorted within a row and without duplicates. Slicing a GCXS array with non-canonical internal data structure gives wrong results. The requirement for canonical data structures is not documented: https://sparse.pydata.org/en/stable/generated/sparse.GCXS.html

scipy.sparse.csr_matrix doesn't require its data structure to be canonical. The GCXS constructor doesn't seem to check if a passed scipy.sparse.csr_matrix is canonical. In effect, constructing a GCXS array from a perfectly valid scipy.sparse.csr_matrix may create a broken GCXS array which gives wrong numerical results. Using the explicit GCXS.from_scipy_sparse() gives the same behavior.

To Reproduce

This example uses cupyx.scipy.sparse.csr_matrix since that produces non-canonical CSR data structures when constructed from dense arrays. This highlights the severity of this bug where normal use of the APIs of these packages gives numerically wrong results without causing any warning or error. In my understanding the bug is mostly in GCXS since scipy.sparse.csr_matrix and cupyx.scipy.sparse.csr_matrix work perfectly fine with non-canonical data structures.

import numpy as np
import cupy
import cupyx.scipy.sparse as csp
import sparse

# Just some data
source = np.random.random((23, 42))
# Reference slice
ref = source[2:4]

# We make a cupyx.scipy.sparse.csr_matrix from the source data
cupyx_sparse = csp.csr_matrix(cupy.array(source))
# It is not canonical, i.e. has duplicates or unsorted indices
print("Canonical?",  cupyx_sparse.has_canonical_format)

# We slice it
r1 = cupy.asnumpy(cupyx_sparse[2:4].todense())
# Seems to work correctly
print("max difference cupyx.scipy.sparse", np.max(np.abs(ref - r1)))

# We make a GCXS matrix by first transferring from the GPU and then
# constructing GCXS
gcxs = sparse.GCXS(cupyx_sparse.get())
# We slice it
r2 = gcxs[2:4].todense()
# It is wrong
print("max difference GCXS", np.max(np.abs(ref - r2)))

Edit: Include output

Canonical? False
max difference cupyx.scipy.sparse 0.0
max difference GCXS 0.9852408955033491

Expected behavior

The expected behavior can be split into two aspects, of which one is preventing a bug with serious impact on downstream code, and the other a feature request.

First, constructing a GCXS array from a valid scipy.sparse.csr_matrix that GCXS can't handle correctly should throw an error instead of producing a GCXS array that gives wrong results.

Second, it would be good if GCXS could be constructed successfully from a non-canonical scipy.sparse.csr_matrix.

System

  • OS and version: Win 11
  • sparse version (sparse.__version__ '0.14.0')
  • NumPy version (np.__version__ '1.24.4')
  • Numba version (numba.__version__ '0.57.1')
  • CuPy version (cupy.__version__ '12.1.0')

Additional context

At LiberTEM/sparseconverter#14 I am currently working on tests and workarounds. Perhaps the test matrix in https://github.com/LiberTEM/sparseconverter/blob/main/tests/test_sparseconverters.py could be useful here?

Thanks for the detailed report! Since this is silent invalid results, I'd be inclined to fix it soon.

Hello, what exactly do you mean by canonical here? Can this be reproduced without cupyx perhaps?

Hello, what exactly do you mean by canonical here? Can this be reproduced without cupyx perhaps?

It can be reproduced if the data within a row is not sorted by column index.