pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem

Home Page: https://sparse.pydata.org

Memory usage compared to scipy.sparse.coo_matrix

acycliq opened this issue

Description
I was wondering if there is any insight into why a sparse.COO seems to consume almost twice as much memory as an equivalent list of scipy.sparse.coo_matrix objects. Maybe I am missing something.
The code below is copied from the getting-started page of the documentation. I have extended it to convert the dense 3D numpy array into a list of coo_matrix objects and to calculate the memory footprint of each representation.

Example Code

import numpy as np
import sparse
from scipy.sparse import coo_matrix

x = np.random.random((100, 100, 100))
x[x < 0.9] = 0  # fill most of the array with zeros

s = sparse.COO(x)  # convert to sparse array

The size of the dense array is about 7.6 MB:
x.nbytes / (1024**2) = 7.629 MB

The size of the COO is about 3 MB:
s.nbytes / (1024**2) = 3.038 MB

Now make a list of scipy.sparse.coo_matrix objects, one per 2D slice of x:

sp = [coo_matrix(d) for d in x]
nbytes_list = [d.data.nbytes + d.row.nbytes + d.col.nbytes for d in sp]

The size of the list is about 1.5 MB:
sum(nbytes_list) / (1024**2) = 1.519 MB

In fact this can be simplified even further by comparing a 2D COO with a scipy.sparse.coo_matrix (instead of a 3D COO with a list). For example:
sparse.COO(x[0]).nbytes = 23448
but
coo_matrix(x[0]).data.nbytes + coo_matrix(x[0]).row.nbytes + coo_matrix(x[0]).col.nbytes = 15632
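
To see where those two byte counts come from, one can inspect the component arrays directly (a small sketch; the exact nnz and byte counts vary with the random fill):

import numpy as np
import sparse
from scipy.sparse import coo_matrix

x = np.random.random((100, 100, 100))
x[x < 0.9] = 0

s2d = sparse.COO(x[0])   # 2D pydata/sparse COO
m2d = coo_matrix(x[0])   # 2D scipy COO matrix

print(s2d.nnz, m2d.nnz)                                   # same number of stored elements
print(s2d.coords.nbytes, s2d.data.nbytes)                 # coordinate block + values
print(m2d.row.nbytes, m2d.col.nbytes, m2d.data.nbytes)    # row indices + column indices + values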

Hello, this can be explained by the fact that COO stores its indices as np.intp rather than np.int32, which allows larger sparse arrays to be stored. The following code snippet illustrates this:

>>> a = np.random.default_rng().random((100, 100, 100))
>>> a[a < 0.9] = 0
>>> s_list = [scipy.sparse.coo_matrix(d) for d in a]
>>> s = sparse.COO.from_numpy(a)
>>> s.nbytes / (1024**2)
3.03948974609375
>>> sum(d.data.nbytes + d.row.nbytes + d.col.nbytes for d in s_list) / (1024**2)
1.519744873046875
>>> s.coords.dtype
dtype('int64')
>>> (s.coords.dtype, s.data.dtype)
(dtype('int64'), dtype('float64'))
>>> (s_list[0].row.dtype, s_list[0].col.dtype, s_list[0].data.dtype)
(dtype('int32'), dtype('int32'), dtype('float64'))
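
The roughly 2x gap follows directly from a per-nonzero-element accounting. Below is a short self-contained sketch of that accounting (exact totals depend on the random fill; np.intp is 8 bytes on a 64-bit system, and the scipy matrices here use int32 indices, as shown above):

import numpy as np
import scipy.sparse
import sparse

a = np.random.default_rng().random((100, 100, 100))
a[a < 0.9] = 0

s = sparse.COO.from_numpy(a)
s_list = [scipy.sparse.coo_matrix(d) for d in a]

# pydata/sparse COO: one np.intp coordinate per dimension, plus the value itself
per_nnz_coo = a.ndim * np.dtype(np.intp).itemsize + a.itemsize      # 3 * 8 + 8 = 32 bytes
# scipy.sparse.coo_matrix: one int32 row index and one int32 column index, plus the value
per_nnz_scipy = 2 * np.dtype(np.int32).itemsize + a.itemsize        # 2 * 4 + 8 = 16 bytes

print(per_nnz_coo * s.nnz == s.nbytes)                              # True
print(per_nnz_scipy * sum(m.nnz for m in s_list)
      == sum(m.row.nbytes + m.col.nbytes + m.data.nbytes for m in s_list))  # True

So for this 3D array each stored element costs 32 bytes in COO versus 16 bytes in an int32-indexed 2D coo_matrix, which is the factor of two. The 2D numbers from the question (23448 vs 15632 bytes) tell the same story: 977 nonzeros at 2 * 8 + 8 = 24 bytes each versus 2 * 4 + 8 = 16 bytes each.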

We used to have routines that compressed the dtype of s.coords to the smallest type that fits the array, but in many cases this led to overflows and bugs, so we reverted to using np.intp only. As part of the work under #618, we plan to give the user more control over this in the compiler backend.
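
For intuition on the overflow concern, here is an illustrative sketch (not the actual code path that was reverted): even when every per-axis coordinate fits comfortably in int32, a linearized row-major index over the full shape (one common way to sort and deduplicate coordinates) can exceed the int32 range.

import numpy as np

shape = (100_000, 100_000)  # hypothetical large but valid 2D sparse shape
# Linearized (row-major) index of the last element of that shape
last = np.ravel_multi_index((shape[0] - 1, shape[1] - 1), shape)

print(last)                           # 9999999999
print(np.iinfo(np.int32).max)         # 2147483647
print(last > np.iinfo(np.int32).max)  # True -- int32 indices would overflow here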