Memory usage compared to scipy.sparse.coo_matrix
acycliq opened this issue · comments
Description
I was wondering if there is any insight as to why a sparse.COO seems to consume almost twice as much memory as a list of scipy.sparse.coo_matrix objects. Maybe I am missing something.
The code below is copied from the getting-started page of the documentation; I have just extended it to convert the dense 3D numpy array into a list of coo_matrix objects and calculate the memory footprint.
Example Code
import numpy as np
import sparse
from scipy.sparse import coo_matrix
x = np.random.random((100, 100, 100))
x[x < 0.9] = 0 # fill most of the array with zeros
s = sparse.COO(x) # convert to sparse array
The size of the dense array is 7.6 MB:
x.nbytes / (1024**2) = 7.629 MB
The size of the COO is 3 MB:
s.nbytes / (1024**2) = 3.038 MB
Now make a list of scipy.sparse.coo_matrix objects:
sp = [coo_matrix(d) for d in x]
nbytes_list = [d.data.nbytes + d.row.nbytes + d.col.nbytes for d in sp]
The size of the list is 1.5 MB:
sum(nbytes_list) / (1024**2) = 1.519 MB
In fact this can be simplified even further: compare a 2D COO with a scipy.sparse.coo_matrix (instead of a 3D COO vs a list). For example:
sparse.COO(x[0]).nbytes = 23448
but
coo_matrix(x[0]).data.nbytes + coo_matrix(x[0]).row.nbytes + coo_matrix(x[0]).col.nbytes = 15632
Hello, this can be explained by the fact that COO stores its indices as np.intp (64-bit on most platforms) instead of np.int32, which allows larger sparse arrays to be addressed. The following code snippet illustrates that:
>>> a = np.random.default_rng().random((100, 100, 100))
>>> a[a < 0.9] = 0
>>> s_list = [scipy.sparse.coo_matrix(d) for d in a]
>>> s = sparse.COO.from_numpy(a)
>>> s.nbytes / (1024**2)
3.03948974609375
>>> sum(d.data.nbytes + d.row.nbytes + d.col.nbytes for d in s_list) / (1024**2)
1.519744873046875
>>> s.coords.dtype
dtype('int64')
>>> (s.coords.dtype, s.data.dtype)
(dtype('int64'), dtype('float64'))
>>> (s_list[0].row.dtype, s_list[0].col.dtype, s_list[0].data.dtype)
(dtype('int32'), dtype('int32'), dtype('float64'))
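This dtype difference accounts for almost exactly the factor of two observed above. A back-of-the-envelope check, assuming float64 data and the coordinate dtypes shown in the transcript:

```python
# Per-nonzero storage cost, assuming float64 data and the dtypes above.
# sparse.COO on a 3D array: coords is a (3, nnz) int64 array plus float64 data.
coo_bytes_per_nnz = 3 * 8 + 8       # 32 bytes per stored element
# A list of 2D scipy.sparse.coo_matrix: int32 row/col arrays plus float64 data.
scipy_bytes_per_nnz = 2 * 4 + 8     # 16 bytes per stored element
print(coo_bytes_per_nnz / scipy_bytes_per_nnz)  # → 2.0
```

This matches the measured ratio, 3.039 MB / 1.520 MB ≈ 2.0.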
We did have routines to compress the dtype of s.coords to the smallest type that could address the array, but in many cases this led to overflows and bugs, so we reverted to using np.intp only. As part of the work under #618, we are planning to give the user more control over this in the compiler backend.
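To illustrate the overflow hazard (a hypothetical sketch, not the removed compression routine): once an array has more than 2**31 - 1 elements, a linearized int32 coordinate overflows, whereas np.intp remains safe on 64-bit platforms:

```python
import numpy as np

# Hypothetical sketch: linearizing the last coordinate of a 10**10-element
# array exceeds the int32 range, so int32 coords would silently overflow.
shape = (100_000, 100_000)          # 10**10 elements in total
linear = np.ravel_multi_index((99_999, 99_999), shape)  # returns np.intp
assert linear == 10**10 - 1
assert linear > np.iinfo(np.int32).max  # would not fit in int32
```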