Quadratic complexity of sparse matrix multiplication
espdev opened this issue
Hello!
I'm trying to use the sprs package in my experimental code, and I have encountered very slow multiplication of large sparse matrices.
My code uses large diagonal matrices in CSR format. The matrices might contain from hundreds or thousands up to millions of NNZ elements. Multiplication of such matrices in sprs has almost quadratic complexity, O(nnz^2), so we cannot currently use sprs in production code.
I wrote some tests in Rust+sprs and Python+scipy. SciPy uses the SMMP algorithm (Sparse Matrix Multiplication Package) and performs multiplication of such matrices in tens to hundreds of milliseconds.
The code of my benchmarks can be found in this gist: https://gist.github.com/espdev/d278962828ce7360d653b8679d7bb511
Could you consider using the SMMP algorithm in sprs? This algorithm has complexity O(n_row*K^2 + max(n_row, n_col)), where K is the maximum nnz in a row of A and in a column of B.
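For reference, here is a rough sketch of the idea behind SMMP's symbolic phase, written as a hypothetical Rust helper (`smmp_symbolic` is an illustration of the technique, not scipy's Fortran code or any sprs API). It shows where the per-row K^2 cost comes from:

```rust
/// Rough illustration of SMMP's symbolic phase: compute the column
/// pattern of each row of C = A * B for CSR inputs, using a linked
/// list threaded through a scratch array of length `b_cols`.
/// Each row of A touches at most K rows of B, each with at most K
/// non-zeros, hence the O(n_row * K^2) term.
fn smmp_symbolic(
    a_indptr: &[usize], a_indices: &[usize],
    b_indptr: &[usize], b_indices: &[usize],
    b_cols: usize,
) -> (Vec<usize>, Vec<usize>) {
    const UNSEEN: usize = usize::MAX;  // column not yet in this row
    const END: usize = usize::MAX - 1; // linked-list terminator
    let a_rows = a_indptr.len() - 1;
    let mut next = vec![UNSEEN; b_cols]; // reusable linked-list storage
    let mut c_indptr = vec![0usize; a_rows + 1];
    let mut c_indices = Vec::new();

    for i in 0..a_rows {
        let mut head = END;
        for &k in &a_indices[a_indptr[i]..a_indptr[i + 1]] {
            for &j in &b_indices[b_indptr[k]..b_indptr[k + 1]] {
                if next[j] == UNSEEN {
                    // First time column j appears in row i: link it in.
                    next[j] = head;
                    head = j;
                }
            }
        }
        // Walk the list to emit the pattern, resetting `next` as we go.
        let mut j = head;
        while j != END {
            c_indices.push(j);
            let nj = next[j];
            next[j] = UNSEEN;
            j = nj;
        }
        c_indptr[i + 1] = c_indices.len();
    }
    (c_indptr, c_indices)
}
```

Note that the emitted pattern of each row is unsorted; whether sorted indices are required afterwards matters for the total cost.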
Thanks for investigating this! I have always wondered how sprs would compare to scipy.sparse performance-wise (see scipy/scipy#10201 (comment)). It would be interesting to also have this comparison on random sparse arrays, to evaluate the impact of shape and density.
That's a problem indeed, thanks for reporting it. I'll try and find some time to investigate.
I have compared sprs and scipy on random sparse matrices of various shapes and densities. It is interesting. It seems I incorrectly characterized how the sprs complexity depends on nnz: the sprs mat*mat complexity also depends directly on the matrix dimensions. Performance degrades dramatically as the matrix sizes increase, independently of nnz. In scipy, the dependency is quite different.
Here are some plots from my benchmarks.
The benchmark code can be found in this gist: https://gist.github.com/espdev/f16f139f8007d3b37d1a57835fee6a19
Interesting. Here's the main loop of the matrix multiplication in sprs: https://github.com/vbarrielle/sprs/blob/master/src/sparse/prod.rs#L207-L226
It accumulates each row of the result matrix into a dense vector before converting it to sparse. That explains the dependency on the dimension (more precisely, on the number of columns). On the other hand, this strategy becomes more attractive as the matrix gets denser.
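For illustration, a minimal sketch of that dense-accumulator strategy over raw CSR arrays (a simplification of the idea, not sprs' actual code):

```rust
/// Sketch of a dense-accumulator CSR product C = A * B.
/// Assumes A's column count equals B's row count, and that
/// `b_cols` is B's column count.
fn csr_mul_dense_acc(
    a_indptr: &[usize], a_indices: &[usize], a_data: &[f64],
    b_indptr: &[usize], b_indices: &[usize], b_data: &[f64],
    b_cols: usize,
) -> (Vec<usize>, Vec<usize>, Vec<f64>) {
    let a_rows = a_indptr.len() - 1;
    let mut c_indptr = vec![0usize; a_rows + 1];
    let mut c_indices = Vec::new();
    let mut c_data = Vec::new();
    let mut workspace = vec![0.0f64; b_cols]; // dense accumulator

    for i in 0..a_rows {
        // Scatter: accumulate row i of A times B into the workspace.
        for k in a_indptr[i]..a_indptr[i + 1] {
            let (col_a, val_a) = (a_indices[k], a_data[k]);
            for j in b_indptr[col_a]..b_indptr[col_a + 1] {
                workspace[b_indices[j]] += val_a * b_data[j];
            }
        }
        // Gather: scan the whole workspace, emitting and resetting
        // non-zeros. This O(b_cols) pass per row is where the
        // dependency on the column dimension comes from.
        for col in 0..b_cols {
            if workspace[col] != 0.0 {
                c_indices.push(col);
                c_data.push(workspace[col]);
                workspace[col] = 0.0;
            }
        }
        c_indptr[i + 1] = c_indices.len();
    }
    (c_indptr, c_indices, c_data)
}
```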
I've started a branch to implement SMMP. We'll know more once I'm done implementing it. But your graphs suggest it might be worth keeping the original product implementation around, perhaps with a modification to sparsify the structure that accumulates the result row.
Progress report: I'm done implementing SMMP symbolic and numeric in the smmp branch. I won't have time to check the performance this weekend, but I should look at it next week.
I also have a request for you, @espdev: would you be ok with contributing your benchmarks in the examples folder of sprs? I'd like to have them around to be able to monitor performance, and I want to be sure you're fine with contributing them under sprs' license.
Wow! It's great. Thank you!
I will clean and refactor my benchmarks to contribute to sprs examples soon.
Here are the performance results for smmp on your first benchmark:
(EDIT: these results have a caveat, I generated them on my unplugged laptop).
Looks good; now I have to take a look at your second benchmark. This will probably show a complexity difference between scipy's implementation of SMMP and mine: I have to sort the indices of the matrices, because that's a guarantee of the CsMat type. That makes the complexity O((n_row*K^2 + max(n_row, n_col)) * K * log(K)). That should not matter for most workloads, though.
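Per-row sorting over raw CSR slices might look like this hypothetical helper (`sort_row` is illustrative, not the sprs API); the extra K*log(K) factor comes from the sort call:

```rust
/// Hypothetical helper: sort one CSR row's column indices in place,
/// permuting the values to match. Assumes `indices` and `values`
/// have the same length. Keeping indices sorted is the CsMat
/// guarantee that adds the K*log(K) factor.
fn sort_row(indices: &mut [usize], values: &mut [f64]) {
    let mut pairs: Vec<(usize, f64)> = indices
        .iter().copied()
        .zip(values.iter().copied())
        .collect();
    pairs.sort_unstable_by_key(|&(col, _)| col);
    for (i, (col, val)) in pairs.into_iter().enumerate() {
        indices[i] = col;
        values[i] = val;
    }
}
```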
Here are the results for the random matrices:
We can see that sprs is lagging behind now. I've profiled it, and most of the time is spent sorting the indices. If I disable index sorting, sprs is more performant:
I'm going to investigate alternative index-sorting methods. I can try using an array of booleans when the nnz count gets too large.
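Something like this hypothetical sketch, where a reusable boolean array yields sorted indices via a linear scan instead of a sort:

```rust
/// Emit the columns touched in one result row, in sorted order,
/// without sorting: mark them in a reusable `seen` bitmap, then scan
/// the touched range linearly. Costs O(max_col - min_col) per row
/// instead of O(k * log(k)). `seen` must be all-false on entry and is
/// reset on exit. Illustrative only, not sprs' implementation.
fn gather_sorted_row(
    seen: &mut [bool],  // length = number of columns of B
    touched: &[usize],  // unsorted column indices hit in this row
    out: &mut Vec<usize>,
) {
    if touched.is_empty() {
        return;
    }
    let mut min_col = usize::MAX;
    let mut max_col = 0;
    for &col in touched {
        seen[col] = true;
        min_col = min_col.min(col);
        max_col = max_col.max(col);
    }
    for col in min_col..=max_col {
        if seen[col] {
            out.push(col);
            seen[col] = false; // reset for the next row
        }
    }
}
```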
I noticed that the matrix sizes differ in the legend text on the graphs:
sprs (15000, 2500)
scipy (15000, 25000)
It looks like it is just a typo.
Indeed, I meant to type (15000, 25000) for both.
Small update: I've figured out that I can dynamically switch between using the "linked list" indices from SMMP followed by a sort, and using a linear scan of the encountered nnz indices, depending on the expected amount of work.
This makes the performance much better when there is a good amount of non-zeros:
(EDIT: the legend is wrong once again, I forgot to remove the "(unsorted)"; it is sorted...)
My test for switching between the two methods is a bit naive right now, but it seems to work well enough. In short, for each row, I compare the number of elements a linear scan would cover (the difference between the max and min encountered nnz columns) against the amount of work sorting would take (k * log(k), with k the number of nnz in the row). I should probably do some timings to determine a more precise bound. There are also a few optimizations that could be done for the linear scan, as it could take advantage of SIMD.
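A simplified version of that per-row test might look like the following (illustrative only; the exact condition used in the branch may differ):

```rust
/// Decide per row whether a linear scan over the column range beats
/// sorting the k collected indices. `min_col`/`max_col` are the
/// smallest and largest column indices encountered in this row.
fn use_linear_scan(min_col: usize, max_col: usize, k: usize) -> bool {
    if k == 0 {
        return true; // empty row: nothing to sort or scan
    }
    let scan_work = max_col - min_col + 1;
    // Rough integer estimate of k * log2(k) as the sorting cost.
    let sort_work = k * (k.ilog2() as usize + 1);
    scan_work <= sort_work
}
```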
Quick mention that sparse product performance has been further improved in #201.