ing-bank / sparse_dot_topn

Python package to accelerate the sparse matrix multiplication and top-n similarity selection

sparse_dot_topn

-------------------------------WARNING-------------------------------

Version 1.0 introduces major and potentially breaking changes to the API.

Please see the Migrating section below.

-------------------------------WARNING-------------------------------

sparse_dot_topn provides a fast way to perform a sparse matrix multiplication followed by top-n multiplication result selection.

In practice, comparing very large feature vectors and picking the best matches often comes down to a sparse matrix multiplication followed by selecting the top-n multiplication results.

sparse_dot_topn provides a (parallelised) sparse matrix multiplication implementation that integrates selecting the top-n values, resulting in a significantly lower memory footprint and improved performance. On an Apple M2 Pro, when multiplying two 20k x 193k TF-IDF matrices, sparse_dot_topn can be up to 6 times faster when retaining the top 10 values per row and utilising 8 cores. See the benchmark directory for details.

Usage

sp_matmul_topn supports {CSR, CSC, COO} matrices with {32, 64}bit {int, float} data. Note that COO and CSC inputs are converted to the CSR format and are therefore slower. Two options to further reduce memory requirements are threshold and density. Optionally, the values can be sorted such that the first column for a given row contains the largest value. Note that sp_matmul_topn(A, B, top_n=B.shape[1]) is equal to sp_matmul(A, B) and A.dot(B).

import scipy.sparse as sparse
from sparse_dot_topn import sp_matmul, sp_matmul_topn

A = sparse.random(1000, 100, density=0.1, format="csr")
B = sparse.random(100, 2000, density=0.1, format="csr")

# Compute C and retain the top 10 values per row
C = sp_matmul_topn(A, B, top_n=10)

# or parallelised matrix multiplication without top-n selection
C = sp_matmul(A, B, n_threads=2)
# or with top-n selection
C = sp_matmul_topn(A, B, top_n=10, n_threads=2)

# If you are only interested in values above a certain threshold
C = sp_matmul_topn(A, B, top_n=10, threshold=0.8)

# If you set the threshold, the number of non-zero entries cannot easily be
# determined beforehand. Therefore, memory is allocated for
# `ceil(top_n * A.shape[0] * density)` non-zero entries. You can set the expected
# density to reduce the number of pre-allocated entries. Note that if too little
# is allocated, one or more expensive copies will be needed.
C = sp_matmul_topn(A, B, top_n=10, threshold=0.8, density=0.1)
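
The pre-allocation rule in the comment above can be worked through by hand. The following is a minimal sketch in plain Python that does not call sparse_dot_topn; `preallocated_entries` is a hypothetical helper name used only for illustration, and it simply evaluates the formula quoted in the comment.

```python
import math


def preallocated_entries(top_n: int, n_rows: int, density: float) -> int:
    """Evaluate ceil(top_n * n_rows * density), the formula quoted above."""
    return math.ceil(top_n * n_rows * density)


# A has 1000 rows and at most 10 values are kept per row.
full = preallocated_entries(top_n=10, n_rows=1000, density=1.0)
half = preallocated_entries(top_n=10, n_rows=1000, density=0.5)
print(full, half)  # 10000 5000
```

A lower expected density reserves fewer entries up front, at the risk of a reallocation and copy if more results than expected pass the threshold.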

Installation

sparse_dot_topn provides wheels for CPython 3.8 to 3.12 for:

  • Windows (64bit)
  • Linux (64bit)
  • macOS (x86 and ARM)

pip install sparse_dot_topn

sparse_dot_topn relies on a C++ extension for the computationally intensive multiplication routine. Note that the wheels vendor (ship) OpenMP with the extension to provide parallelisation out-of-the-box. If you run into issues with OpenMP, see INSTALLATION.md for help.

Installing from source requires a C++17 compatible compiler. If you have a compiler available it is advised to install without the wheel as this enables architecture specific optimisations.

You can install from source using:

pip install sparse_dot_topn --no-binary sparse_dot_topn

Build configuration

sparse_dot_topn provides some configuration options when building from source. Building from source can enable architecture specific optimisations and is recommended for those that have a C++ compiler installed. See INSTALLATION.md for details.

Migrating to v1

sparse_dot_topn v1 is a significant change from v0.* with new bindings and a new API. The new version adds support for CPython 3.12 and now supports both ints and floats. Internally we switched to a max-heap to collect the top-n values, which significantly reduces the memory footprint. The former implementation had O(n_columns) complexity for the top-n selection, where we now have O(top-n) complexity. awesome_cossim_topn has been deprecated and will be removed in a future version.
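
To make the heap-based selection concrete, here is a minimal pure-Python sketch of keeping only the top-n values of a single result row. It is an illustration under stated assumptions, not the package's C++ implementation: `top_n_of_row` is a hypothetical name, the row is modelled as a dict of column index to value, the threshold is assumed to be a strict lower bound, and Python's heapq (a min-heap bounded to top_n entries) stands in for the internal heap.

```python
import heapq


def top_n_of_row(row, top_n, threshold=None, sort=False):
    """Return up to top_n (column, value) pairs from one row of results.

    row: dict mapping column index -> value (hypothetical stand-in for a
    sparse row). The heap never holds more than top_n entries.
    """
    if top_n <= 0:
        return []
    heap = []  # min-heap of (value, column); smallest kept value is heap[0]
    for col, val in row.items():
        if threshold is not None and val <= threshold:
            continue  # assumption: only values strictly above the threshold count
        if len(heap) < top_n:
            heapq.heappush(heap, (val, col))
        elif val > heap[0][0]:
            heapq.heapreplace(heap, (val, col))  # evict the smallest kept value
    if sort:
        ordered = sorted(heap, key=lambda e: -e[0])  # largest value first
    else:
        ordered = sorted(heap, key=lambda e: e[1])  # original column order
    return [(col, val) for val, col in ordered]


row = {0: 0.2, 3: 0.9, 5: -0.4, 7: 0.5, 8: 0.7}
print(top_n_of_row(row, 3))             # [(3, 0.9), (7, 0.5), (8, 0.7)]
print(top_n_of_row(row, 3, sort=True))  # [(3, 0.9), (8, 0.7), (7, 0.5)]
```

The working memory per row is bounded by top_n rather than by the number of columns, which is the property the paragraph above describes.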

Users should switch to sp_matmul_topn, which is largely compatible.

For example:

C = awesome_cossim_topn(A, B, ntop=10)

can be replicated using:

C = sp_matmul_topn(A, B, top_n=10, threshold=0.0, sort=True)

API changes

  1. ntop has been renamed to top_n
  2. lower_bound has been renamed to threshold
  3. use_threads and n_jobs have been combined into n_threads
  4. return_best_ntop option has been removed
  5. test_nnz_max option has been removed
  6. B is auto-transposed when its shape is not compatible but its transpose is.
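
The renames above can be summarised as a small keyword translation. This is a hedged sketch only: `translate_kwargs` is a hypothetical helper, the old default values shown (lower_bound=0.0, use_threads=False, n_jobs=1) are assumptions, and the way use_threads and n_jobs are folded into n_threads is one plausible reading of item 3. It builds a dict and does not call the package.

```python
def translate_kwargs(ntop, lower_bound=0.0, use_threads=False, n_jobs=1):
    """Map awesome_cossim_topn-style arguments to sp_matmul_topn-style keywords.

    Hypothetical helper; the old defaults in the signature are assumptions.
    """
    new_kwargs = {
        "top_n": ntop,             # item 1: ntop -> top_n
        "threshold": lower_bound,  # item 2: lower_bound -> threshold
        "sort": True,              # the old function returned sorted results
    }
    if use_threads:
        # item 3 (assumed mapping): threading requested -> pass the job count
        new_kwargs["n_threads"] = n_jobs
    return new_kwargs


print(translate_kwargs(10))
# {'top_n': 10, 'threshold': 0.0, 'sort': True}
```

The resulting dict could then be unpacked into the new call, for example sp_matmul_topn(A, B, **translate_kwargs(10)).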

The output of return_best_ntop can be replicated with:

import numpy as np

C = sp_matmul_topn(A, B, top_n=10)
best_ntop = np.diff(C.indptr).max()
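
Why this works: in a CSR matrix, indptr[i + 1] - indptr[i] is the number of stored entries in row i, so the maximum difference is the largest number of values retained for any row. A minimal plain-Python sketch of the same computation, using a hand-written indptr list and a hypothetical helper name:

```python
def max_entries_per_row(indptr):
    """Largest number of stored entries in any row, given a CSR indptr array."""
    return max((b - a for a, b in zip(indptr, indptr[1:])), default=0)


indptr = [0, 2, 5, 6]  # three rows holding 2, 3 and 1 entries
print(max_entries_per_row(indptr))  # 3
```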

Default changes

  1. threshold is no longer 0.0 but disabled by default

This enables proper functioning for matrices that contain negative values. Additionally, a different data structure is used internally when collecting non-zero results, which has a much lower memory footprint than before. This means that the effect of the threshold parameter on performance and memory requirements is negligible. If the threshold is None, the number of non-zero entries is pre-computed, which can significantly reduce the required memory at a mild (~10%) performance penalty.
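
The effect of the old default on negative values can be seen in a small plain-Python sketch. It is not the package's code: `keep_top_n` is a hypothetical function operating on a list of values for one row, and it assumes the threshold is a strict lower bound.

```python
def keep_top_n(values, top_n, threshold=None):
    """Keep the top_n largest values, optionally only those above threshold."""
    kept = [v for v in values if threshold is None or v > threshold]
    return sorted(kept, reverse=True)[:top_n]


row = [-0.9, -0.1, -0.5]
print(keep_top_n(row, 2, threshold=0.0))  # [] (every value is filtered out)
print(keep_top_n(row, 2))                 # [-0.1, -0.5]
```

With a threshold of 0.0 a row of negative results is emptied entirely, whereas with the threshold disabled the two largest values are still returned.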

  2. sort = False, the result matrix is no longer sorted by default

The matrix is returned with the same column order as if no filtering of the top-n results had taken place. This means that when you set top_n equal to the number of columns of B you obtain the same result as normal multiplication, i.e. sp_matmul_topn(A, B, top_n=B.shape[1]) is equal to A.dot(B).

Contributing

Contributions are very welcome. Please see CONTRIBUTING for details.

Contributors

This package was developed and is maintained by authors (previously) affiliated with ING Analytics Wholesale Banking Advanced Analytics. The original implementation was based on a modified version of SciPy's CSR multiplication implementation. You can read about it in a blog (mirror) written by Zhe Sun.

License

Apache License 2.0