kno10 / python-kmedoids

Fast K-Medoids clustering in Python with FasterPAM

Document memory requirements

j-adamczyk opened this issue · comments

The main disadvantage of KMedoids from scikit-learn-extra is that it precomputes all pairwise distances, which makes the memory requirement O(n^2). This renders KMedoids unusable on larger datasets.

I think the memory scalability should be documented in python-kmedoids.

EDIT: As far as I understand, this would also be O(n^2) here, since a distance matrix is expected. But still, explicitly documenting this would be nice.
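
For a sense of scale, here is a rough back-of-envelope estimate (assuming a dense float32 matrix, 4 bytes per entry) of how quickly the precomputed matrix outgrows main memory:

```python
# Approximate memory footprint of a dense n x n float32 distance matrix.
for n in (10_000, 100_000, 1_000_000):
    gib = n * n * 4 / 2**30  # 4 bytes per float32 entry
    print(f"n = {n:>9,}: {gib:,.1f} GiB")
# n =    10,000: 0.4 GiB
# n =   100,000: 37.3 GiB
# n = 1,000,000: 3,725.3 GiB
```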

Yes, memory requirements can be added to the documentation.

scikit-learn-extra also needs O(n²) memory, see notes in their documentation:

Since all pairwise distances are calculated and stored in memory for the duration of fit, the space complexity is O(n_samples ** 2).

and in the code: D = pairwise_distances(X, metric=self.metric)
https://github.com/scikit-learn-contrib/scikit-learn-extra/blob/627f97b011cb267828e89cdf9257e35f59b328e7/sklearn_extra/cluster/_k_medoids.py#L239

It is in the nature of this problem that it needs pairwise distances several times, and hence precomputation is usually preferred. If you use the Rust package, you could write your own implementation of the distance access function to compute distances on demand. In fact, with FasterPAM this may sometimes be okay if your distance function is cheap enough and you only need a few iterations. With an expensive distance function, say dynamic time warping, you will want to use a distance matrix.
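
For reference, a rough sketch of the precomputed-matrix workflow with this package is shown below (exact parameter names and result fields may differ slightly between releases):

```python
import numpy as np
import kmedoids
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8))            # 1,000 points, 8 features

# The full n x n dissimilarity matrix is what dominates memory: O(n^2).
diss = pairwise_distances(X, metric="euclidean")

result = kmedoids.fasterpam(diss, 5)          # k = 5 clusters
print(result.loss)     # total deviation of the clustering
print(result.medoids)  # indices of the chosen medoids
print(result.labels)   # cluster assignment for each point
```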

For larger data sets, CLARA (with FasterPAM) and (Fast)CLARANS are an option, but we have not added these to the Rust/Python packages yet, because we would then need to depend on some other package for distance functions (and the supported distances might then differ unexpectedly from sklearn).
CLARA is simply (Faster)PAM on a sample, keeping those medoids that perform best on the entire data set; you can easily do this in a wrapper (a sketch follows below).
If you want to try larger data sets, use FastCLARANS in ELKI for now.
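
A minimal sketch of such a CLARA-style wrapper (the clara_like helper and its parameters are made up for illustration; it assumes Euclidean distances via sklearn and the fasterpam entry point of this package):

```python
import numpy as np
import kmedoids
from sklearn.metrics import pairwise_distances

def clara_like(X, k, n_sub=500, n_reps=5, seed=0):
    """CLARA-style sketch: run FasterPAM on random subsamples and keep the
    medoids with the lowest total deviation on the *full* data set."""
    rng = np.random.default_rng(seed)
    best_loss, best_medoids = np.inf, None
    for _ in range(n_reps):
        sample = rng.choice(len(X), size=min(n_sub, len(X)), replace=False)
        diss = pairwise_distances(X[sample])        # only O(n_sub^2) memory
        res = kmedoids.fasterpam(diss, k)
        medoids = sample[np.asarray(res.medoids)]   # map back to full-data indices
        # Evaluate these medoids on the entire data set (O(n * k) memory).
        loss = pairwise_distances(X, X[medoids]).min(axis=1).sum()
        if loss < best_loss:
            best_loss, best_medoids = loss, medoids
    labels = pairwise_distances(X, X[best_medoids]).argmin(axis=1)
    return best_medoids, labels, best_loss
```

The point of the sketch is that only the subsample needs a full distance matrix; the complete data set is only touched with an n × k distance computation per repetition.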

I have added the memory requirements to the documentation; this will be in the next release.