motiwari / BanditPAM

BanditPAM C++ implementation and Python package

Much Slower than Scikit-Learn kMedoids on Large Dimensions

kayuksel opened this issue

Hello!

We have a problem where we need to select a subset of thousand items
from hundred items using k-Medoids clustering of embeddings (512 dim).

We were using the k-Medoids implementation in Scikit-Learn. We recently tried
BanditPAM because we expected it to be much faster, but that wasn't the case.

We wanted to ask here to make sure before we eliminate it from our options.
We are looking forward to your suggestions on what we may be doing wrong.

Have a nice day.

Sincerely,
Kamer

Thanks for the report, @kayuksel. Looks similar to #175. I will investigate shortly and report back.

@kayuksel you are probably referring to the sklearn-extra implementation, which is not part of regular sklearn and is not as intensively tested or maintained as sklearn.

By default it uses the Alternating algorithm, which produces worse results. Try fasterpam from the kmedoids Python package instead, as a slight improvement (and compare the resulting loss!). It is still N², though, so it will not work for millions of samples.
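A minimal sketch of that comparison, in case it helps. It assumes the `kmedoids`, `scikit-learn`, and `scikit-learn-extra` packages are installed and that `kmedoids.fasterpam(diss, k, random_state=...)` returns a result with a `medoids` attribute, which is my understanding of the package but worth checking against your installed version. The helper names (`kmedoids_loss`, `dense_matrix_gib`, `compare`) are hypothetical and only for illustration. Both losses are computed on the same Euclidean distance matrix so that they are directly comparable.

```python
import numpy as np


def kmedoids_loss(diss, medoids):
    """Sum over all points of the dissimilarity to the nearest medoid."""
    diss = np.asarray(diss)
    medoids = np.asarray(medoids, dtype=np.intp)
    return float(diss[:, medoids].min(axis=1).sum())


def dense_matrix_gib(n, bytes_per_entry=8):
    """Memory needed for a dense n x n dissimilarity matrix, in GiB."""
    return n * n * bytes_per_entry / 2**30


def compare(X, k, seed=0):
    """Run both implementations on the same data and return their losses.

    The third-party imports are deferred so the helpers above can be used
    without these packages installed. The calls below are assumptions about
    the package APIs, not taken from this thread.
    """
    import kmedoids
    from sklearn.metrics.pairwise import euclidean_distances
    from sklearn_extra.cluster import KMedoids

    diss = euclidean_distances(X)  # dense N x N matrix, hence the N^2 cost
    fp = kmedoids.fasterpam(diss, k, random_state=seed)
    alt = KMedoids(n_clusters=k, method="alternate", random_state=seed).fit(X)
    return {
        "fasterpam": kmedoids_loss(diss, fp.medoids),
        "alternate": kmedoids_loss(diss, alt.medoid_indices_),
    }
```

Calling `compare(X, k)` on a subsample of the embeddings gives a quick check of which method reaches the lower loss, and `dense_matrix_gib(n)` shows how fast the distance matrix grows with the number of samples before you try the full dataset.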