Much Slower than Scikit-Learn kMedoids on Large Dimensions

Question

kayuksel opened this issue 2 years ago

Hello!

We have a problem where we need to select a subset of thousand items
from hundred items using k-Medoids clustering of embeddings (512 dim).

We were using the k-Medoids implementation in Scikit-Learn. We recently tried
BanditPAM because we expected it to be much faster, but that wasn't the case.

We wanted to ask here to make sure before we eliminate it from our options.
We are looking forward to your suggestions on what we may be doing wrong.

Have a nice day.

Sincerely,
Kamer

Mo Tiwari · Answer 1 · Wed Jan 19 2022 01:22:12 GMT+0800 (China Standard Time)

Thanks for the report, @kayuksel. This looks similar to #175. I will investigate shortly and report back.

Erich Schubert · Answer 2 · Sun Jan 23 2022 22:51:13 GMT+0800 (China Standard Time)

@kayuksel you are probably referring to the sklearn-extra implementation, which is not part of regular sklearn and is not as intensively tested or maintained as sklearn.

By default it uses the Alternating algorithm, which produces worse results. Try fasterpam in the kmedoids Python package instead (and compare the resulting loss!) as a slight improvement. It is still quadratic (N²) in the number of samples, though, so it will not work for millions of samples.
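
A minimal sketch of that comparison, assuming the `kmedoids` and `scikit-learn-extra` packages are installed and the embeddings sit in an array `X` of shape (N, 512). The helper names `medoid_loss` and `compare_losses` are made up for illustration, and the choice of Euclidean distance is an assumption. Both methods are scored with the same loss (sum of distances from each point to its nearest medoid) so the numbers are directly comparable.

```python
def medoid_loss(diss, medoids):
    """Sum over all points of the distance to the nearest medoid.

    diss is an N x N dissimilarity matrix (nested lists or a NumPy array),
    medoids is a sequence of row indices.
    """
    return sum(min(row[m] for m in medoids) for row in diss)


def compare_losses(X, k, seed=0):
    """Hypothetical helper: run FasterPAM and sklearn-extra's alternating
    k-medoids on the same precomputed distances and return both losses."""
    # Third-party imports are kept local so medoid_loss works without them.
    import kmedoids
    from sklearn.metrics import pairwise_distances
    from sklearn_extra.cluster import KMedoids

    # Full N x N matrix: this is the quadratic cost mentioned above.
    diss = pairwise_distances(X, metric="euclidean")

    fp = kmedoids.fasterpam(diss, k, random_state=seed)
    alt = KMedoids(
        n_clusters=k,
        metric="precomputed",
        method="alternate",  # the sklearn-extra default, spelled out
        random_state=seed,
    ).fit(diss)

    return {
        "fasterpam": medoid_loss(diss, fp.medoids),
        "alternate": medoid_loss(diss, alt.medoid_indices_),
    }
```

Calling `compare_losses(X, k)` on a sample of the real embeddings and timing each call separately would show whether the speed difference reported above comes with a difference in clustering quality.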