Teichlab / bbknn

Batch balanced KNN

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

considering using `pynndescent` for appox nn search?

giovp opened this issue · comments

no more annoy dependency, pynndescent is also default for umap, and it performs much better

https://github.com/lmcinnes/pynndescent

image

This is a pretty good idea actually, thanks for the suggestion - I've been annoyed by annoy for a while as the most recent version throws segfaults all over the place. This also resolves the whole angular metric discrepancy between annoy and UMAP. Need to see how easy this is to install and how efficient index construction is (I remember some of the advertised fast query approaches spending forever crafting theirs, negating the point), but I'm hopeful!

What do you mean by this being a UMAP default by the way?

right, had those segfaults issues just recently 😅

What do you mean by this being a UMAP default by the way?

since umap 0.5 pynndescent is a hard dependency
https://umap-learn.readthedocs.io/en/latest/release_notes.html#what-s-new-in-0-5

Just dropping a line that I haven't forgotten about this, but I've had other stuff on my plate recently. I've finally done the long-overdue BBKNN refactor in preparation for this, creating a matrix division like in RBCDE and MultiMAP.

I'll give pynndescent a shot at some point soon and very likely switch over.

I've finished this up, and it's live in 1.5.0.

Personally, I've found pynndescent run times weirdly slow. The pancreas data is 4x slower than annoy, and even 2x slower than cKDTree - an exact neighbour search algorithm. However, pynndescent does natively support a lot of metrics, and ingests custom functions too, so supporting it increases the package's functionality.