hdbscan

Question

jwijffels opened this issue 4 years ago

I'm trying out top2vec, an algorithm for clustering texts, as implemented by @michalovadek.
This algorithm first applies doc2vec to the texts to get document embeddings, then reduces these embeddings to a lower-dimensional space using uwot::umap, and finally applies dbscan::hdbscan to find clusters.
When I try this on a corpus of approximately 50000 documents, passing a 2D matrix to hdbscan fails in its internal call to dist. A reproducible example with some fake data is shown below. Is there a way for hdbscan to handle more rows (possibly related to issue #35)?

> library(dbscan)
> docs_umap <- matrix(rnorm(50000*2), ncol = 2)
> cl <- dbscan::hdbscan(docs_umap, minPts = 15L)
Error in dist(x, method = "euclidean") : 
  negative length vectors are not allowed
> cl <- dbscan::hdbscan(head(docs_umap, 10000), minPts = 15L)
> str(cl$cluster)
 num [1:10000] 0 4 4 4 0 4 4 0 4 4 ...
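
For context, here is a rough estimate of what dist has to allocate for 50000 rows. This is a back-of-the-envelope sketch in base R, not something taken from the dbscan or R sources. The guess that the "negative length" message comes from a 32-bit integer overflow in the intermediate product is an assumption.

```r
# dist() stores one value per unordered pair of rows: n * (n - 1) / 2 entries
n <- 50000
n_pairs <- n * (n - 1) / 2              # double arithmetic, so no overflow here
n_pairs                                 # 1249975000

# Each entry is an 8-byte double
gib <- n_pairs * 8 / 2^30
round(gib, 2)                           # 9.31 (GiB), before any copy is made

# The intermediate product n * (n - 1) does not fit in a 32-bit integer
# (assumption: this is what turns the requested length negative)
overflows <- n * (n - 1) > .Machine$integer.max
overflows                               # TRUE
```

So even if the length were computed correctly, the distance object alone would need roughly 9 GiB, and the mutual reachability matrix derived from it would need about as much again.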

Michael Hahsler · Answer 1 · Wed Feb 17 2021 04:46:54 GMT+0800 (China Standard Time)

hdbscan needs to compute a minimum spanning tree (MST) on the mutual reachability matrix (which is calculated from the distance matrix). What we would need is a way, at least for Euclidean distance, to go from the data directly to the MST without storing the whole distance/mutual reachability matrix. I am not quite sure how to do that... Ideas?
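
One possible direction, sketched here under stated assumptions and not as what dbscan currently does: Prim's algorithm only needs one row of the mutual reachability matrix at a time, and each row can be recomputed from the data and the core distances. That trades O(n^2) memory for O(n) memory while keeping O(n^2) time. The function names `core_distances` and `mst_mutual_reachability` are hypothetical. The sketch also assumes Euclidean distance and a core distance defined as the distance to the `min_pts`-th nearest point, counting the point itself.

```r
# Hypothetical helper: core distance of every point, one row of distances at a time.
# tx is the transposed data (one column per point).
core_distances <- function(tx, min_pts) {
  vapply(seq_len(ncol(tx)), function(i) {
    d <- sqrt(colSums((tx - tx[, i])^2))  # distances from point i to all points
    sort(d)[min_pts]                      # assumption: the point itself is counted
  }, numeric(1))
}

# Hypothetical sketch: Prim's algorithm on mutual reachability distances,
# never holding more than one row of the matrix in memory.
mst_mutual_reachability <- function(x, min_pts) {
  n <- nrow(x)
  tx <- t(x)
  core <- core_distances(tx, min_pts)
  in_tree <- logical(n)
  best <- rep(Inf, n)             # cheapest known edge from the tree to each point
  parent <- rep(NA_integer_, n)   # tree endpoint of that cheapest edge
  edges <- matrix(NA_real_, nrow = n - 1, ncol = 3,
                  dimnames = list(NULL, c("from", "to", "weight")))
  current <- 1L
  in_tree[current] <- TRUE
  for (step in seq_len(n - 1)) {
    # One row of the mutual reachability matrix: max(core(a), core(b), d(a, b))
    d <- sqrt(colSums((tx - tx[, current])^2))
    mr <- pmax(d, core, core[current])
    update <- !in_tree & mr < best
    best[update] <- mr[update]
    parent[update] <- current
    # Pick the cheapest point that is not yet in the tree
    candidates <- best
    candidates[in_tree] <- Inf
    nxt <- which.min(candidates)
    edges[step, ] <- c(parent[nxt], nxt, best[nxt])
    in_tree[nxt] <- TRUE
    current <- nxt
  }
  edges
}

# Small usage example on fake data
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
mst <- mst_mutual_reachability(x, min_pts = 15L)
dim(mst)
```

The core distance step is brute force here. A nearest-neighbour search on a spatial index would avoid that quadratic pass, but the Prim loop itself stays quadratic in time, so this is only a starting point for discussion.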

jwijffels · Answer 2 · Wed Feb 17 2021 19:01:32 GMT+0800 (China Standard Time)

Initially I thought that this could be covered by some bigmemory backend or even ALTREP, but there are probably smarter ways. I should take a closer look at the mutual reachability matrix calculation (https://github.com/mhahsler/dbscan/blob/master/src/mrd.cpp#L6) before I can give you any ideas.