lmcinnes / umap

Uniform Manifold Approximation and Projection

Cluster clusters

timyerg opened this issue

Hello!
Thank you for the tool.
I am dealing with a very large dataset and trying to reduce memory requirements. I tried setting low_memory to True, parametric UMAP, PCA reduction, and other approaches, but the memory requirements are still too high for my purposes.
I am working with features from different samples. Each sample may contain more than 10000 unique features.
Now my idea is:

  1. Cluster features within each sample; each cluster should contain at least 100 features.
  2. Select a representative feature for each cluster (I have an algorithm for that based on feature properties), or several features.
  3. Cluster the representative features, pooling all samples; at this stage a cluster counts as a cluster even if it contains only 1 feature.
  4. Reassign the features from step 1 to the clusters from step 3.
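
To make the four steps concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the names `kmeans` and `two_stage_cluster` are hypothetical, a tiny hand-written k-means stands in for whatever clustering method is actually used, and "the member closest to the centroid" stands in for my own representative-selection algorithm. Note that choosing k as `len(features) // min_size` only makes the average cluster size at least `min_size`; it does not guarantee the minimum.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(points, k, iters=20, seed=0):
    """Tiny Lloyd's k-means (placeholder clusterer); returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave the centroid unchanged if its cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(p, centroids) for p in points]


def two_stage_cluster(samples, min_size=100, global_k=10, seed=0):
    """samples: list of non-empty lists of feature vectors.
    Returns one list of global cluster labels per sample."""
    reps = []   # representatives pooled over all samples
    local = []  # per sample: index into `reps` for every feature
    for features in samples:
        # Step 1: cluster within the sample (average size >= min_size only).
        k = max(1, len(features) // min_size)
        centroids, labels = kmeans(features, k, seed=seed)
        # Step 2: one representative per local cluster
        # (assumption: the member closest to the centroid).
        rep_index = {}
        for c in set(labels):
            members = [f for f, lab in zip(features, labels) if lab == c]
            rep_index[c] = len(reps)
            reps.append(min(members, key=lambda f: sq_dist(f, centroids[c])))
        local.append([rep_index[lab] for lab in labels])
    # Step 3: cluster only the pooled representatives.
    _, rep_labels = kmeans(reps, min(global_k, len(reps)), seed=seed)
    # Step 4: every feature inherits the global label of its representative.
    return [[rep_labels[i] for i in idx] for idx in local]
```

The point of this layout is that the pooled stage (step 3) only ever sees the representatives, roughly one per 100 features, rather than all features from all samples at once.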

In that way I am hoping to reduce memory consumption.

Could you please give me your opinion on this approach? Something like "better not to do it" or "may work"?

Best,