Cluster clusters
timyerg opened this issue · comments
Timur Yergaliyev commented
Hello!
Thank you for the tool.
I am dealing with a very large dataset and trying to reduce memory requirements. I tried setting low_memory to True, parametric UMAP, PCA reduction, and other options, but the memory requirements are still too high for my purposes.
I am working with features from different samples. Each sample may contain more than 10000 unique features.
Now my idea is:
1. Cluster features within each sample; each cluster should contain at least 100 features.
2. Select one representative feature (or several) for each cluster. I have an algorithm for that based on feature properties.
3. Cluster the representative features pooled across all samples; here a cluster counts as a cluster even if it contains only 1 feature.
4. Reassign the features from step 1 to the clusters from step 3.
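
To make the idea concrete, here is a minimal pure-Python sketch of the four steps. Everything in it is an assumption for illustration: `leader_cluster` is a toy stand-in for the real clustering call (it does not enforce the minimum of 100 features per cluster), `representative` is a placeholder for my own selection algorithm, and `two_stage_cluster`, `local_radius`, and `global_radius` are hypothetical names, not part of any library.

```python
import math
from collections import defaultdict

def leader_cluster(points, radius):
    """Toy stand-in clusterer (assumption, not the real tool): a point joins
    the first leader within `radius`, otherwise it becomes a new leader."""
    leaders, labels = [], []
    for p in points:
        for i, q in enumerate(leaders):
            if math.dist(p, q) <= radius:
                labels.append(i)
                break
        else:
            leaders.append(p)
            labels.append(len(leaders) - 1)
    return labels

def representative(points):
    """Placeholder selection rule: the member closest to the cluster centroid."""
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return min(points, key=lambda p: math.dist(p, centroid))

def two_stage_cluster(samples, local_radius, global_radius):
    """samples: dict of sample_id -> list of feature vectors (tuples).
    Returns dict of (sample_id, feature_index) -> global cluster label."""
    reps = []     # pooled representatives, one per local cluster
    members = []  # members[i]: (sample_id, feature_index) keys behind reps[i]
    for sample_id, feats in samples.items():
        local = leader_cluster(feats, local_radius)        # step 1
        groups = defaultdict(list)
        for idx, lab in enumerate(local):
            groups[lab].append(idx)
        for idxs in groups.values():                       # step 2
            reps.append(representative([feats[i] for i in idxs]))
            members.append([(sample_id, i) for i in idxs])
    global_labels = leader_cluster(reps, global_radius)    # step 3
    assignment = {}
    for lab, keys in zip(global_labels, members):          # step 4
        for key in keys:
            assignment[key] = lab
    return assignment

# Tiny made-up example: two samples with 2-D features.
samples = {
    "s1": [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)],
    "s2": [(0.05, 0.05), (5.05, 5.05), (9.0, 0.0)],
}
result = two_stage_cluster(samples, local_radius=0.5, global_radius=0.5)
```

The point of this layout is that the pooled step only ever sees one vector per local cluster, so its memory cost should scale with the number of local clusters rather than with the total number of features.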
That way, I am hoping to keep memory consumption manageable.
Could you please give me your opinion on this approach? Something like "better not to do it" or "may work" would already help.
Best,