comparative analysis of clustering performance

Question

stat-hejia opened this issue 3 years ago · comments

Hi Dr. Schwartz,
TooManyCells seems to be a powerful and fast algorithm for cell clustering. For a comparative analysis of clustering performance and scalability, I used ARI and the Silhouette score to evaluate the accuracy of too-many-cells, but the results were not good, while the clumpiness measure looked good. Do you have a method or R code that I could use to compare the two clustering methods and get a good result? (Just like what you did in Figure 3 of your Nature Methods paper.)
I greatly appreciate your time and look forward to your feedback and insights.

Gregory Schwartz · Answer 1 · Tue Dec 08 2020 22:53:45 GMT+0800 (China Standard Time)

I don't know how you are defining your benchmark data set, but measures such as ARI assume the true clustering is known. If you are testing based on cell type, then the true clustering would not necessarily be known, as it's unrealistic to assume that all B cells, for instance, would belong to a single cluster, since they are quite diverse. These measures don't take such features into account and thus harshly penalize splitting up such populations, which is probably why the clumpiness is indicating good performance. As the goal was not to get perfect cluster sizes but to see how good the splitting was (it's a multi-resolution visualization that's not intended to be used solely for the leaves), we instead used Entropy, Purity, and NMI, which are clustering validation indices that measure how "pure" each cluster is. We also used our rare population benchmarks (described in the paper) as another kind of test, to see how well we can split two known rare populations from each other.
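
To make the penalty concrete, here is a minimal Python sketch (not code from the paper) that computes ARI from its standard pair-counting definition on a made-up toy example. The labels `truth` and `split` are hypothetical: two cell types, each divided by the clustering into two subclusters that contain only one cell type.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the pair-counting definition.

    Assumes both label lists have the same length and at least two items.
    """
    n = len(labels_true)
    # Pairs of cells that share both the true label and the predicted cluster.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    # Pairs sharing the true label, and pairs sharing the predicted cluster.
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / comb(n, 2)
    max_index = (together_true + together_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (together_both - expected) / (max_index - expected)


# Hypothetical toy data: 6 "B" cells and 6 "T" cells.
truth = ["B"] * 6 + ["T"] * 6
# Every cluster is pure, but each cell type is split into two subclusters.
split = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]

ari_exact = adjusted_rand_index(truth, truth)
ari_split = adjusted_rand_index(truth, split)
print(ari_exact, ari_split)
```

Even though no cluster in `split` mixes cell types, its ARI is well below that of the exact match, which is the behaviour described above.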

You can check out some of these analyses here (https://github.com/GregorySchwartz/too-many-cells-paper-analyses). Let me know if this is helpful!

stat-hejia · Answer 2 · Wed Dec 09 2020 09:04:04 GMT+0800 (China Standard Time)

Thanks for the reply. I downloaded some PBMC data as my benchmark data set, and it has true labels. As you say, ARI and Silhouette are unsuitable here. But because of the diversity, there seem to be too many clusters to evaluate, and I don't know how to use the measures you recommend or how many clusters (k) to use for my result. I looked at the link you provided, but I have only learned R and Python and can't understand purity.hs. Could you provide R or Python code, or another approach, for my reference?
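
For readers in the same position, the following is a minimal Python sketch of purity, entropy, and NMI using their common textbook definitions. It is not a translation of purity.hs, and the exact definitions and normalization used in the paper may differ. Here NMI is normalized by the arithmetic mean of the two entropies, which is an assumption. The function names and the toy labels are hypothetical, and none of these measures requires fixing the number of clusters in advance.

```python
from collections import Counter
from math import log


def _entropy(counts):
    """Shannon entropy (natural log) of a collection of counts."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def _labels_per_cluster(labels_true, labels_pred):
    """Map each predicted cluster to a Counter of the true labels inside it."""
    per_cluster = {}
    for t, p in zip(labels_true, labels_pred):
        per_cluster.setdefault(p, Counter())[t] += 1
    return per_cluster


def purity(labels_true, labels_pred):
    """Fraction of cells that carry the majority true label of their cluster."""
    per_cluster = _labels_per_cluster(labels_true, labels_pred)
    return sum(max(c.values()) for c in per_cluster.values()) / len(labels_true)


def cluster_entropy(labels_true, labels_pred):
    """Size-weighted mean entropy of the true labels within each cluster."""
    n = len(labels_true)
    per_cluster = _labels_per_cluster(labels_true, labels_pred)
    return sum(
        sum(c.values()) / n * _entropy(c.values()) for c in per_cluster.values()
    )


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two label entropies."""
    h_true = _entropy(Counter(labels_true).values())
    h_pred = _entropy(Counter(labels_pred).values())
    # I(T; C) = H(T) - H(T | C), and H(T | C) is the weighted cluster entropy.
    mutual_info = h_true - cluster_entropy(labels_true, labels_pred)
    denom = (h_true + h_pred) / 2
    return mutual_info / denom if denom > 0 else 1.0


# Hypothetical toy data: 6 "B" cells and 6 "T" cells.
truth = ["B"] * 6 + ["T"] * 6
pure_split = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]  # pure but over-split
mixed = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0]       # one cell misplaced per type

for name, pred in [("pure_split", pure_split), ("mixed", mixed)]:
    print(name, purity(truth, pred), cluster_entropy(truth, pred), nmi(truth, pred))
```

Higher purity and lower entropy mean that each cluster is dominated by a single true label, regardless of how many clusters a label is spread across. NMI still drops somewhat when a label is split, so it is worth reporting alongside the other two rather than alone.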