Purity section in too-many-cells

Question


stat-hejia opened this issue 4 years ago

Dear Gregory:
Thanks for building such a nice tool. I want to compare the accuracy of clustering algorithms by measuring how closely the clusters match the true labels, which is called 'cluster purity' in your article. I am a bit confused about this part, and I cannot find this parameter in the help page of too-many-cells.
I tried to run the source code, but I have not learned the programming language used in the 'purity' part, so it is difficult for me at present.
I want to ask whether the too-many-cells pipeline has a purity parameter that I may have overlooked, or whether you have R code for 'purity' that you could provide for reference.
Thanks for your time!

Gregory Schwartz · Answer 1 · Tue Oct 13 2020 21:16:20 GMT+0800 (China Standard Time)

The benchmarking is not included in the too-many-cells tool itself. However, you can always use the diversity entry point to get the diversity of labels for the leaf nodes and see whether they are close to 1. For the manuscript, the purity, entropy, and NMI were calculated post-clustering for all algorithms (to be consistent).
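For readers who want to compute such measures themselves after clustering, here is a minimal Python sketch. It is not the code used for the manuscript: it uses common textbook definitions of purity, size-weighted cluster entropy, and normalized mutual information (arithmetic-mean normalization), and the manuscript's exact definitions may differ. All function names are hypothetical, and the inputs are assumed to be two equal-length lists giving each cell's cluster and its true label.

```python
from collections import Counter, defaultdict
from math import log


def label_counts_per_cluster(clusters, labels):
    """Map each cluster id to a Counter of the true labels it contains."""
    counts = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        counts[cluster][label] += 1
    return counts


def purity(clusters, labels):
    """Fraction of cells carrying the majority label of their cluster (1 = pure)."""
    counts = label_counts_per_cluster(clusters, labels)
    return sum(max(c.values()) for c in counts.values()) / len(labels)


def shannon_entropy(values):
    """Shannon entropy (natural log) of the value frequencies."""
    n = len(values)
    return -sum((k / n) * log(k / n) for k in Counter(values).values())


def cluster_entropy(clusters, labels):
    """Size-weighted mean of the label entropy inside each cluster (0 = pure)."""
    counts = label_counts_per_cluster(clusters, labels)
    n = len(labels)
    total = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((k / size) * log(k / size) for k in c.values())
        total += (size / n) * h
    return total


def nmi(clusters, labels):
    """Mutual information divided by the mean of the two marginal entropies."""
    n = len(labels)
    joint = Counter(zip(clusters, labels))
    n_cluster = Counter(clusters)
    n_label = Counter(labels)
    mi = sum(
        (k / n) * log(k * n / (n_cluster[c] * n_label[l]))
        for (c, l), k in joint.items()
    )
    mean_h = (shannon_entropy(clusters) + shannon_entropy(labels)) / 2
    # Assumption: treat the degenerate single-cluster, single-label case as 1.
    return mi / mean_h if mean_h > 0 else 1.0


if __name__ == "__main__":
    # Toy data: cluster 1 holds two "a" cells and one "b" cell; cluster 2 is pure.
    clusters = [1, 1, 1, 2, 2, 2]
    labels = ["a", "a", "b", "b", "b", "b"]
    print(purity(clusters, labels))
    print(cluster_entropy(clusters, labels))
    print(nmi(clusters, labels))
```

Because these functions only need the cluster assignment and the label of each cell, the same code can be applied to the output of any clustering algorithm, which keeps the comparison consistent.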

stat-hejia · Answer 2 · Wed Oct 14 2020 15:39:54 GMT+0800 (China Standard Time)

It seems that 'diversity' quantifies the effective number of cell states within a population and can also be used to compare the accuracy of clustering algorithms. Is my understanding right? I read your paper and the help document for too-many-cells, but I don't understand how 'diversity' is used to measure the accuracy of clustering. I would be grateful if you could tell me something about it, or point me to where I can learn more.

Gregory Schwartz · Answer 3 · Wed Oct 14 2020 23:02:44 GMT+0800 (China Standard Time)

Yes, diversity can be used to compare. Diversity of order 1, for instance, is a transformation of Shannon entropy which translates it to a more biological context. I recommend reading https://onlinelibrary.wiley.com/doi/10.1111/j.2006.0030-1299.14714.x to understand the important distinction. We used more traditional comparison measures in the paper to make it more familiar. If you want to use another measure, however, you would have to calculate it yourself from the clustering output, although too-many-cells is more about separating than stopping, as the visualization can guide your chosen cluster size.
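To make the relationship concrete: diversity of order 1 is the exponential of Shannon entropy, so a leaf containing a single label has diversity 1, and a leaf with k equally frequent labels has diversity k. The Python sketch below uses the standard effective-number formula for a general order q. It is only an illustration, not the too-many-cells implementation, and the function name and argument are hypothetical.

```python
from collections import Counter
from math import exp, log


def diversity(labels, order=1.0):
    """Effective number of labels (diversity of the given order) in one node.

    For order 1 this is exp(Shannon entropy); for other orders q it is
    (sum of p_i ** q) ** (1 / (1 - q)), where p_i are the label proportions.
    """
    n = len(labels)
    proportions = [k / n for k in Counter(labels).values()]
    if abs(order - 1.0) < 1e-12:
        # Order 1 is defined as the limit q -> 1 of the general formula.
        return exp(-sum(p * log(p) for p in proportions))
    return sum(p ** order for p in proportions) ** (1.0 / (1.0 - order))


if __name__ == "__main__":
    print(diversity(["T cell"] * 4))                 # a pure leaf
    print(diversity(["T cell", "B cell"] * 2))       # an evenly mixed leaf
```

Under this reading, a clustering whose leaves all have label diversity near 1 separates the known labels well, which is the sense in which diversity can serve as a comparison measure.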

stat-hejia · Answer 4 · Sun Oct 18 2020 09:42:31 GMT+0800 (China Standard Time)

I studied the literature you recommended and gained a preliminary understanding of the relationship between diversity and entropy.
Thanks a lot for your help!