About the number of eigenvectors

Question

yuntianzhishang opened this issue 3 months ago

Hello,

Thank you for developing a very powerful package for studying developmental questions! I am using Palantir to analyze my data, and I have two questions about the code in your Palantir sample notebook.
(1) dm_res = palantir.utils.run_diffusion_maps(ad, n_components=5). Does "n_components" here mean the number of principal components (PCs) used for the diffusion map analysis?
(2) ms_data = palantir.utils.determine_multiscale_space(ad). You explained: "If you are specifying the number of eigen vectors manually in the above step, please ensure that the specified parameter is > 2." How should I determine a reasonable number of eigenvectors? Could I draw a cumulative explained variance plot for the eigenvectors and choose their number the same way we choose the number of PCs for downstream analyses such as tSNE or UMAP? If so, should I choose the number of eigenvectors that explains 80-90% of the variance in the data?

Thank you very much!

Dominik J. Otto · Answer 1 · Fri Mar 01 2024 03:34:44 GMT+0800 (China Standard Time)

Hello @yuntianzhishang,

Thank you for reaching out and for your kind words about Palantir. I'm glad to hear that you're finding the package useful for your developmental studies. Let's dive into your questions:

n_components in Diffusion Maps:
- The n_components parameter in palantir.utils.run_diffusion_maps(ad, n_components=5) specifies the number of diffusion components (eigenvectors of the transition matrix) to compute, not the number of principal components. The diffusion components are derived from the data to capture its intrinsic geometry and are stored under ad.obsm["DM_EigenVectors"].
- To change the number of principal components used, you would need to change the PCA representation stored in ad.obsm["X_pca"] directly. That step is separate from the diffusion map computation and precedes it.
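To make the distinction concrete, here is a toy NumPy sketch. It is not Palantir's implementation: it assumes a plain Gaussian kernel with a median bandwidth on synthetic data, and the names toy_diffusion_map and n_pcs are hypothetical. It only shows that the number of principal components and the number of diffusion components are set in two separate steps.

```python
import numpy as np


def toy_diffusion_map(pca_coords, n_components=5):
    """Toy diffusion map (illustration only, NOT Palantir's kernel).

    Returns the top `n_components` eigenvectors and eigenvalues of the
    row-normalized transition matrix T = D^-1 K built from a Gaussian kernel K.
    """
    # Pairwise squared Euclidean distances between cells in PCA space.
    sq = np.sum(pca_coords ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * pca_coords @ pca_coords.T, 0.0)

    # Gaussian kernel; the median bandwidth is an assumption made for this sketch.
    kernel = np.exp(-d2 / np.median(d2))

    # S = D^-1/2 K D^-1/2 is symmetric and has the same eigenvalues as T.
    deg = kernel.sum(axis=1)
    sym = kernel / np.sqrt(np.outer(deg, deg))
    evals, evecs = np.linalg.eigh(sym)

    # Keep the largest eigenvalues; map eigenvectors of S back to those of T.
    order = np.argsort(evals)[::-1][:n_components]
    return evecs[:, order] / np.sqrt(deg)[:, None], evals[order]


# Synthetic "expression" matrix: 200 cells x 50 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Step 1: PCA. The number of principal components is chosen here.
n_pcs = 20
Xc = X - X.mean(axis=0)
U, s, _ = np.linalg.svd(Xc, full_matrices=False)
X_pca = U[:, :n_pcs] * s[:n_pcs]

# Step 2: diffusion map on the PCA coordinates. n_components is chosen here.
eigvecs, eigvals = toy_diffusion_map(X_pca, n_components=5)

print(X_pca.shape)    # cells x principal components
print(eigvecs.shape)  # cells x diffusion components
```

Changing n_pcs alters the input to the diffusion map, while changing n_components only alters how many eigenvectors are returned from it.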
Determining the Number of Eigenvectors:
- Selecting the number of eigenvectors for the multiscale space is difficult and should be guided by biological significance rather than by purely statistical metrics. An eigengap heuristic, analogous to the one used for PCA, can be applied by inspecting ad.uns["DM_EigenValues"], or you can rely on the automatic selection in palantir.utils.determine_multiscale_space. Either way, the cutoff may not cleanly separate signal from noise, because the two blend continuously.
- A practical approach is to plot the diffusion components on the UMAP with palantir.plot.plot_diffusion_components(ad) and judge qualitatively which components reflect meaningful biological processes. Because Palantir's downstream analyses are robust to the exact number of diffusion components, the choice can be somewhat flexible.
- A cumulative explained variance plot is common practice for choosing the number of components in PCA, but applying it to diffusion maps with a fixed target (such as 80-90% of the variance) is less straightforward and may not indicate that biologically relevant variation has been captured. PCA and diffusion maps have different objectives: PCA maximizes variance, whereas diffusion maps aim to reveal the manifold structure of the data.
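The eigengap idea mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up eigenvalues, not necessarily the exact rule Palantir applies internally; the helper name eigengap_n_eigs and its min_eigs argument are hypothetical, with min_eigs=3 reflecting the notebook's "> 2" requirement.

```python
import numpy as np


def eigengap_n_eigs(eigenvalues, min_eigs=3):
    """Hypothetical helper: number of leading eigenvalues before the largest gap.

    Only cut points that keep at least `min_eigs` eigenvalues are considered.
    """
    # Sort in descending order so that gaps are non-negative.
    vals = np.sort(np.ravel(np.asarray(eigenvalues, dtype=float)))[::-1]
    if vals.size <= min_eigs:
        return int(vals.size)

    # gaps[i] is the drop between vals[i] and vals[i + 1];
    # cutting at gap i keeps the first i + 1 eigenvalues.
    gaps = vals[:-1] - vals[1:]
    start = min_eigs - 1
    return int(np.argmax(gaps[start:]) + start + 1)


# Made-up eigenvalues with a clear drop after the third one.
toy_eigenvalues = np.array([1.0, 0.95, 0.9, 0.6, 0.55, 0.5])
n_eigs = eigengap_n_eigs(toy_eigenvalues)
print(n_eigs)

# With Palantir one might then try something like the following, assuming
# determine_multiscale_space accepts an n_eigs keyword:
#   n_eigs = eigengap_n_eigs(ad.uns["DM_EigenValues"])
#   ms_data = palantir.utils.determine_multiscale_space(ad, n_eigs=n_eigs)
```

Treat the result as a starting point and check it against the component plots, since the largest gap does not always coincide with the boundary between signal and noise.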

I hope this clarifies your questions and aids you in your analysis. Should you have any more inquiries or require further assistance, please don't hesitate to ask.

Can Li · Answer 2 · Mon Mar 04 2024 13:27:20 GMT+0800 (China Standard Time)

Thank you very much for the detailed explanation! It's very helpful for my analysis!