Question: feature selection for network plots, UMAP embedding

kstrotjohann opened this issue 2 years ago

Hey,

A question perhaps related to issue #32:
How does Pando select TFs and target genes to include in both UMAP and non-UMAP network graphs? I've observed that the resulting graphs contain more nodes and edges when I use the features argument to explicitly "feed" all TFs and targets present in the modules@meta to the get_network_graph() function, even when choosing umap_method = "none". Could you explain this behavior? In addition, could you provide an explanation for a layman on how the UMAP embedding works / can be interpreted? I've read the paper, but find this aspect difficult to understand.

I've done enrichment analysis based on the Pando-inferred TF modules to estimate TF activities in my cell populations. However, many of the highly scoring TFs do not even appear in the network graphs.

Unrelated: Can you provide an estimate on roughly how many cells are needed to build a robust network? Would it make sense to subset a Seurat object by cell type to infer cell type/region-specific GRNs, instead of incorporating region-specific accessibility profiles and TF expression into the base GRN like you did?

Many thanks!

jonas · Answer 1 · Wed Feb 08 2023 23:29:51 GMT+0800 (China Standard Time)

Hey, the UMAP embedding is calculated from the incoming connections of each node as well as coexpression, so if two genes are regulated by similar TFs and are expressed in the same cell types, they end up close together on the UMAP. Because not all genes have incoming connections (some TFs, for instance, can be left without regulators), some will not be in the UMAP.
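To make that idea concrete, here is a minimal Python sketch. It is not Pando's actual implementation: the gene names, coefficients, and expression values are invented, and weighting each regulatory coefficient by the gene-TF expression correlation is an assumption chosen only to combine "incoming connections" with "coexpression" in one feature matrix. It shows two things: genes sharing regulators get similar feature vectors (so an embedding would place them close together), and a gene with no incoming edges has nothing to embed and drops out.

```python
import numpy as np

# Hypothetical toy network: all names and numbers are invented for illustration.
gene_names = ["GENE_A", "GENE_B", "GENE_C", "TF_X", "TF_Y"]
tf_names = ["TF_X", "TF_Y"]

# coef[i, j]: inferred effect of TF j on gene i (0 means no edge).
coef = np.array([
    [0.8, 0.0],   # GENE_A <- TF_X
    [0.7, 0.0],   # GENE_B <- TF_X
    [0.0, 0.9],   # GENE_C <- TF_Y
    [0.0, 0.0],   # TF_X has no regulators
    [-0.4, 0.0],  # TF_Y <- TF_X (repression)
])

# expr[c, i]: expression of gene i in cell c (6 cells).
expr = np.array([
    [1, 2, 6, 1, 5],
    [2, 4, 5, 2, 6],
    [3, 6, 4, 3, 4],
    [4, 8, 3, 4, 3],
    [5, 10, 2, 5, 1],
    [6, 13, 1, 7, 2],
], dtype=float)


def embedding_inputs(coef, expr, gene_names, tf_names):
    """Build per-gene feature vectors from incoming edges and coexpression."""
    # Genes without any incoming edge cannot be placed and are dropped.
    has_incoming = np.any(coef != 0, axis=1)
    # Gene-by-gene Pearson correlation across cells.
    corr = np.corrcoef(expr, rowvar=False)
    tf_idx = [gene_names.index(tf) for tf in tf_names]
    # Assumption: weight each coefficient by the gene-TF correlation.
    weighted = coef * corr[:, tf_idx]
    kept = [g for g, k in zip(gene_names, has_incoming) if k]
    return kept, weighted[has_incoming]


def cosine_similarity(features):
    """Pairwise cosine similarity between feature rows."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


kept, features = embedding_inputs(coef, expr, gene_names, tf_names)
sim = cosine_similarity(features)
print(kept)  # ['GENE_A', 'GENE_B', 'GENE_C', 'TF_Y']
```

In this toy case `TF_X` is missing from `kept` because it has no regulators, and `GENE_A` and `GENE_B` (both regulated by `TF_X`) are far more similar to each other than either is to `GENE_C`. A real embedding would pass such a feature matrix to UMAP instead of inspecting similarities directly.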

As for the number of cells, more is generally better IMO, but depending on the complexity of the dataset, at least 1-2k would be good. It's certainly possible to infer the network in each cell type separately, although in my opinion it's preferable to do the inference on the full dataset instead. Since the networks are built from covariance relationships between genes and peaks, the more variance the better :)
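The variance argument can be illustrated with a small simulation. This is a simplified sketch with made-up numbers, not Pando's model: it uses a single hypothetical TF and target gene rather than genes and peaks, and assumes the same linear relationship holds in both cell types. The TF differs strongly between two cell types but varies little within each one, so the TF-target relationship is much easier to detect on the full dataset than on a single cell type.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # cells per (hypothetical) cell type

# TF expression: large difference between cell types, small spread within each.
tf_type1 = rng.normal(1.0, 0.5, n)
tf_type2 = rng.normal(5.0, 0.5, n)
tf = np.concatenate([tf_type1, tf_type2])

# Target gene follows the TF linearly, plus measurement noise.
target = 2.0 * tf + rng.normal(0.0, 1.0, 2 * n)


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


full_r = pearson(tf, target)            # both cell types together
subset_r = pearson(tf[:n], target[:n])  # first cell type only

print(f"full dataset r = {full_r:.2f}, single cell type r = {subset_r:.2f}")
```

With these settings the correlation on the full dataset is clearly higher than within one cell type, because the between-cell-type variance of the TF dominates the noise. If regulation truly differs between cell types, this simple picture no longer applies.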