Gene networks inferred from single-cell genomic data (scRNA-seq, scATAC-seq, multi-omics and Perturb-seq) are useful for discovering context-specific biological mechanisms. These networks can be viewed as data-driven hypotheses of gene interactions. We aim to implement a flexible framework to evaluate the plausibility of networks inferred by computational methods. The assessment is broken down into themes such as goodness of fit (ability to explain the data), co-regulation and mechanistic interactions. Under each theme, multiple evaluation tasks are conceptualised and implemented using appropriate statistical tests.
Gene network inference methods are further classified as gene program inference, gene regulatory network (GRN) inference and enhancer-gene (E2G) linking methods. An overview of each method class is given below.
Gene program inference methods are "factor analysis" style latent variable models that decompose single-cell data into a cell x program score matrix and a program x feature score matrix.
A few examples of such models are cNMF, LDVAE and f-scLVM. More advanced models, such as Spectra and muVI, consider additional inputs or work on multi-omic data.
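
As a minimal sketch of the factor-analysis idea, the snippet below factorises a toy cells x genes matrix into cell x program scores and program x gene scores using plain NMF with multiplicative updates. It is a stand-in for illustration only and is not cNMF or any other method listed above; the function name `nmf`, the toy matrix sizes and the choice of 3 programs are assumptions.

```python
import numpy as np


def nmf(X, n_programs, n_iter=300, seed=0, eps=1e-9):
    """Plain NMF (multiplicative updates): X ~ scores @ loadings, all non-negative."""
    rng = np.random.default_rng(seed)
    n_cells, n_genes = X.shape
    scores = rng.random((n_cells, n_programs)) + eps    # cell x program
    loadings = rng.random((n_programs, n_genes)) + eps  # program x gene
    for _ in range(n_iter):
        # Update program x gene scores, then cell x program scores.
        loadings *= (scores.T @ X) / (scores.T @ scores @ loadings + eps)
        scores *= (X @ loadings.T) / (scores @ loadings @ loadings.T + eps)
    return scores, loadings


# Toy non-negative data generated from 3 hypothetical programs.
rng = np.random.default_rng(0)
X = rng.gamma(2.0, 1.0, size=(100, 3)) @ rng.gamma(2.0, 1.0, size=(3, 20))

cell_by_program, program_by_gene = nmf(X, n_programs=3)
rel_error = np.linalg.norm(X - cell_by_program @ program_by_gene) / np.linalg.norm(X)
print(cell_by_program.shape, program_by_gene.shape, round(float(rel_error), 3))
```

The two returned matrices correspond to the "cell x program" and "program x feature" scores that the evaluations below operate on.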
Gene regulatory networks are tri-partite graphs connecting transcription factors (TFs), candidate regulatory elements (CREs) and genes. GRN inference methods aim to model the regulatory process governing gene expression and are typically trained using expression and genomic data together. An example of such a method is SCENIC+.
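
To make the tri-partite structure concrete, here is a small sketch that stores TF-CRE and CRE-gene edges as plain lists and collapses them into TF-gene links. The edge lists, identifiers and the helper `collapse_to_tf_gene` are hypothetical and do not reflect the output format of SCENIC+ or of this repository.

```python
from collections import defaultdict

# Hypothetical toy edges; all names are illustrative only.
tf_to_cre = [("TF_A", "cre_1"), ("TF_A", "cre_2"), ("TF_B", "cre_2")]
cre_to_gene = [("cre_1", "gene_x"), ("cre_2", "gene_y"), ("cre_2", "gene_z")]


def collapse_to_tf_gene(tf_to_cre, cre_to_gene):
    """Follow TF -> CRE -> gene paths and return TF -> set of target genes."""
    genes_by_cre = defaultdict(set)
    for cre, gene in cre_to_gene:
        genes_by_cre[cre].add(gene)
    targets = defaultdict(set)
    for tf, cre in tf_to_cre:
        targets[tf] |= genes_by_cre[cre]
    return dict(targets)


tf_targets = collapse_to_tf_gene(tf_to_cre, cre_to_gene)
print(tf_targets)
```

Collapsing the middle layer in this way yields the kind of TF-gene links that a co-regulation evaluation could compare against the genes in a program.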
Enhancer-gene linking methods aim to identify enhancer sequences and link them to genes. The methods require genomic data such as ATAC-seq but may integrate expression data as well. They aim to quantify the regulatory impact of enhancer sequences on downstream genes (and gene expression). Examples of such methods are scE2G, ABC and Enhlink.
- Single-cell omics data and outputs from computational inference are stored in the mudata format (see mudata documentation).
- Templates for implementing evaluations or method-wrappers can be found in `src/evaluation/` and `src/inference/` respectively.
- Evaluations and methods implemented under `src/` are stitched together in an evaluation pipeline `smk/`.

Criterion | Implementation | External resource | Interpretation | Caveats |
---|---|---|---|---|
Goodness of fit | Explained variance per program | None | A program explaining more variance in the data might represent dominant biological variation. | Technical variation might be the highest source of variance (e.g. batch effects). |
Variation across category levels | Kruskal-Wallis non-parametric ANOVA + Dunn's post hoc test | None | If program scores vary between batch levels, the program is likely modelling technical noise. Alternatively, if program scores vary between levels of a biological category such as cell type or condition, the program is likely modelling a biological process specific to that category. | If batches are confounded with biological conditions, the relative contributions of technical and biological variation cannot be decomposed. |
Gene-set enrichment | GSEA using program x feature scores | MSigDB, Enrichr | If a program is significantly associated with a gene set, that gene set could explain the biological process the program represents. | |
Motif enrichment | Pearson correlation of motif counts per gene (promoter or enhancer) and program x gene scores | HOCOMOCO v12 | If genes with high contributions to a program are also enriched for the same enhancer/promoter motifs, they could be co-regulated. | A biological pathway could involve genes that are regulated differently but still contribute to a common function. |
Co-regulation | | TF-gene links | If a program contains genes that are regulated by TFs that form a regulatory module, this indicates mechanistic commonality. | |
Perturbation sensitivity | | Perturbation data | Shifts in the cell x program score distribution that are greater than expected from the direct effect of the perturbation on genes in the program could indicate hierarchical relationships between genes in the program. | Expression of genes upstream of the perturbed gene is unlikely to be affected. |
Cross-modality prediction | | Multi-omic data | If program x feature scores learnt from one modality lead to a good fit for another modality/dataset, this indicates a robust biological connection between genes in the program. | Technical variation between datasets and mapping features between modalities are practical challenges. |
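
The first two criteria in the table can be sketched as follows, assuming a cells x genes matrix `X`, cell x program `scores`, program x gene `loadings` and one categorical label per cell. This is an illustrative sketch on simulated data, not the implementation in `src/evaluation/`: the function names are hypothetical, explained variance is defined here as the drop in total per-gene variance after removing one program's rank-1 reconstruction (one of several possible definitions), and Dunn's post hoc test is left out so that the example depends only on NumPy and SciPy.

```python
import numpy as np
from scipy.stats import kruskal


def explained_variance_per_program(X, scores, loadings):
    """Fraction of total per-gene variance removed by each program's rank-1 term."""
    total = X.var(axis=0).sum()
    ev = []
    for k in range(scores.shape[1]):
        residual = X - np.outer(scores[:, k], loadings[k])
        ev.append(1.0 - residual.var(axis=0).sum() / total)
    return np.array(ev)


def kruskal_per_program(scores, labels):
    """Kruskal-Wallis p-value per program across the levels of a category."""
    levels = np.unique(labels)
    pvals = []
    for k in range(scores.shape[1]):
        groups = [scores[labels == level, k] for level in levels]
        pvals.append(kruskal(*groups).pvalue)
    return np.array(pvals)


# Simulated data: program 0 is shifted in batch2, program 1 is not.
rng = np.random.default_rng(0)
labels = np.repeat(np.array(["batch1", "batch2"]), 50)
scores = rng.normal(size=(100, 2))
scores[labels == "batch2", 0] += 3.0
loadings = rng.normal(size=(2, 10))
X = scores @ loadings + 0.1 * rng.normal(size=(100, 10))

ev = explained_variance_per_program(X, scores, loadings)
pvals = kruskal_per_program(scores, labels)
print("explained variance:", ev)
print("Kruskal-Wallis p-values:", pvals)
```

In this toy setup the batch-shifted program receives a very small p-value, which under the interpretation in the table would flag it as likely technical variation.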