Pre-processing data?

Question

mibo1996 opened this issue 3 years ago

mibo1996 commented 3 years ago

Hi,

I was wondering how pre-processing affects the data input for InferCNVpy.

I know that the data must be filtered for low-quality cells, normalized, and log-transformed before running InferCNVpy. Does it also affect the results if, before running InferCNVpy, I compute highly variable genes, regress out unwanted variables (e.g. total counts and percentage of mitochondrial genes), scale the data, run PCA, compute neighbors and UMAP, cluster, and run differential expression analysis?

I am mainly wondering whether it matters where in the workflow InferCNVpy comes in and, if it does, where it should go: e.g. always before/after {these steps}, or whether it doesn't matter.

Thank you

Gregor Sturm · Answer 1 · Thu Aug 12 2021 15:25:08 GMT+0800 (China Standard Time)

Hi @mibo1996,

That's a good question, and I can only answer it partly.

highly variable genes

infercnvpy ignores the highly variable gene annotation. Subsetting to highly variable genes would lead to significantly worse results, as you lose a lot of information about genomic positions.
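To illustrate the point about genomic positions, here is a minimal toy sketch in plain Python. It assumes that expression is smoothed over a running window of neighboring genes ordered by genomic position, as described in the computation steps linked further down. The gene positions, the 10% highly-variable fraction, and the window of 100 genes are made-up illustrative values, and `median_window_span` is a hypothetical helper, not part of infercnvpy.

```python
import random


def median_window_span(positions, window_size):
    """Median genomic span (in bp) covered by windows of `window_size` consecutive genes."""
    positions = sorted(positions)
    spans = sorted(
        positions[i + window_size - 1] - positions[i]
        for i in range(len(positions) - window_size + 1)
    )
    return spans[len(spans) // 2]


rng = random.Random(0)

# Toy chromosome: 2,000 genes placed at random on 100 Mb (made-up numbers).
all_genes = [rng.randrange(100_000_000) for _ in range(2000)]

# Pretend that 10% of the genes were kept after subsetting to highly variable genes.
hvg_only = rng.sample(all_genes, 200)

span_all = median_window_span(all_genes, window_size=100)
span_hvg = median_window_span(hvg_only, window_size=100)

print(f"100-gene window, all genes: ~{span_all / 1e6:.1f} Mb")
print(f"100-gene window, HVGs only: ~{span_hvg / 1e6:.1f} Mb")
```

With far fewer genes per chromosome, each window has to stretch over a much larger region, so the positional resolution of any window-based signal drops accordingly.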

running PCA, computing neighbors, UMAP, clustering, and differential expression analysis

This should have no effect either, as infercnvpy only uses the information in adata.X, which is left untouched by these functions.
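If you want to check for yourself whether a given step touches adata.X, one option is to compare a checksum of the matrix before and after the step. The sketch below is only an illustration: `fingerprint` is a hypothetical helper, and a small `array` from the standard library stands in for adata.X so the snippet runs without scanpy installed.

```python
import hashlib
from array import array


def fingerprint(matrix):
    """Return a SHA-256 digest of the raw values stored in a matrix.

    Handles objects exposing `tobytes()` (e.g. a dense NumPy array) and
    CSR/CSC-like sparse matrices exposing `data`, `indices` and `indptr`.
    """
    digest = hashlib.sha256()
    if hasattr(matrix, "indptr"):
        # Sparse matrix: hash the three underlying arrays.
        for part in (matrix.data, matrix.indices, matrix.indptr):
            digest.update(part.tobytes())
    else:
        # Dense matrix: hash the values directly.
        digest.update(matrix.tobytes())
    return digest.hexdigest()


# Stand-in for adata.X (assumption: the real object would be a NumPy or SciPy matrix).
toy_X = array("d", [0.0, 1.5, 2.0, 0.0])

before = fingerprint(toy_X)
# ... steps that only add annotations (PCA, neighbors, UMAP, clustering) would run here ...
unchanged = fingerprint(toy_X) == before

toy_X[1] = 9.9  # simulate a step that rewrites the values in place, such as scaling
changed = fingerprint(toy_X) != before

print("unchanged after annotation-only steps:", unchanged)
print("changed after in-place modification:", changed)
```

On a real object the same pattern would be `before = fingerprint(adata.X)`, then the step in question, then a comparison against `fingerprint(adata.X)`.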

regressing out unwanted variables (e.g. total counts and percentage of mitochondrial genes), scaling

This will have an effect, since both steps modify adata.X, but I don't know to what extent. I would suspect that it mostly changes the scale of the results and that they remain qualitatively highly similar. Let me know how it goes if you try it out!
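As a rough illustration of that suspicion for scaling only (not for regressing out covariates), the toy sketch below z-scores each gene and compares the tumor-minus-reference difference before and after. All values are invented, `zscore` and `tumor_minus_reference` are hypothetical helpers, and the real method applies further steps (for example clipping and smoothing) that this sketch ignores, so it does not predict actual infercnvpy output.

```python
from statistics import mean, pstdev


def zscore(values):
    """Scale one gene to zero mean and unit variance across all cells."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def tumor_minus_reference(values, n_ref=3):
    """Mean of the tumor cells minus mean of the first `n_ref` reference cells."""
    return mean(values[n_ref:]) - mean(values[:n_ref])


# Toy log-normalized expression: the first 3 cells are reference, the last 3 are tumor.
genes = {
    "gene_gain": [1.0, 1.2, 0.8, 2.0, 2.2, 1.8],
    "gene_loss": [3.0, 3.5, 2.5, 1.0, 0.5, 1.5],
}

results = {}
for name, values in genes.items():
    raw = tumor_minus_reference(values)
    scaled = tumor_minus_reference(zscore(values))
    results[name] = (raw, scaled)
    print(f"{name}: raw={raw:+.2f}, scaled={scaled:+.2f}")
```

Because z-scoring divides each gene by a positive standard deviation, the direction of the difference is preserved while its magnitude changes from gene to gene. That matches the expectation of "different scale, similar pattern", but it also means that relative magnitudes between genes are not preserved.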

This description of the computation steps should also help you understand what data is used and how:
https://icbi-lab.github.io/infercnvpy/infercnv.html#computation-steps

Cheers,
Gregor

Gregor Sturm · Answer 2 · Wed Sep 08 2021 16:18:43 GMT+0800 (China Standard Time)

Feel free to reopen if there are follow-up questions.