Is it possible to identify marker genes?

NikTuzov opened this issue 5 years ago

Hello:

Suppose we ran a clustering method on bbknn output and identified a few clusters that hopefully represent distinct cell types. Do I get it right that it's impossible to identify marker genes for those clusters? BBKNN doesn't alter the original data or PCs obtained from the original data, so we never obtain the gene expression adjusted for batch effect. If I am right, is there a method to adjust the original data for batch effect using bbknn output?

Thanks in advance,
Nik

Krzysztof Polanski commented 5 years ago

It isn't.

Krzysztof Polanski · Answer 1 · Fri Jan 11 2019 17:17:07 GMT+0800 (China Standard Time)

Dear Nik,

You bring up an interesting point. True, the expression values still contain the batch effect, but the actual cell grouping in the clusters is done in a batch aligned space. From my experience, this diffuses the technical effect in marker identification to some degree, and makes it possible to annotate the populations. I asked another guy in the lab who's also used BBKNN, and he seems to second this sentiment.

Sincerely,
Krzysztof
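
A toy calculation may help show why clustering in a batch-aligned space can diffuse the batch effect even though the expression values are left uncorrected. This is only a sketch with made-up numbers, not BBKNN code: it assumes a single gene, a purely additive batch offset, and a simple "cluster mean minus rest mean" marker score. All names (`BATCH_OFFSET`, `TRUE_LEVEL`, `mean_difference`) are hypothetical.

```python
from statistics import mean

# Assumed toy model: observed value = true cluster level + additive batch offset.
BATCH_OFFSET = {"batch1": 0.0, "batch2": 3.0}   # made-up technical effect
TRUE_LEVEL = {"A": 5.0, "B": 1.0}               # made-up biology: true difference is 4.0


def observed(cluster, batch):
    """Uncorrected expression of the gene for one cell."""
    return TRUE_LEVEL[cluster] + BATCH_OFFSET[batch]


def mean_difference(cells, cluster):
    """Mean expression inside `cluster` minus mean expression in all other cells."""
    inside = [observed(c, b) for c, b in cells if c == cluster]
    outside = [observed(c, b) for c, b in cells if c != cluster]
    return mean(inside) - mean(outside)


# Clusters found in a batch-aligned space: each cluster mixes both batches.
mixed = [("A", "batch1"), ("A", "batch2"), ("B", "batch1"), ("B", "batch2")]

# Clusters confounded with batch: cluster A is all batch1, cluster B all batch2.
confounded = [("A", "batch1"), ("A", "batch1"), ("B", "batch2"), ("B", "batch2")]

mixed_diff = mean_difference(mixed, "A")            # offset cancels between the groups
confounded_diff = mean_difference(confounded, "A")  # offset masks the true difference

print(mixed_diff, confounded_diff)
```

When both groups contain the same mix of batches, the offset contributes equally to both means and drops out of the comparison. Real data will rarely be this balanced, so the cancellation is partial at best, which matches the "to some degree" above.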

NikTuzov · Answer 2 · Fri Jan 11 2019 23:11:40 GMT+0800 (China Standard Time)

I think my question was ill-posed to begin with. Suppose there is no batch effect, but there are 10,000 genes in the study. Even if we can clearly define the cell types in terms of 2-3 derived features (such as PCs), each derived feature is a function of the expression of many original genes, not just 1 or 2. In other words, we can come up with "marker derived features" but not with "marker genes".

It appears that whenever marker genes are used in BBKNN, CCA, and similar papers, the markers are known in advance and are used only for the purpose of sanity check. Please let me know what you think.

Regards,
Nik

Krzysztof Polanski · Answer 3 · Sat Jan 12 2019 00:57:03 GMT+0800 (China Standard Time)

At this point I'm unsure what you're asking, so I'll just address both points you brought up now:

1. Yes, each PC is a linear combination of the original genes. The genes that drive the PCs are not examined anywhere in BBKNN or in downstream analysis, and they have no impact on marker identification.
2. Batch correction method papers tend to focus on showing that their algorithm works and is in some way preferable to other available approaches. As such, the corrected output is annotated with known populations using canonically established markers. As a quick tie-in to my original response, notice that those markers are localised and allow the populations' identities to be determined.
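
To make the first point concrete, here is a minimal sketch of how marker genes can be ranked from cluster labels alone. It assumes the labels come from clustering on the BBKNN graph and that the scores are computed per gene on the original expression matrix, so the PCs never enter the calculation. The function `rank_markers`, the expression values, and the use of a plain mean difference are illustrative assumptions; analysis toolkits typically apply a proper statistical test instead (Scanpy's `rank_genes_groups`, for example).

```python
from statistics import mean


def rank_markers(expression, gene_names, labels, cluster):
    """Rank genes by mean expression in `cluster` minus mean in all other cells.

    expression: list of per-cell rows, one value per gene (hypothetical layout)
    labels:     one cluster label per cell, e.g. from clustering a BBKNN graph
    """
    in_rows = [row for row, lab in zip(expression, labels) if lab == cluster]
    out_rows = [row for row, lab in zip(expression, labels) if lab != cluster]
    scores = {
        gene: mean(r[j] for r in in_rows) - mean(r[j] for r in out_rows)
        for j, gene in enumerate(gene_names)
    }
    # Highest score first: genes most enriched in the cluster of interest.
    return sorted(scores, key=scores.get, reverse=True)


# Made-up values for four cells and three genes, for illustration only.
gene_names = ["CD3E", "CD19", "ACTB"]
expression = [
    [5.0, 0.0, 3.0],
    [4.0, 1.0, 3.0],
    [0.0, 6.0, 3.0],
    [1.0, 5.0, 3.0],
]
labels = ["0", "0", "1", "1"]

print(rank_markers(expression, gene_names, labels, "0"))
print(rank_markers(expression, gene_names, labels, "1"))
```

The only inputs are per-gene expression values and the cluster assignment, so the result is a list of individual genes rather than derived features.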

NikTuzov · Answer 4 · Fri Jan 18 2019 23:07:01 GMT+0800 (China Standard Time)

Thanks for replying. Do you have any ideas as to how to obtain the batch adjusted expression data with BBKNN?