Teichlab / bbknn

Batch balanced KNN

Is it possible to identify marker genes?

NikTuzov opened this issue

Hello:

Suppose we ran a clustering method on BBKNN output and identified a few clusters that hopefully represent distinct cell types. Am I right that it's impossible to identify marker genes for those clusters? BBKNN doesn't alter the original data or the PCs obtained from the original data, so we never obtain gene expression adjusted for the batch effect. If I am right, is there a method to adjust the original data for the batch effect using BBKNN output?

Thanks in advance,
Nik

Dear Nik,

You bring up an interesting point. True, the expression values still contain the batch effect, but the actual grouping of cells into clusters is done in a batch-aligned space. In my experience, this diffuses the technical effect in marker identification to some degree, and makes it possible to annotate the populations. I asked another guy in the lab who's also used BBKNN, and he seconds this.
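
A minimal sketch of that idea, not BBKNN's or Scanpy's own code: the cluster labels are assumed to come from clustering on the batch-balanced graph, and markers are then ranked on the uncorrected expression values. The function `rank_markers`, the toy matrix, and the gene names are all hypothetical; the score is a plain Welch t-statistic, chosen only for brevity.

```python
from math import sqrt
from statistics import mean, variance


def rank_markers(expression, labels, cluster, gene_names):
    """Rank genes by a Welch t-statistic: cells in `cluster` vs. all other cells.

    expression: list of per-cell lists (cells x genes), NOT batch corrected.
    labels: one cluster label per cell (assumed to come from the BBKNN graph).
    """
    inside = [row for row, lab in zip(expression, labels) if lab == cluster]
    outside = [row for row, lab in zip(expression, labels) if lab != cluster]
    scores = []
    for g, name in enumerate(gene_names):
        a = [row[g] for row in inside]
        b = [row[g] for row in outside]
        se = sqrt(variance(a) / len(a) + variance(b) / len(b))
        t = (mean(a) - mean(b)) / se if se > 0 else 0.0
        scores.append((name, t))
    # Highest t first: genes most over-expressed in the cluster of interest.
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Toy data: GeneA marks c0, GeneB marks c1, GeneC only carries a batch shift.
genes = ["GeneA", "GeneB", "GeneC"]
expression = [
    [5.0, 1.0, 1.0], [6.0, 1.2, 1.1],  # c0, batch 1
    [5.5, 0.8, 3.0], [6.5, 1.1, 3.2],  # c0, batch 2
    [1.0, 5.0, 0.9], [1.2, 6.0, 1.0],  # c1, batch 1
    [0.9, 5.5, 3.1], [1.1, 6.2, 2.9],  # c1, batch 2
]
labels = ["c0"] * 4 + ["c1"] * 4

ranked_c0 = rank_markers(expression, labels, "c0", genes)
print(ranked_c0)
```

In this toy example each cluster mixes cells from both batches, so the batch-shifted GeneC scores close to zero while GeneA stands out for cluster c0. In a Scanpy workflow the analogous step would be a call such as `sc.tl.rank_genes_groups` on the cluster labels.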

Sincerely,
Krzysztof

I think my question was ill-posed to begin with. Suppose there is no batch effect, but there are 10,000 genes in the study. Even if we can clearly define the cell types in terms of 2-3 derived features (such as PCs), each derived feature is a function of the expression of many original genes, not just 1 or 2. In other words, we can come up with "marker derived features" but not with "marker genes".

It appears that whenever marker genes are used in BBKNN, CCA, and similar papers, the markers are known in advance and are used only as a sanity check. Please let me know what you think.

Regards,
Nik

At this point I'm unsure what you're asking, so I'll just address both points you brought up now:

  • Yes, PCs are a linear combination of the original gene space. The actual genes that drive the PCs are not examined anywhere in BBKNN or downstream analysis, and have zero impact on any marker identification.
  • Batch correction method papers tend to focus on showing that their algorithm is operational and in some way preferable to other available approaches. As such, the corrected output is annotated with known populations using canonically established markers. As a quick tie-in to my original response, notice that those markers are localised and allow for the identification of the populations' identities.
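
To make the first point concrete, here is a small sketch with made-up numbers: a PC score is a weighted sum over all (centred) genes, so two cells with very different gene profiles can land on the same score. The loadings, gene means, and the helper `pc_score` are hypothetical and only illustrate the arithmetic; nothing here is taken from BBKNN.

```python
def pc_score(cell, gene_means, loadings):
    """Project one cell onto one PC: a weighted sum over all centred genes."""
    return sum(w * (x - m) for x, m, w in zip(cell, gene_means, loadings))


# Hypothetical values for three genes.
gene_means = [2.0, 2.0, 1.0]
loadings = [0.5, 0.5, 0.25]

cell_a = [4.0, 0.0, 1.0]  # high in the first gene only
cell_b = [0.0, 4.0, 1.0]  # high in the second gene only

score_a = pc_score(cell_a, gene_means, loadings)
score_b = pc_score(cell_b, gene_means, loadings)
print(score_a, score_b)  # same PC score despite different gene profiles
```

This is why marker genes are looked for by comparing per-gene expression between clusters rather than by reading them off the PCs.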

Thanks for replying. Do you have any ideas as to how to obtain batch-adjusted expression data with BBKNN?

It isn't possible. BBKNN only changes the neighbour graph; the expression values are left as they are.