snap-stanford / UCE

UCE is a zero-shot foundation model for single-cell gene expression data

Original log-norm expression space

chrarnold opened this issue · comments

Hi, thanks for the paper, really nice work and a nice read. However, I have a question that I would appreciate your input on:

It is quite hidden, but in Suppl. Figure 1 and Suppl. Table 1 you mention a "log-norm expression" space that you also used for the integration task. Visually, in Suppl. Fig. 1, this also looks very clean and similar to UCE, and the data in Suppl. Table 1 confirm this: 2nd best score overall with the weighting you do there.

My question now is: if just taking the (shared) log-norm expression vectors (you mention this is "only" 5.7k genes) is almost as good as a DL model, why go through the hassle of building and training a DL model in the first place when a much simpler approach seems to give comparable results? What am I missing? I understand that restricting to the "shared" genes is a limitation, but even with 30+ million cells, it seems there are still enough expressed genes left?

Thanks for the question.

The Tabula Sapiens v2 integration is not done on 5.7k genes. It is done on the set of genes that are common across Tabula Sapiens v2, which is 19,567 genes.

The raw data for Tabula Sapiens v2 is very high quality and was processed by the same labs using the same workflows in order to minimize batch effects, so it makes sense that the raw data performs well when it is compared to itself. We used Tabula Sapiens v2 as our main benchmark because we know that it can serve as a true zero-shot benchmark.

The results using raw gene expression space compared tabula lung to the 30 M human cells, and showed a significant improvement in performance (paragraph starting at line 250 on page 11 of the main text).

Hi, thanks for the reply! I am a bit confused... I was talking about Suppl. Fig. 1, which doesn't say anything about "tabula lung", just integration of v2 into the existing pre-trained model, right? I was not talking about Fig. 3, sorry for the potential confusion.

I was referring to line 456 in the manuscript: "In case of the original data representation, the set of 5704 shared genes across all human datasets were used to represent each cell." This is what I thought Suppl. Fig. 1 depicts for the log-norm expression space, but maybe I misunderstood that?

The set of 5704 shared genes is for the comparison in Figure 3. It is not related to Supplementary Figure 1. "All human datasets" means all datasets with human data in the IMA, which is 250+ datasets. Tabula Sapiens v2 would count as one dataset. Sorry for the confusion.
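
For readers unfamiliar with the "shared genes" baseline discussed in this thread, here is a minimal sketch of how such a log-norm expression space could be built: intersect the gene lists of all datasets, log-normalize each cell, and keep only the shared columns. It is an illustration, not the authors' pipeline. The helper names (`shared_genes`, `log_norm`, `restrict_to_genes`), the toy data, the `target_sum=1e4` scaling, and the choice to normalize before subsetting are all assumptions.

```python
import numpy as np

def shared_genes(gene_lists):
    """Return the sorted list of genes present in every dataset."""
    common = set(gene_lists[0])
    for genes in gene_lists[1:]:
        common &= set(genes)
    return sorted(common)

def log_norm(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log1p.

    `target_sum=1e4` is a common convention, assumed here for illustration.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)

def restrict_to_genes(matrix, genes, keep):
    """Select and reorder the columns of `matrix` to follow `keep`."""
    index = {g: i for i, g in enumerate(genes)}
    return matrix[:, [index[g] for g in keep]]

# Toy data: two hypothetical datasets with different gene panels.
genes_a = ["CD3E", "MS4A1", "LYZ", "GNLY"]
genes_b = ["LYZ", "CD3E", "NKG7"]
counts_a = np.array([[5, 0, 1, 4], [0, 3, 7, 0]])
counts_b = np.array([[2, 8, 0], [6, 0, 4]])

common = shared_genes([genes_a, genes_b])
x_a = restrict_to_genes(log_norm(counts_a), genes_a, common)
x_b = restrict_to_genes(log_norm(counts_b), genes_b, common)

# All cells now live in one space with one column per shared gene.
combined = np.vstack([x_a, x_b])
```

The intersection can only shrink as datasets are added, which is consistent with the numbers in this thread: 5704 genes shared across 250+ human datasets versus 19,567 genes within Tabula Sapiens v2 alone.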