Project new dataset to the atlas

zhangnan0107 opened this issue a year ago

Dear authors,

Thanks for sharing this valuable data and workflow. I wonder whether you have a script for projecting a new, user-provided dataset onto the atlas, or whether there is a parameter/argument in the nextflow pipeline that can do such a projection. Many thanks.

Best,
Nan

Gregor Sturm · Answer 1 · Mon Jul 24 2023 01:30:04 GMT+0800 (China Standard Time)

In principle, you can follow the Querying the Human Lung Cell Atlas tutorial from scvi-tools.

The authors of scvi-tools added our atlas to their model hub on Hugging Face. So if you want to use our Lung Cancer Atlas reference (instead of the healthy lung reference from the Human Cell Atlas project used in the tutorial), you can simply download the appropriate model from Hugging Face:

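from scvi.hub import HubModel

# Download the pretrained reference model and its data from the Hugging Face model hub
hubmodel = HubModel.pull_from_huggingface_hub("non-small-cell-lung-cancer")
adata = hubmodel.adata  # reference AnnData object
model = hubmodel.model  # pretrained reference model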

For reference, here is the code that we used to project the data of the "extended" atlas onto the "core" atlas:
https://github.com/icbi-lab/luca/blob/89c4e6109bc723f6958cae7af791398b28e8e422/analyses/36_add_additional_datasets/36_scvi_mapping.py

But you are probably better off following that scvi-tools tutorial.

zhangnan0107 · Answer 2 · Thu Jul 27 2023 16:56:32 GMT+0800 (China Standard Time)

Thank you for the detailed information, which is quite helpful. There are several questions I'd like to clear up since the current tutorial is based on HLCA.

Are there any requirements for the query data when switching to the NSCLC atlas? For example, HLCA has the following requirements:

Using the HLCA requires using Gene IDs for the query data
The query data should include batches in query_data.obs["dataset"]
It’s necessary to run query_data.obs["scanvi_label"] = "unlabeled" so that scvi-tools can properly register the query data.

Would the model training parameters also work with the NSCLC atlas?

surgery_epochs = 500
train_kwargs_surgery = {
    "early_stopping": True,
    "early_stopping_monitor": "elbo_train",
    "early_stopping_patience": 10,
    "early_stopping_min_delta": 0.001,
    "plan_kwargs": {"weight_decay": 0.0},
}

I found that there are multiple cell-type annotations in the NSCLC reference data: "cell_type", "cell_type_predicted", "cell_type_coarse", "cell_type_tumor", "cell_type_major". Which one(s) would you suggest using for label transfer?
For the uncertainty threshold (0.2 for HLCA), do you have a recommended value?

Thanks!

Gregor Sturm · Answer 3 · Mon Jul 31 2023 15:50:45 GMT+0800 (China Standard Time)

Hi,

Are there any requirements for the query data when switching to the NSCLC atlas? For example, HLCA has the following requirements:
Using the HLCA requires using Gene IDs for the query data

The query data should include batches in query_data.obs["dataset"]

It’s necessary to run query_data.obs["scanvi_label"] = "unlabeled" so that scvi-tools can properly register the query data.

The LuCA reference uses gene symbols as identifiers.
I have been using sample as the batch identifier for scvi-tools.
You'll also need to set up a column with cell-type labels. You can use the cell_type column from the reference dataset for that, and you'll need to assign unlabeled to the new data.
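As a rough sketch of the three points above (gene symbols as identifiers, sample as batch key, cell_type set to unlabeled for new cells), the query metadata could be prepared roughly like this. The helpers `prepare_query_obs` and `gene_symbol_overlap` are hypothetical names made up for illustration; they are not part of scvi-tools or the LuCA code, and the column names should be checked against what the downloaded model actually registered.

```python
import pandas as pd


def prepare_query_obs(obs, batch_key="sample", labels_key="cell_type",
                      unlabeled="unlabeled"):
    """Return a copy of the query metadata with every cell marked as unlabeled.

    Assumes (per the answer above) that `sample` is the batch column and
    `cell_type` is the label column; both names are parameters so they can
    be adapted.
    """
    if batch_key not in obs.columns:
        raise KeyError(f"query metadata needs a '{batch_key}' column for the batch")
    out = obs.copy()
    out[labels_key] = unlabeled  # new cells have no annotation yet
    return out


def gene_symbol_overlap(query_genes, reference_genes):
    """Fraction of reference genes that also occur in the query.

    A very low value usually means the query uses a different identifier
    type (e.g. Ensembl gene IDs instead of gene symbols).
    """
    reference = set(reference_genes)
    if not reference:
        raise ValueError("reference gene list is empty")
    return len(reference & set(query_genes)) / len(reference)
```

With AnnData objects this would be used along the lines of `query.obs = prepare_query_obs(query.obs)` and `gene_symbol_overlap(query.var_names, adata.var_names)` before handing the query to scvi-tools.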

Would the model training parameters also work with the NSCLC atlas?

I just went with the defaults and it seemed to work fine. I'm not an expert in these models -- this is something you would need to ask the scvi-tools authors.

I found that there are multiple cell-type annotations in the NSCLC reference data: "cell_type", "cell_type_predicted", "cell_type_coarse", "cell_type_tumor", "cell_type_major". Which one(s) would you suggest using for label transfer?

These are different resolutions, from coarse to fine: cell_type_coarse > cell_type_major > cell_type > cell_type_tumor.
You can forget about cell_type_predicted.

You can use any or all of them for label transfer, depending on your needs.
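To illustrate transferring several annotation levels at once, here is a minimal sketch that runs a plain majority-vote k-nearest-neighbour transfer on latent embeddings, once per label column. This is a simplified stand-in and not necessarily what the scvi-tools tutorial does; `knn_transfer` and `transfer_all` are hypothetical helper names, and the embeddings are assumed to come from the reference model's latent space (for example via `get_latent_representation()`).

```python
from collections import Counter

import numpy as np

# Annotation columns from the reference, coarse to fine
LABEL_COLUMNS = ["cell_type_coarse", "cell_type_major", "cell_type", "cell_type_tumor"]


def knn_transfer(ref_emb, ref_labels, query_emb, k=3):
    """Majority vote over the k nearest reference cells (Euclidean distance).

    Returns the predicted labels and an uncertainty per query cell, defined
    here (an assumption) as 1 - fraction of neighbours that agree.
    """
    ref_emb = np.asarray(ref_emb, dtype=float)
    query_emb = np.asarray(query_emb, dtype=float)
    ref_labels = np.asarray(ref_labels, dtype=object)
    # Dense pairwise squared distances, shape (n_query, n_ref).
    # Fine for a small example; a real atlas needs an approximate kNN index.
    dist = ((query_emb[:, None, :] - ref_emb[None, :, :]) ** 2).sum(axis=2)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    preds, uncertainty = [], []
    for row in neighbours:
        label, votes = Counter(ref_labels[row]).most_common(1)[0]
        preds.append(label)
        uncertainty.append(1.0 - votes / k)
    return np.array(preds, dtype=object), np.array(uncertainty)


def transfer_all(ref_emb, ref_obs, query_emb, columns=LABEL_COLUMNS, k=3):
    """Run the transfer for every requested label column present in ref_obs."""
    return {
        col: knn_transfer(ref_emb, ref_obs[col], query_emb, k=k)
        for col in columns
        if col in ref_obs
    }
```

`ref_obs` can be a dict of lists or a pandas DataFrame such as `adata.obs`; columns that are missing are skipped, so the same call works whether you want one resolution or all of them.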

For the uncertainty threshold (0.2 for HLCA), do you have a recommended value?

No idea, I'd just try with their value and iterate if needed.
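For the "try their value and iterate" part, a small sketch of how a threshold could be applied and scanned. It assumes you already have one predicted label and one uncertainty score per query cell; `mask_uncertain` and `unknown_fraction_by_threshold` are hypothetical helper names, and "Unknown" is just a placeholder label for masked cells.

```python
import numpy as np


def mask_uncertain(labels, uncertainty, threshold=0.2, unknown="Unknown"):
    """Replace predictions whose uncertainty exceeds the threshold."""
    labels = np.asarray(labels, dtype=object)
    uncertainty = np.asarray(uncertainty, dtype=float)
    if labels.shape != uncertainty.shape:
        raise ValueError("labels and uncertainty must have the same shape")
    out = labels.copy()
    out[uncertainty > threshold] = unknown
    return out


def unknown_fraction_by_threshold(uncertainty, thresholds):
    """Fraction of cells that would be masked at each candidate threshold."""
    uncertainty = np.asarray(uncertainty, dtype=float)
    return {float(t): float((uncertainty > t).mean()) for t in thresholds}
```

Starting from 0.2 and looking at how the masked fraction changes across a few candidate values gives a quick feel for how sensitive the result is before settling on a cutoff.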

zhangnan0107 · Answer 4 · Mon Aug 21 2023 21:49:29 GMT+0800 (China Standard Time)

Thanks a lot! That's really helpful!