Project new dataset to the atlas
zhangnan0107 opened this issue
Dear authors,
Thanks for sharing this valuable data and workflow. Do you have a script for projecting a new, user-provided dataset onto the atlas? Or is there a parameter/argument in the nextflow pipeline that can do such a projection? Many thanks.
Best,
Nan
In principle, you can follow the Querying the Human Lung Cell Atlas tutorial from scvi-tools.
The authors of scvi-tools added our atlas to their model hub on Hugging Face. So if you want to use our Lung Cancer Atlas reference (instead of the healthy lung reference from the Human Cell Atlas project used in the tutorial), you can simply download the appropriate model from Hugging Face:
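from scvi.hub import HubModel

# Download the atlas model from the Hugging Face model hub
hubmodel = HubModel.pull_from_huggingface_hub("non-small-cell-lung-cancer")
adata = hubmodel.adata  # reference AnnData object
model = hubmodel.model  # pretrained reference model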
For reference, here is the code that we used to project the data of the "extended" atlas onto the "core" atlas:
https://github.com/icbi-lab/luca/blob/89c4e6109bc723f6958cae7af791398b28e8e422/analyses/36_add_additional_datasets/36_scvi_mapping.py
But you are probably better off following that scvi-tools tutorial.
Thank you for the detailed information, which is quite helpful. There are several questions I'd like to clear up since the current tutorial is based on HLCA.
- Are there any requirements for the query data when switching to the NSCLC atlas? For example, HLCA has the following requirements:
  - Using the HLCA requires using Gene IDs for the query data.
  - The query data should include batches in query_data.obs["dataset"].
  - It’s necessary to run query_data.obs["scanvi_label"] = "unlabeled" so that scvi-tools can properly register the query data.
- Would the parameters in model training also work on NSCLC atlas?
surgery_epochs = 500
train_kwargs_surgery = {
    "early_stopping": True,
    "early_stopping_monitor": "elbo_train",
    "early_stopping_patience": 10,
    "early_stopping_min_delta": 0.001,
    "plan_kwargs": {"weight_decay": 0.0},
}
- I found there are multiple cell-type annotations in the NSCLC reference data: "cell_type", "cell_type_predicted", "cell_type_coarse", "cell_type_tumor", "cell_type_major". Which one(s) would you suggest using for label transfer?
- For the uncertainty threshold (0.2 for HLCA), do you have a recommended value?
Thanks!
Hi,
Are there any requirements for the query data when switching to the NSCLC atlas? For example, HLCA has the following requirements:
Using the HLCA requires using Gene IDs for the query data.
The query data should include batches in query_data.obs["dataset"].
It’s necessary to run query_data.obs["scanvi_label"] = "unlabeled" so that scvi-tools can properly register the query data.
- The LuCA reference uses gene symbols as identifiers.
- I have been using sample as the batch identifier for scvi-tools.
- You'll also need to set up a column with cell-type labels. You can use the cell_type column from the reference dataset for that, and you'll need to assign "unlabeled" to the new data.
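As a rough illustration of the points above, preparing the query metadata could look something like the sketch below. It only works on a pandas table standing in for the query's obs, plus plain lists of gene names. The helper names (prepare_query_obs, gene_symbol_overlap), the patient_sample column, and the toy gene lists are made up for the example, and whether "unlabeled" is the exact category the model expects is an assumption you should check against the downloaded model.

```python
import pandas as pd


def prepare_query_obs(obs: pd.DataFrame, sample_col: str) -> pd.DataFrame:
    """Return a copy of the query obs table with batch and label columns added."""
    obs = obs.copy()
    # Batch identifier: one value per sample (sample_col is a hypothetical source column).
    obs["sample"] = obs[sample_col].astype(str)
    # Query cells have no annotation yet, so mark all of them as unlabeled (assumed category name).
    obs["cell_type"] = "unlabeled"
    return obs


def gene_symbol_overlap(query_genes, reference_genes) -> float:
    """Fraction of reference gene symbols that are also present in the query."""
    reference = set(reference_genes)
    if not reference:
        raise ValueError("reference_genes must not be empty")
    return len(reference & set(query_genes)) / len(reference)


# Toy example data, not taken from the atlas.
query_obs = pd.DataFrame({"patient_sample": ["S1", "S1", "S2"]}, index=["c1", "c2", "c3"])
prepared = prepare_query_obs(query_obs, sample_col="patient_sample")

# A low overlap usually means the query still uses gene IDs rather than symbols.
overlap = gene_symbol_overlap(
    ["EPCAM", "CD3E", "ENSG00000141510"],
    ["EPCAM", "CD3E", "TP53", "PTPRC"],
)
print(prepared)
print(f"reference genes found in query: {overlap:.0%}")
```

With a real dataset, the same two columns would be set on query_adata.obs and the overlap computed from query_adata.var_names against the reference's var_names before handing the data to scvi-tools.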
Would the parameters in model training also work on NSCLC atlas?
I just went with the defaults and it seemed to work fine. I'm not an expert in these models -- this is something you would need to ask the scvi-tools authors.
I found there are multiple cell-type annotations in NSCLC reference data: "cell_type", "cell_type_predicted", "cell_type_coarse","cell_type_tumor","cell_type_major". I'd like to ask which one(s) you would suggest to use in label transfer?
These are different resolutions, from coarse to fine: cell_type_coarse > cell_type_major > cell_type > cell_type_tumor.
You can forget about cell_type_predicted.
You can use any or all of them for label transfer, depending on your needs.
For the uncertainty threshold (0.2 for HLCA), do you have a recommended value?
No idea, I'd just try with their value and iterate if needed.
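To make the "try and iterate" part concrete, here is a minimal sketch of how such a threshold could be applied. It assumes you already have a table of per-cell class probabilities, and it defines uncertainty as one minus the highest probability, which may not be how the tutorial computes it. The function name, the "Unknown" label, and the toy numbers are hypothetical.

```python
import pandas as pd


def apply_uncertainty_threshold(probs: pd.DataFrame, threshold: float = 0.2) -> pd.DataFrame:
    """Pick the most probable label per cell and mask uncertain calls as 'Unknown'."""
    predicted = probs.idxmax(axis=1)
    # Assumed definition: uncertainty = 1 - highest class probability.
    uncertainty = 1.0 - probs.max(axis=1)
    # Keep the prediction only where the uncertainty does not exceed the threshold.
    label = predicted.where(uncertainty <= threshold, "Unknown")
    return pd.DataFrame({"predicted": predicted, "uncertainty": uncertainty, "label": label})


# Toy probabilities for three cells (rows) and two classes (columns).
probs = pd.DataFrame(
    {"T cell": [0.95, 0.60, 0.10], "B cell": [0.05, 0.40, 0.90]},
    index=["c1", "c2", "c3"],
)
result = apply_uncertainty_threshold(probs, threshold=0.2)
print(result)
```

Rerunning this with a few different threshold values and checking how many cells end up as "Unknown" is one simple way to iterate.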
Thanks a lot! That's really helpful!