openproblems-bio / openproblems-v2

Formalizing and benchmarking open problems in single-cell genomics

Align `label_projection` task with `openpipelines-bio/openpipelines/src/labels_transfer`

rcannood opened this issue

It would be nice to align the codebases for the OpenProblems and OpenPipelines label transfer methods, so that we know how well the label transfer methods in OpenPipelines work (cc: @VladimirShitov, related to openpipelines-bio/openpipeline#221).

  • Input / output files are different:
    OpenProblems: input_train, input_test, output
    OpenPipelines: input, reference, output
    Possible solutions:

    • This is OK; we can use a regex to rename the different fields when porting code from one to the other
    • It's better to rename the OpenProblems arguments to those used in OpenPipelines so the syntax is aligned
  • OpenPipelines allows choosing which obs(m) slot to retrieve / store information
    Example: reference_obsm_key, query_obsm_key, targets
    Possible solutions:

    • Rename arguments to reference_obsm_embedding, reference_obs_label, input_obsm_embedding, output_obs_label (see the sketch after this list).
    • ...
  • TBD: some methods use different embeddings
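To make the renaming proposal concrete, here is a minimal sketch of a label transfer component using the proposed argument names. The `par` dict mimicking Viash's injected parameters, the slot names and the kNN classifier are illustrative assumptions, not the actual OpenPipelines implementation.

```python
import anndata as ad
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical parameter block following the proposed naming scheme;
# in a Viash component these values would be injected as `par`.
par = {
    "input": "query.h5ad",
    "reference": "reference.h5ad",
    "output": "output.h5ad",
    "reference_obsm_embedding": "X_emb",   # embedding slot in the reference
    "reference_obs_label": "cell_type",    # labels to transfer
    "input_obsm_embedding": "X_emb",       # embedding slot in the query
    "output_obs_label": "cell_type_pred",  # where predictions end up
}

reference = ad.read_h5ad(par["reference"])
query = ad.read_h5ad(par["input"])

# Placeholder classifier: fit on the reference embedding, predict on the query.
clf = KNeighborsClassifier(n_neighbors=15)
clf.fit(
    reference.obsm[par["reference_obsm_embedding"]],
    reference.obs[par["reference_obs_label"]],
)
query.obs[par["output_obs_label"]] = clf.predict(
    query.obsm[par["input_obsm_embedding"]]
)

query.write_h5ad(par["output"])
```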

Regarding the last point for discussion: I believe it is common to use the latent embedding space to train the classifier and then to predict the labels. At least, this is what is done in the HLCA.

However, all the methods in open problems (except for scANVI) currently use the gene expression space. I think it would be great to add classifiers trained on a latent embedding to open problems.
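For context, a latent-space classifier baseline could look like the sketch below: an scVI embedding is learned on all cells and a simple classifier is fitted on the labelled (train) cells only. The column names (batch, cell_type, is_train), the X_scvi slot and the logistic regression are assumptions for illustration, not an existing OpenProblems method.

```python
import scanpy as sc
import scvi
from sklearn.linear_model import LogisticRegression

adata = sc.read_h5ad("dataset.h5ad")  # assumed to contain both train and test cells

# Learn a batch-aware latent embedding on all cells (unsupervised).
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
vae = scvi.model.SCVI(adata)
vae.train()
adata.obsm["X_scvi"] = vae.get_latent_representation()

# Fit the classifier on the latent embedding of the labelled (train) cells only.
is_train = adata.obs["is_train"].to_numpy()
clf = LogisticRegression(max_iter=1000)
clf.fit(adata.obsm["X_scvi"][is_train], adata.obs["cell_type"][is_train])

# Predict labels for the held-out (test) cells.
adata.obs.loc[~is_train, "cell_type_pred"] = clf.predict(adata.obsm["X_scvi"][~is_train])
```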

I think this should be discussed... The approaches that we have in open pipelines are reference mapping approaches. We don't have that in open problems yet, partly because it requires loading large reference datasets... I wouldn't align the reference-mapping label projection with the open problems task.

If I understand correctly, your point is that the label transfer methods in the OpenPipelines PR should be used to predict labels on a new dataset given a large reference dataset, whereas in OpenProblems the label projection benchmark has been using smaller datasets (where the batch label is used to split up existing datasets).

However, the metrics and the methods in the OpenProblems repo should still apply to the use case you're describing, and it's just a matter of selecting new datasets which cover that particular use case.
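As a point of comparison, the within-dataset setup currently used by the label projection task roughly amounts to splitting an existing dataset by its batch label, as in the sketch below (the batch and cell_type column names and the hold-out choice are assumptions, not the exact dataset loader).

```python
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")

# Hold out one batch as the "unseen" test data; the remaining batches serve as
# labelled training data. This mimics splitting up an existing dataset rather
# than mapping onto a large external reference.
test_batch = adata.obs["batch"].unique()[-1]
adata.obs["is_train"] = adata.obs["batch"] != test_batch

# Hide the labels of the test cells so that methods have to predict them.
adata.obs["label"] = adata.obs["cell_type"].where(adata.obs["is_train"])
```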

Alternatively, we could label methods which are meant to work well within the same (partially labelled) dataset as category: within_dataset, and methods which are designed to work with large reference datasets (pan-dataset) as category: pan_dataset (proposed labels are awful but you get the idea).

Vladimir brought up that the same code base is already being used in multiple places, so instead of aligning the two code bases we could put those functions in a separate package. That would be fine by me, but then we have to think about what to do with the R methods.

Also, the code doesn't assume any particular datasets; it just requires inputs, which you can change between projects. What I find weird is that we currently copy and paste the same label transfer code between different repositories.

Maybe it makes sense to create another small package with the label transfer code, and just import it in the different projects?

Let's discuss this later today.

> However, the metrics and the methods in the OpenProblems repo should still apply to the use case you're describing, and it's just a matter of selecting new datasets which cover that particular use case.

The metrics from open problems should apply, but not the methods. It's not just about using a large dataset, but about using a pre-trained reference model in open pipelines.
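To illustrate the difference, reference mapping starts from an already trained reference model and only fine-tunes it on the query, roughly as in the scvi-tools / scArches-style sketch below; the file paths and the choice of scANVI are assumptions, not the exact OpenPipelines component.

```python
import anndata as ad
import scvi

query = ad.read_h5ad("query.h5ad")

# Attach the query data to a pre-trained reference model (scArches-style
# surgery) instead of training a model from scratch on a combined dataset.
model = scvi.model.SCANVI.load_query_data(query, "reference_scanvi_model/")
model.train(max_epochs=100, plan_kwargs={"weight_decay": 0.0})

# Transfer the reference labels to the query cells.
query.obs["cell_type_pred"] = model.predict(query)
```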

Also, I wouldn't put the code into a separate package... that's not in scope for open pipelines. There is another project that we're working on for more-or-less this, but with a lot more around it. It's too early to make one project depend on the other.