openproblems-bio / openproblems-v2

Formalizing and benchmarking open problems in single-cell genomics

Align `label_projection` task with `openpipelines-bio/openpipelines/src/labels_transfer`

rcannood opened this issue

It would be nice to align the codebases for the OpenProblems and OpenPipelines label transfer methods, so that we know how well the label transfer methods in OpenPipelines work (cc: @VladimirShitov, related to openpipelines-bio/openpipeline#221).

  • Input / output files are different:
    OpenProblems: input_train, input_test, output
    OpenPipelines: input, reference, output
    Possible solutions:

    • This is OK; we can use a regex to rename the different fields when porting code from one to the other
    • It's better to rename the OpenProblems arguments to those used in OpenPipelines so the syntax is aligned
  • OpenPipelines allows choosing which obs(m) slot to retrieve / store information
    Example: reference_obsm_key, query_obsm_key, targets
    Possible solutions:

    • Rename arguments to reference_obsm_embedding, reference_obs_label, input_obsm_embedding, output_obs_label (see the sketch after this list).
    • ...
  • TBD: some methods use different embeddings
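To make the renaming proposal concrete, here is a minimal sketch of a label transfer component using the proposed argument names. The `par` dict mimicking Viash's injected parameters, the slot names and the kNN classifier are illustrative assumptions, not the actual OpenPipelines implementation.

```python
import anndata as ad
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical parameter block following the proposed naming scheme;
# in a Viash component these values would be injected as `par`.
par = {
    "input": "query.h5ad",
    "reference": "reference.h5ad",
    "output": "output.h5ad",
    "reference_obsm_embedding": "X_emb",   # embedding slot in the reference
    "reference_obs_label": "cell_type",    # labels to transfer
    "input_obsm_embedding": "X_emb",       # embedding slot in the query
    "output_obs_label": "cell_type_pred",  # where predictions end up
}

reference = ad.read_h5ad(par["reference"])
query = ad.read_h5ad(par["input"])

# Placeholder classifier: fit on the reference embedding, predict on the query.
clf = KNeighborsClassifier(n_neighbors=15)
clf.fit(
    reference.obsm[par["reference_obsm_embedding"]],
    reference.obs[par["reference_obs_label"]],
)
query.obs[par["output_obs_label"]] = clf.predict(
    query.obsm[par["input_obsm_embedding"]]
)

query.write_h5ad(par["output"])
```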

Regarding the last point for discussion: I believe it is common to use the latent embedding space to train the classifier and then to predict the labels. At least, this is what is done in the HLCA.

However, all the methods in open problems (except for scANVI) currently use the gene expression space. I think it would be great to add classifiers trained on a latent embedding to open problems.
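For context, a latent-space classifier baseline could look like the sketch below: an scVI embedding is learned on all cells and a simple classifier is fitted on the labelled (train) cells only. The column names (batch, cell_type, is_train), the X_scvi slot and the logistic regression are assumptions for illustration, not an existing OpenProblems method.

```python
import scanpy as sc
import scvi
from sklearn.linear_model import LogisticRegression

adata = sc.read_h5ad("dataset.h5ad")  # assumed to contain both train and test cells

# Learn a batch-aware latent embedding on all cells (unsupervised).
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
vae = scvi.model.SCVI(adata)
vae.train()
adata.obsm["X_scvi"] = vae.get_latent_representation()

# Fit the classifier on the latent embedding of the labelled (train) cells only.
is_train = adata.obs["is_train"].to_numpy()
clf = LogisticRegression(max_iter=1000)
clf.fit(adata.obsm["X_scvi"][is_train], adata.obs["cell_type"][is_train])

# Predict labels for the held-out (test) cells.
adata.obs.loc[~is_train, "cell_type_pred"] = clf.predict(adata.obsm["X_scvi"][~is_train])
```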

I think this should be discussed... The approaches that we have in open pipelines are reference mapping approaches. We don't have that in open problems yet, partly because it requires loading large reference datasets... I wouldn't align the reference-mapping label projection with the open problems task.

If I understand correctly, your point is that the label transfer methods in the OpenPipelines PR should be used to predict labels on a new dataset given a large reference dataset, whereas in OpenProblems the label projection benchmark has been using smaller datasets (where the batch label is used to split up existing datasets).

However, the metrics and the methods in the OpenProblems repo should still apply to the use case you're describing, and it's just a matter of selecting new datasets which cover that particular use case.
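As a point of comparison, the within-dataset setup currently used by the label projection task roughly amounts to splitting an existing dataset by its batch label, as in the sketch below (the batch and cell_type column names and the hold-out choice are assumptions, not the exact dataset loader).

```python
import scanpy as sc

adata = sc.read_h5ad("dataset.h5ad")

# Hold out one batch as the "unseen" test data; the remaining batches serve as
# labelled training data. This mimics splitting up an existing dataset rather
# than mapping onto a large external reference.
test_batch = adata.obs["batch"].unique()[-1]
adata.obs["is_train"] = adata.obs["batch"] != test_batch

# Hide the labels of the test cells so that methods have to predict them.
adata.obs["label"] = adata.obs["cell_type"].where(adata.obs["is_train"])
```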

Alternatively, we could label methods which are meant to work well within the same (partially labelled) dataset as category: within_dataset, and methods which are designed to work with large reference datasets (pan-dataset) as category: pan_dataset (proposed labels are awful but you get the idea).

Vladimir brought up that the same code base is already being used in multiple places, so instead of aligning the two code bases we could put those functions in a separate package. That would be fine by me, but then we have to think about what to do with the R methods.

Also, the code doesn't assume any particular datasets; it just requires inputs, which you can change between projects. What I find weird is that we currently copy and paste the same label transfer code between different repositories.

Maybe it makes sense to create another small package with the label transfer code, and just import it in the different projects?

Let's discuss this later today.

> However, the metrics and the methods in the OpenProblems repo should still apply to the use case you're describing, and it's just a matter of selecting new datasets which cover that particular use case.

The metrics from open problems should apply, but not the methods. It's not just about using a large dataset, but about using a pre-trained reference model in open pipelines.
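To illustrate the difference, reference mapping starts from an already trained reference model and only fine-tunes it on the query, roughly as in the scvi-tools / scArches-style sketch below; the file paths and the choice of scANVI are assumptions, not the exact OpenPipelines component.

```python
import anndata as ad
import scvi

query = ad.read_h5ad("query.h5ad")

# Attach the query data to a pre-trained reference model (scArches-style
# surgery) instead of training a model from scratch on a combined dataset.
model = scvi.model.SCANVI.load_query_data(query, "reference_scanvi_model/")
model.train(max_epochs=100, plan_kwargs={"weight_decay": 0.0})

# Transfer the reference labels to the query cells.
query.obs["cell_type_pred"] = model.predict(query)
```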

Also, I wouldn't put the code into a separate package... that's not in scope for open pipelines. There is another project that we're working on for more-or-less this, but with a lot more around it. It's too early to make one project depend on the other.