AllenInstitute / cell_type_mapper

Repository for storing prototype functionality implementations for the BKP

change in output information

mkunst23 opened this issue · comments

Hi,

I have a request for the simple correlation-based mapping (flat mapping). In addition to the best-correlated cell type per query cell with its average correlation score, can you also output a list of the 25 next-best clusters with their associated correlation scores?

Thanks,
Michael

@mkunst23

Given that flat mapping works as follows

  1. Randomly select 90% of marker genes
  2. Find the most correlated cluster
  3. Repeat (1) and (2) 100 times with a different 90% set of marker genes
  4. Choose the cluster that came up "most correlated" in the plurality of the 100 iterations
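A minimal sketch of the four steps above, for a single query cell. This is an illustration, not the cell_type_mapper implementation: the function name `flat_map_cell`, its arguments, the use of Pearson correlation via `numpy.corrcoef`, and averaging the winner's correlation only over the iterations it won are all assumptions.

```python
import numpy as np


def flat_map_cell(query, cluster_means, n_iter=100, frac=0.9, rng=None):
    """Bootstrap-vote a single query cell onto a cluster (hypothetical helper).

    query:         (n_genes,) expression of the cell over the marker genes
    cluster_means: (n_clusters, n_genes) mean expression of each cluster
    """
    rng = np.random.default_rng(rng)
    n_clusters, n_genes = cluster_means.shape
    n_keep = max(2, int(round(frac * n_genes)))

    votes = np.zeros(n_clusters, dtype=np.int64)
    corr_sum = np.zeros(n_clusters, dtype=float)

    for _ in range(n_iter):
        # (1) randomly select 90% of the marker genes
        genes = rng.choice(n_genes, size=n_keep, replace=False)
        # (2) find the most correlated cluster on that subset
        corr = np.array(
            [np.corrcoef(query[genes], mean[genes])[0, 1] for mean in cluster_means]
        )
        best = int(np.argmax(corr))
        votes[best] += 1
        corr_sum[best] += corr[best]

    # (4) the cluster with the plurality of votes wins
    winner = int(np.argmax(votes))
    # Assumption: average the correlation over the iterations the winner won
    avg_corr = corr_sum[winner] / votes[winner]
    return winner, votes, avg_corr
```

The per-cluster `votes` array is what any "next best" definition would have to be built from.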

How do you want to define "25 next best clusters"? Is this the 25 clusters that got the 2nd through 26th most votes from bootstrapping?

Or do we need to choose the N most correlated clusters in (2) and come up with a more complicated "vote counting" scheme that accounts for "clusterA was most correlated 15 times and second-most-correlated 10 times..."?

Hi Scott,

I would pick the first option. That way we can measure mapping quality by how often a cell is confused with a non-majority cluster.
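The first option (rank clusters by how many bootstrap iterations they won) could be sketched as follows. The function name `runners_up`, its inputs, the tie-breaking rule, and the choice to drop clusters with zero votes are assumptions made for illustration, not behavior confirmed in this thread.

```python
import numpy as np


def runners_up(votes, corr_sum, n_runners_up=25):
    """Rank the non-winning clusters by bootstrap votes (hypothetical helper).

    votes:    number of iterations in which each cluster was most correlated
    corr_sum: sum of each cluster's correlation over the iterations it won
    Returns a list of (cluster_index, n_votes, average_correlation).
    """
    votes = np.asarray(votes, dtype=np.int64)
    corr_sum = np.asarray(corr_sum, dtype=float)

    # Descending by votes; the stable sort keeps the lower index first on ties
    order = np.argsort(-votes, kind="stable")

    # Skip the winner (first entry) and any cluster that never won an iteration
    candidates = [int(i) for i in order[1:] if votes[i] > 0][:n_runners_up]

    return [(i, int(votes[i]), float(corr_sum[i] / votes[i])) for i in candidates]
```

With this definition a cell can have fewer than `n_runners_up` entries, since at most 100 distinct clusters can receive a vote and most cells will concentrate their votes on a handful.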

So glad you said that: it will be the easiest to implement (once I can focus on this, which will clearly be the middle of next week).

@mkunst23

I am finally getting around to addressing this issue.

My initial thought was to record the 25 "runner up" clusters and their average correlation coefficients in the extended output JSON file. This, however, would blow up that already large file from 2 GB to 16 GB (for the 4 million cell MERFISH data), so I think I may need to abandon my dream of an output JSON blob and accept the reality that we need to use a pandas dataframe written out to HDF5.

I have two schemes in mind. I've simulated examples here:

/allen/aibs/technology/danielsf/knowledge_base/scratch/output_design

many_df.h5 records each level of the taxonomy in a separate dataframe. In Python, you would get the dataframe of cluster assignments with

import pandas
cluster_df = pandas.read_hdf('many_df.h5', key='CCN20230504_CLUS')

Similarly, you would get the dataframe of subclass assignments with

subclass_df = pandas.read_hdf('many_df.h5', key='CCN20230504_SUBC')

etc. Each dataframe has the same columns. The runner up assignments are in columns named runner_up_[0-25] and the corresponding correlation coefficients are in runner_up_[0-25]_cor (please note that this data is all randomly generated; I just wanted to simulate the shape).
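To make the per-level column layout concrete, here is a sketch that builds one such dataframe in memory. The `cell_id` index, the function name `build_level_df`, and the padding of missing runners-up with `None`/`NaN` are assumptions; only the `assignment`, `runner_up_<i>`, and `runner_up_<i>_cor` naming pattern comes from the description above.

```python
import pandas as pd


def build_level_df(cell_ids, assignments, runners, n_runners_up):
    """Build one taxonomy level's dataframe (hypothetical helper).

    runners: for each cell, a list of (label, correlation) pairs,
             ordered from best to worst runner-up.
    """
    rows = []
    for cell_id, assignment, cell_runners in zip(cell_ids, assignments, runners):
        row = {"cell_id": cell_id, "assignment": assignment}
        for i in range(n_runners_up):
            if i < len(cell_runners):
                label, cor = cell_runners[i]
            else:
                # Assumption: pad cells that have fewer runners-up
                label, cor = None, float("nan")
            row[f"runner_up_{i}"] = label
            row[f"runner_up_{i}_cor"] = cor
        rows.append(row)
    return pd.DataFrame(rows).set_index("cell_id")
```

Each level's dataframe could then be written under its own key with `DataFrame.to_hdf(path, key=...)`, which requires the PyTables package to be installed.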

single_df.h5 records all of the results at all taxonomic levels in a single dataframe. The columns are prefixed with the name of the taxonomic level, i.e.

CCN20230504_CLAS_assignment,
CCN20230504_CLAS_bootstrapping_probability,
...
CCN20230504_SUBC_assignment,
CCN20230504_SUBC_bootstrapping_probability,
...

The dataframe can be read in with

import pandas
df = pandas.read_hdf('single_df.h5', key='results')

I prefer the many_df.h5 shape. I do not like prefixing the column names with the taxonomic level. I'm not a fan of long column names. Is there a shape you prefer (can either of these be easily accessed in R)?
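The two shapes carry the same information, so one can be derived from the other. As a sketch, the prefixed columns of the single-dataframe layout can be split into one dataframe per level with short column names. The function name `split_by_level` is hypothetical, and the level names must be supplied by the caller.

```python
import pandas as pd


def split_by_level(df, levels):
    """Split a prefixed, single-dataframe result into one dataframe per level."""
    per_level = {}
    for level in levels:
        prefix = f"{level}_"
        cols = [c for c in df.columns if c.startswith(prefix)]
        # Strip the level prefix so every per-level dataframe has the same columns
        per_level[level] = df[cols].rename(columns=lambda c: c[len(prefix):])
    return per_level
```

For example, `split_by_level(df, ["CCN20230504_CLAS", "CCN20230504_SUBC"])["CCN20230504_SUBC"]` would expose `assignment` and `bootstrapping_probability` columns directly.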

This was addressed a long time ago. The mapping tool now has an n_runners_up config parameter that specifies how many runner-up assignments to output.