change in output information

mkunst23 opened this issue a year ago · comments

Hi,

I have a request for the simple correlation based mapping (flat mapping). In addition to the best correlated cell type per query cell with its average correlation score, can you also output a list of the 25 next best clusters with their associated correlation scores?

Thanks,
Michael

Changkyu Lee · Answer 1 · Fri Jun 16 2023 07:28:18 GMT+0800 (China Standard Time)

Michael, the correlation mapping result has a field “map.freq” in addition to best.map.df. “map.freq” reports, for each cell, every cluster the cell was mapped to across the N (default 100) bootstrap iterations, along with the average correlation. Please check whether this output serves your purpose. Thanks, CK

danielsf · Answer 2 · Fri Jun 16 2023 23:45:26 GMT+0800 (China Standard Time)

@mkunst23

Given that flat mapping works as follows:

1. Randomly select 90% of marker genes
2. Find the most correlated cluster
3. Repeat (1) and (2) 100 times with a different 90% set of marker genes
4. Choose the cluster that came up "most correlated" in the plurality of the 100 iterations

How do you want to define "25 next best clusters"? Is this the 25 clusters that got the 2nd-25th most votes from bootstrapping?

Or do we need to choose the N most correlated clusters in (2) and come up with a more complicated "vote counting" scheme that accounts for "clusterA was most correlated 15 times and second-most-correlated 10 times..."

?
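
For concreteness, the first reading (rank clusters by how many bootstrap votes they received) could be sketched roughly as below. This is only an illustration of the steps described above, not the mapping tool's actual code: the function name `bootstrap_votes`, its arguments, and the use of Pearson correlation against per-cluster mean expression profiles are all assumptions made for the example.

```python
from collections import defaultdict

import numpy as np


def bootstrap_votes(query, cluster_means, n_iter=100, frac=0.9, seed=0):
    """Hypothetical sketch of vote counting for one query cell.

    query: (n_genes,) expression of the cell over the marker genes.
    cluster_means: (n_clusters, n_genes) mean profile of each cluster.
    Returns (cluster_index, vote_fraction, avg_correlation) tuples,
    sorted from most to fewest votes.
    """
    rng = np.random.default_rng(seed)
    n_genes = query.shape[0]
    n_keep = max(2, int(round(frac * n_genes)))
    votes = defaultdict(int)
    cor_sum = defaultdict(float)
    for _ in range(n_iter):
        # Randomly select a subset (90% by default) of the marker genes.
        genes = rng.choice(n_genes, size=n_keep, replace=False)
        q = query[genes] - query[genes].mean()
        c = cluster_means[:, genes]
        c = c - c.mean(axis=1, keepdims=True)
        # Pearson correlation of the cell with every cluster.
        cor = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
        # The most correlated cluster gets this iteration's vote.
        best = int(np.argmax(cor))
        votes[best] += 1
        cor_sum[best] += float(cor[best])
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    return [(cl, n / n_iter, cor_sum[cl] / n) for cl, n in ranked]


# Simulated data: 5 clusters, 50 marker genes, query resembles cluster 2.
rng = np.random.default_rng(1)
cluster_means = rng.normal(size=(5, 50))
query = cluster_means[2] + 0.05 * rng.normal(size=50)
ranked = bootstrap_votes(query, cluster_means)
assignment, runners_up = ranked[0], ranked[1:]
```

Under this scheme the "runners up" are simply the clusters after the first entry of `ranked`, and a cluster that never won an iteration does not appear at all.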

Michael Kunst · Answer 3 · Sat Jun 17 2023 00:17:43 GMT+0800 (China Standard Time)

Hi Scott,

I would pick the first option. That way we can measure mapping quality by how often the mapping confuses the assigned cluster with a non-majority cluster.

danielsf · Answer 4 · Sat Jun 17 2023 00:35:28 GMT+0800 (China Standard Time)

so glad you said that: it will be the easiest to implement (once I can focus on this, which will clearly be middle of next week)

danielsf · Answer 5 · Thu Jul 13 2023 05:57:47 GMT+0800 (China Standard Time)

@mkunst23

I am finally getting around to addressing this issue.

My initial thought was to record the 25 "runner up" clusters and their average correlation coefficients in the extended output JSON file. This, however, would blow up that already large file from 2 GB to 16 GB (for the 4 million cell MERFISH data), so I think I may need to abandon my dream of an output JSON blob and accept the reality that we need to use a pandas dataframe written out to HDF5.

I have two schemes in mind. I've simulated examples here:

/allen/aibs/technology/danielsf/knowledge_base/scratch/output_design

many_df.h5 records each level of the taxonomy in a separate dataframe. In Python, you would get the dataframe of cluster assignments with

import pandas
cluster_df = pandas.read_hdf('many_df.h5', key='CCN20230504_CLUS')

Similarly, you would get the dataframe of subclass assignments with

subclass_df = pandas.read_hdf('many_df.h5', key='CCN20230504_SUBC')

etc. Each dataframe has the same columns. The runner up assignments are in columns named runner_up_[0-25] and the corresponding correlation coefficients are in runner_up_[0-25]_cor (please note that this data is all randomly generated; I just wanted to simulate the shape).
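
As a rough illustration of that column layout, the sketch below builds a tiny in-memory dataframe with the same `runner_up_N` / `runner_up_N_cor` naming and reshapes it to one row per (cell, rank). The `cell_id` column, the cluster labels, and the small sizes are assumptions for the example; like the simulated files, the values are random.

```python
import numpy as np
import pandas as pd

n_cells, n_runners_up = 4, 3  # kept small for illustration
rng = np.random.default_rng(0)

# Wide layout: one row per cell, one pair of columns per runner up.
data = {"cell_id": [f"cell_{i}" for i in range(n_cells)]}
for k in range(n_runners_up):
    data[f"runner_up_{k}"] = [
        f"cluster_{j}" for j in rng.integers(0, 10, size=n_cells)
    ]
    data[f"runner_up_{k}_cor"] = rng.random(n_cells)
wide = pd.DataFrame(data)

# Long layout: one row per (cell, rank), easier to filter and group.
parts = []
for k in range(n_runners_up):
    part = wide[["cell_id", f"runner_up_{k}", f"runner_up_{k}_cor"]].copy()
    part.columns = ["cell_id", "cluster", "cor"]
    part["rank"] = k
    parts.append(part)
long = (
    pd.concat(parts, ignore_index=True)
    .sort_values(["cell_id", "rank"])
    .reset_index(drop=True)
)
```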

single_df.h5 records all of the results at all taxonomic levels in a single dataframe. The columns are prefixed with the name of the taxonomic level, i.e.

CCN20230504_CLAS_assignment,
CCN20230504_CLAS_bootstrapping_probability,
...
CCN20230504_SUBC_assignment,
CCN20230504_SUBC_bootstrapping_probability,
...

The dataframe can be read in with

import pandas
df = pandas.read_hdf('single_df.h5', key='results')

I prefer the many_df.h5 shape. I do not like prefixing the column names with the taxonomic level. I'm not a fan of long column names. Is there a shape you prefer (can either of these be easily accessed in R)?
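
Either shape can be derived from the other, so the choice is mostly about convenience. A minimal sketch of going from the prefixed single dataframe to one dataframe per level is below; the helper `split_by_level` is hypothetical, the toy values are made up, and the `CCN20230504_SUBC_bootstrapping_probability` column is assumed by analogy with the class-level columns.

```python
import pandas as pd

# Toy stand-in for the single prefixed dataframe.
single = pd.DataFrame({
    "CCN20230504_CLAS_assignment": ["a", "b"],
    "CCN20230504_CLAS_bootstrapping_probability": [0.9, 0.8],
    "CCN20230504_SUBC_assignment": ["x", "y"],
    "CCN20230504_SUBC_bootstrapping_probability": [0.7, 0.6],
})
levels = ["CCN20230504_CLAS", "CCN20230504_SUBC"]


def split_by_level(df, levels):
    """Return {level: dataframe} with the level prefix stripped."""
    out = {}
    for level in levels:
        prefix = level + "_"
        cols = [c for c in df.columns if c.startswith(prefix)]
        # Drop the prefix so every level ends up with the same columns.
        out[level] = df[cols].rename(columns=lambda c: c[len(prefix):])
    return out


per_level = split_by_level(single, levels)
subclass_df = per_level["CCN20230504_SUBC"]
```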

danielsf · Answer 6 · Wed Dec 13 2023 00:33:52 GMT+0800 (China Standard Time)

This was addressed a long time ago. The mapping tool now has an n_runners_up config parameter that specifies how many runner up assignments to output.