jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium

Some JCP labels may be missing in compounds.csv.gz

hanslovsky opened this issue

I am currently downloading metadata and images following #72. I match the metadata from load_data_with_illum.parquet against the contents of well.csv.gz to get the JCP2022 id, and then double-check that each JCP2022 id is valid via compound.csv.gz, crispr.csv.gz, and orf.csv.gz. I found that JCP2022_028373 is in well.csv.gz but not in any of compound.csv.gz, crispr.csv.gz, or orf.csv.gz. This is how you can reproduce it:

In [76]: well = pd.read_csv('well.csv.gz')

In [77]: compound = pd.read_csv('compound.csv.gz')

In [78]: crispr = pd.read_csv('crispr.csv.gz')

In [79]: orf = pd.read_csv('orf.csv.gz')

In [80]: all_jcps = pd.concat([x[['Metadata_JCP2022']] for x in [compound, crispr, orf]])

In [81]: joined = well[['Metadata_JCP2022']].drop_duplicates().merge(all_jcps.assign(x=all_jcps.Metadata_JCP2022), 'left')

In [82]: joined[joined.x.isnull()]
Out[82]:
       Metadata_JCP2022    x
83013   JCP2022_UNKNOWN  NaN
135738   JCP2022_028373  NaN

As far as I can tell, all other JCPs in well.csv.gz have a corresponding entry in one of the JCP files. I will treat this one like JCP2022_UNKNOWN. Is there any chance that the missing JCP will be added to the metadata in the future?
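For anyone who wants to repeat this check, the same anti-join can be sketched with `merge(..., indicator=True)`, which labels unmatched rows directly instead of relying on a helper column. This is only a sketch: the small DataFrames below are made-up stand-ins for the real files (only JCP2022_028373 and JCP2022_UNKNOWN come from the output above), and `find_orphan_ids` is a hypothetical helper name, not something from the repository.

```python
import pandas as pd

# Toy stand-ins for well.csv.gz, compound.csv.gz, crispr.csv.gz, orf.csv.gz.
# All ids except JCP2022_028373 and JCP2022_UNKNOWN are invented for illustration.
well = pd.DataFrame({"Metadata_JCP2022": [
    "JCP2022_000001", "JCP2022_028373", "JCP2022_UNKNOWN", "JCP2022_000001",
]})
compound = pd.DataFrame({"Metadata_JCP2022": ["JCP2022_000001"]})
crispr = pd.DataFrame({"Metadata_JCP2022": ["JCP2022_800001"]})
orf = pd.DataFrame({"Metadata_JCP2022": ["JCP2022_900001"]})


def find_orphan_ids(well, *perturbation_tables, column="Metadata_JCP2022"):
    """Return ids that occur in `well` but in none of the perturbation tables."""
    known = pd.concat([t[[column]] for t in perturbation_tables]).drop_duplicates()
    merged = (
        well[[column]]
        .drop_duplicates()
        .merge(known, on=column, how="left", indicator=True)
    )
    # "_merge" is "left_only" for ids that found no partner in `known`.
    return sorted(merged.loc[merged["_merge"] == "left_only", column])


orphans = find_orphan_ids(well, compound, crispr, orf)
print(orphans)
```

With the real files, the three toy tables would be replaced by the `pd.read_csv(...)` results from the session above.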

Hi @hanslovsky, you can treat JCP2022_028373 as JCP2022_UNKNOWN. JCP2022_028373 should have been removed from well.csv.gz. We will fix that. Thanks for catching it!

Thank you @niranjchandrasekaran! Should JCP2022_028373 be removed completely from well.csv.gz, or replaced by JCP2022_UNKNOWN?
I can make a PR with the change if that is helpful for you.

Hi @hanslovsky, thanks for offering to do that! We generate these metadata files using a couple of internal scripts, so we want to re-run them to keep everything reproducible. I will keep you posted. For the time being, you can ignore all wells with JCP2022_028373.

That makes sense. I will mark JCP2022_028373 unknown on my side.
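A minimal sketch of what marking the id as unknown could look like on the consumer side, assuming the well table is already loaded into pandas. The toy rows, the `Metadata_Well` values, and the helper name `mark_unknown` are illustrative assumptions, not part of the dataset tooling.

```python
import pandas as pd

UNKNOWN = "JCP2022_UNKNOWN"
ORPHAN_IDS = ["JCP2022_028373"]  # ids with no entry in compound/crispr/orf


def mark_unknown(well, orphan_ids=ORPHAN_IDS, column="Metadata_JCP2022"):
    """Return a copy of `well` with orphaned ids replaced by JCP2022_UNKNOWN."""
    out = well.copy()
    is_orphan = out[column].isin(orphan_ids)
    # Series.where keeps values where the condition holds and substitutes elsewhere.
    out[column] = out[column].where(~is_orphan, UNKNOWN)
    return out


# Toy rows standing in for well.csv.gz (values invented for illustration).
well = pd.DataFrame({
    "Metadata_Well": ["A01", "A02", "A03"],
    "Metadata_JCP2022": ["JCP2022_000001", "JCP2022_028373", "JCP2022_UNKNOWN"],
})
cleaned = mark_unknown(well)
print(cleaned)
```

To ignore those wells entirely instead, as suggested above, the same boolean mask can be used to filter: `well[~well["Metadata_JCP2022"].isin(ORPHAN_IDS)]`.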

Hi @hanslovsky, well.csv.gz should now be fixed. Thanks for bringing this to our attention!