Some JCP labels may be missing in compounds.csv.gz

Question

hanslovsky opened this issue a year ago

I am currently downloading metadata and images following #72. I match the metadata from load_data_with_illum.parquet against the contents of well.csv.gz to get the JCP2022 ID, and then double-check that each JCP2022 ID is valid by looking it up in compound.csv.gz, crispr.csv.gz, and orf.csv.gz. I found that JCP2022_028373 is in well.csv.gz but not in any of compound.csv.gz, crispr.csv.gz, or orf.csv.gz. This is how you can reproduce it:

In [76]: well = pd.read_csv('well.csv.gz')

In [77]: compound = pd.read_csv('compound.csv.gz')

In [78]: crispr = pd.read_csv('crispr.csv.gz')

In [79]: orf = pd.read_csv('orf.csv.gz')

In [80]: all_jcps = pd.concat([x[['Metadata_JCP2022']] for x in [compound, crispr, orf]])

In [81]: joined = well[['Metadata_JCP2022']].drop_duplicates().merge(all_jcps.assign(x=all_jcps.Metadata_JCP2022), 'left')

In [82]: joined[joined.x.isnull()]
Out[82]:
       Metadata_JCP2022    x
83013   JCP2022_UNKNOWN  NaN
135738   JCP2022_028373  NaN

As far as I can tell, all other JCPs in well.csv.gz have a corresponding entry in one of the JCP files. I will treat this one like JCP2022_UNKNOWN. Is there any chance that this missing JCP will be added to the metadata in the future?
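The same check can also be written as an anti-join with the `indicator` option of `DataFrame.merge`, which avoids the helper column `x`. This is only a minimal sketch: the function name `find_unmatched_jcp_ids` is hypothetical, and the small tables use made-up IDs in place of the real CSV files.

```python
import pandas as pd


def find_unmatched_jcp_ids(well, reference_tables, key="Metadata_JCP2022"):
    """Return the sorted IDs in `well` that appear in none of `reference_tables`."""
    # Collect every known ID once; duplicates across tables do not matter here.
    known = pd.concat([t[[key]] for t in reference_tables]).drop_duplicates()
    # indicator=True adds a `_merge` column that is 'left_only' for IDs without a match.
    merged = (
        well[[key]]
        .drop_duplicates()
        .merge(known, on=key, how="left", indicator=True)
    )
    return sorted(merged.loc[merged["_merge"] == "left_only", key])


# Made-up stand-ins for well.csv.gz, compound.csv.gz, crispr.csv.gz and orf.csv.gz.
well = pd.DataFrame(
    {"Metadata_JCP2022": ["JCP2022_000001", "JCP2022_000002",
                          "JCP2022_000003", "JCP2022_000001"]}
)
compound = pd.DataFrame({"Metadata_JCP2022": ["JCP2022_000001"]})
crispr = pd.DataFrame({"Metadata_JCP2022": ["JCP2022_000002"]})
orf = pd.DataFrame({"Metadata_JCP2022": ["JCP2022_000004"]})

unmatched = find_unmatched_jcp_ids(well, [compound, crispr, orf])
print(unmatched)
```

With the real files, the four frames would come from `pd.read_csv` as in the session above.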

Niranj Chandrasekaran · Answer 1 · Sat Aug 05 2023 06:43:56 GMT+0800 (China Standard Time)

Hi @hanslovsky, you can treat JCP2022_028373 as JCP2022_UNKNOWN. JCP2022_028373 should have been removed from well.csv.gz. We will fix that. Thanks for catching that!

Philipp Hanslovsky · Answer 2 · Tue Aug 08 2023 00:05:06 GMT+0800 (China Standard Time)

Thank you @niranjchandrasekaran! Should JCP2022_028373 be removed completely from well.csv.gz, or replaced by JCP2022_UNKNOWN?
I can make a PR with the change if that is helpful for you.

Niranj Chandrasekaran · Answer 3 · Tue Aug 08 2023 00:18:45 GMT+0800 (China Standard Time)

Hi @hanslovsky, thanks for offering to do that! We generate these metadata files using a couple of internal scripts, so we want to re-run them to keep everything reproducible. I will keep you posted. For the time being, you can ignore all wells with JCP2022_028373.

Philipp Hanslovsky · Answer 4 · Tue Aug 08 2023 00:22:07 GMT+0800 (China Standard Time)

That makes sense. I will mark JCP2022_028373 as unknown on my side.
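For anyone working from a copy of well.csv.gz that still contains the stray ID, the two workarounds discussed here (relabeling it as JCP2022_UNKNOWN, or ignoring the affected wells) might look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, the `Metadata_Well` column is assumed for illustration, and the rows are made up apart from the two IDs named in this thread.

```python
import pandas as pd

KEY = "Metadata_JCP2022"


def relabel_as_unknown(well, bad_ids, unknown="JCP2022_UNKNOWN"):
    """Return a copy of `well` with every ID in `bad_ids` replaced by `unknown`."""
    out = well.copy()
    out.loc[out[KEY].isin(bad_ids), KEY] = unknown
    return out


def drop_wells(well, bad_ids):
    """Return a copy of `well` without the rows whose ID is in `bad_ids`."""
    return well.loc[~well[KEY].isin(bad_ids)].reset_index(drop=True)


# Made-up rows; `Metadata_Well` is an assumed column used only for illustration.
well = pd.DataFrame(
    {
        "Metadata_Well": ["A01", "A02", "A03"],
        KEY: ["JCP2022_000001", "JCP2022_028373", "JCP2022_UNKNOWN"],
    }
)

relabeled = relabel_as_unknown(well, ["JCP2022_028373"])
filtered = drop_wells(well, ["JCP2022_028373"])
print(relabeled)
print(filtered)
```

Both helpers return new frames and leave the input untouched, so the original table stays available for comparison.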

Niranj Chandrasekaran · Answer 5 · Tue Aug 08 2023 03:03:17 GMT+0800 (China Standard Time)

Hi @hanslovsky, well.csv.gz should now be fixed. Thanks for bringing this to our attention!