Some JCP labels may be missing in compounds.csv.gz
hanslovsky opened this issue
I am currently downloading metadata and images following #72, matching the metadata from load_data_with_illum.parquet with the contents of well.csv.gz to get the JCP2022 ID, and then double-checking that each JCP2022 ID is valid via compound.csv.gz, crispr.csv.gz, and orf.csv.gz. I found that JCP2022_028373 is in well.csv.gz but not in any of compound.csv.gz, crispr.csv.gz, or orf.csv.gz. This is how you can reproduce it:
In [76]: well = pd.read_csv('well.csv.gz')
In [77]: compound = pd.read_csv('compound.csv.gz')
In [78]: crispr = pd.read_csv('crispr.csv.gz')
In [79]: orf = pd.read_csv('orf.csv.gz')
In [80]: all_jcps = pd.concat([x[['Metadata_JCP2022']] for x in [compound, crispr, orf]])
In [81]: joined = well[['Metadata_JCP2022']].drop_duplicates().merge(all_jcps.assign(x=all_jcps.Metadata_JCP2022), 'left')
In [82]: joined[joined.x.isnull()]
Out[82]:
       Metadata_JCP2022    x
83013   JCP2022_UNKNOWN  NaN
135738   JCP2022_028373  NaN
As far as I can tell, all other JCPs in well.csv.gz have a corresponding entry in one of compound.csv.gz, crispr.csv.gz, or orf.csv.gz. I will treat JCP2022_028373 like JCP2022_UNKNOWN. Is there any chance that this missing JCP will be added to the metadata in the future?
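The same check can also be written as a set-membership test instead of a merge. This is a minimal sketch, assuming every table has a Metadata_JCP2022 column as in the session above. The small inline frames are made-up stand-ins for the real CSV files (the TOY_ IDs are invented), and find_unmatched_jcps is a hypothetical helper name, not part of any library.

```python
import pandas as pd


def find_unmatched_jcps(well, perturbation_tables, column="Metadata_JCP2022"):
    """Return the sorted IDs that occur in `well` but in none of the other tables."""
    # Collect every ID known to at least one perturbation table.
    known = pd.concat([table[column] for table in perturbation_tables]).unique()
    # Keep each well ID once and drop missing values before comparing.
    well_ids = well[column].dropna().drop_duplicates()
    return sorted(well_ids[~well_ids.isin(known)])


# Made-up stand-ins for well.csv.gz, compound.csv.gz, crispr.csv.gz, orf.csv.gz.
well = pd.DataFrame(
    {
        "Metadata_JCP2022": [
            "TOY_0001",
            "TOY_0002",
            "TOY_0003",
            "JCP2022_UNKNOWN",
            "JCP2022_028373",
            "TOY_0001",
        ]
    }
)
compound = pd.DataFrame({"Metadata_JCP2022": ["TOY_0001"]})
crispr = pd.DataFrame({"Metadata_JCP2022": ["TOY_0002"]})
orf = pd.DataFrame({"Metadata_JCP2022": ["TOY_0003"]})

unmatched = find_unmatched_jcps(well, [compound, crispr, orf])
print(unmatched)
```

With the real files, the four frames would come from pd.read_csv as in the session above.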
Hi @hanslovsky, you can treat JCP2022_028373 as JCP2022_UNKNOWN. JCP2022_028373 should have been removed from well.csv.gz. We will fix that. Thanks for catching that!
Thank you @niranjchandrasekaran! Should JCP2022_028373 be removed completely from well.csv.gz, or replaced by JCP2022_UNKNOWN?
I can make a PR with the change if that is helpful for you.
Hi @hanslovsky, thanks for offering to do that! We generate these metadata files using a couple of internal scripts, so we want to re-run them to keep everything reproducible. I will keep you posted. For the time being, you can ignore all wells with JCP2022_028373.
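Ignoring those wells can be done with a simple row filter. This is a minimal sketch on a made-up table: the Metadata_Well column and its values are assumptions for illustration, as are the TOY_ IDs. Only Metadata_JCP2022 and the ID JCP2022_028373 come from this thread.

```python
import pandas as pd

STALE_JCP = "JCP2022_028373"

# Made-up stand-in for well.csv.gz; Metadata_Well is an assumed column name.
well = pd.DataFrame(
    {
        "Metadata_Well": ["A01", "A02", "A03"],
        "Metadata_JCP2022": ["TOY_0001", STALE_JCP, "TOY_0002"],
    }
)

# Keep every row whose ID is not the stale one, then renumber the index.
kept = well[well["Metadata_JCP2022"] != STALE_JCP].reset_index(drop=True)
print(kept)
```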
That makes sense. I will mark JCP2022_028373 as unknown on my side.
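Marking the ID as unknown locally can be sketched as a value replacement that keeps the rows in place. The frame below is a made-up stand-in for well.csv.gz (the TOY_ ID is invented); only the column name and the two JCP2022 IDs are taken from this thread.

```python
import pandas as pd

# Made-up stand-in for well.csv.gz.
well = pd.DataFrame(
    {"Metadata_JCP2022": ["TOY_0001", "JCP2022_028373", "JCP2022_UNKNOWN"]}
)

# Return a relabeled copy; the original frame is left untouched.
relabeled = well.assign(
    Metadata_JCP2022=well["Metadata_JCP2022"].replace(
        "JCP2022_028373", "JCP2022_UNKNOWN"
    )
)
print(relabeled)
```

Unlike dropping the rows, this keeps the affected wells available for joins against other tables, labeled the same way as wells that were already unknown.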
Hi @hanslovsky, well.csv.gz should now be fixed. Thanks for bringing this to our attention!