Merging on compounds InChIKey
cuongqn opened this issue
Cuong Nguyen commented
Hello,
We’re trying to match compound InChIKeys across three JUMP metadata tables (JUMP-MOA compound metadata, JUMP-Target-2 compound metadata, and the full JUMP compound metadata in this repo) and observed the following overlaps:
- JUMP-Target-2 and the full JUMP metadata, after selecting only TARGET2 plates, share around 180 InChIKeys (instead of the expected 300 or so)
- JUMP-Target-2 and JUMP-MOA share 18 InChIKeys
Is this behavior expected when merging between the above metadata tables?
Reproducing the behavior
import pandas as pd
# Load JUMP metadata files
well = pd.read_csv("https://github.com/jump-cellpainting/datasets/raw/main/metadata/well.csv.gz")
compound = pd.read_csv("https://github.com/jump-cellpainting/datasets/raw/main/metadata/compound.csv.gz")
plate = pd.read_csv("https://github.com/jump-cellpainting/datasets/raw/main/metadata/plate.csv.gz")
# Load JUMP-MOA and JUMP-Target-2 compound metadata files
compound_moa = pd.read_csv("https://raw.githubusercontent.com/jump-cellpainting/JUMP-MOA/master/JUMP-MOA_compound_metadata.tsv", sep="\t")
compound_target2 = pd.read_csv("https://raw.githubusercontent.com/jump-cellpainting/JUMP-Target/master/JUMP-Target-2_compound_metadata.tsv", sep="\t")
# Merge metadata
metadata = well.merge(compound, on='Metadata_JCP2022', how="left")
metadata = metadata.merge(plate, on=['Metadata_Source', 'Metadata_Plate'])
metadata = metadata[metadata.Metadata_PlateType=="TARGET2"]
# Count unique InChIKeys shared by JUMP-Target-2 and the JUMP compounds on TARGET2 plates
print(len(set(compound_target2.InChIKey.unique()).intersection(set(metadata.Metadata_InChIKey.unique())))) # = 182
# Count unique InChIKeys shared by JUMP-Target-2 and JUMP-MOA
print(len(set(compound_target2.InChIKey.unique()).intersection(set(compound_moa.InChIKey.unique())))) # = 18
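When an overlap comes out smaller than expected, it can help to see both sides of the mismatch rather than only the intersection size. The following is a minimal sketch, not part of the original report: `inchikey_overlap` is a hypothetical helper, and the keys in the toy data are made up rather than real JUMP compounds. It also strips whitespace and drops missing values, on the assumption that either could hide a match.

```python
import pandas as pd


def inchikey_overlap(left: pd.Series, right: pd.Series) -> dict:
    """Count InChIKeys found in both columns, only in left, and only in right."""
    # Drop missing values and surrounding whitespace before comparing
    a = set(left.dropna().astype(str).str.strip())
    b = set(right.dropna().astype(str).str.strip())
    return {
        "both": len(a & b),
        "left_only": len(a - b),
        "right_only": len(b - a),
    }


# Toy data: made-up keys, not real JUMP compounds
toy_target2 = pd.Series(
    ["AAAAAAAAAAAAAA-UHFFFAOYSA-N", "BBBBBBBBBBBBBB-UHFFFAOYSA-N", None]
)
toy_jump = pd.Series(
    ["AAAAAAAAAAAAAA-UHFFFAOYSA-N ", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"]
)

summary = inchikey_overlap(toy_target2, toy_jump)
print(summary)
```

With the real tables, the same helper could be called as `inchikey_overlap(compound_target2.InChIKey, metadata.Metadata_InChIKey)`; a large `left_only` count would point at keys that are absent from the merged metadata rather than at a problem with the merge itself.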
Shantanu Singh commented
This will be fixed once we have released the updated map for Target2 via #80 and #86.
Shantanu Singh commented
I believe this should now be fixed, but please report back if not, @cuongqn.
import pandas as pd
# Load JUMP metadata files
well = pd.read_csv(
    "https://github.com/jump-cellpainting/datasets/raw/main/metadata/well.csv.gz"
)
compound = pd.read_csv(
    "https://github.com/jump-cellpainting/datasets/raw/main/metadata/compound.csv.gz"
)
plate = pd.read_csv(
    "https://github.com/jump-cellpainting/datasets/raw/main/metadata/plate.csv.gz"
)
# Load JUMP-MOA and JUMP-Target-2 compound metadata files
compound_moa = pd.read_csv(
    "https://raw.githubusercontent.com/jump-cellpainting/JUMP-MOA/master/JUMP-MOA_compound_metadata.tsv",
    sep="\t",
)
compound_target2 = pd.read_csv(
    "https://raw.githubusercontent.com/jump-cellpainting/JUMP-Target/master/JUMP-Target-2_compound_metadata.tsv",
    sep="\t",
)
# Merge metadata
metadata = well.merge(compound, on="Metadata_JCP2022", how="left")
metadata = metadata.merge(plate, on=["Metadata_Source", "Metadata_Plate"])
metadata_target2 = metadata[metadata.Metadata_PlateType == "TARGET2"]
# Get intersection of unique InChIKey between JUMP-Target-2 and JUMP compounds after selecting TARGET2 plates
print(
    len(
        set(compound_target2.InChIKey.unique()).intersection(
            set(metadata_target2.Metadata_InChIKey.unique())
        )
    )
) # = 302
# Get intersection of unique InChIKey between JUMP-Target-2 and JUMP-MOA
print(
    len(
        set(compound_target2.InChIKey.unique()).intersection(
            set(compound_moa.InChIKey.unique())
        )
    )
) # = 18
# Get intersection of unique InChIKey between JUMP-MOA and JUMP compounds
print(
    len(
        set(compound_moa.InChIKey.unique()).intersection(
            set(metadata.Metadata_InChIKey.unique())
        )
    )
) # = 76
# Get set diff of unique InChIKey between JUMP-MOA and JUMP compounds
print(
    set(compound_moa.InChIKey.unique()).difference(
        set(metadata.Metadata_InChIKey.unique())
    )
)
# {'GCWIQUVXWZWCLE-UHFFFAOYSA-N', 'XSIOKTWDEOJMGG-UHFFFAOYSA-O', 'AOJQBABIGYNZOY-UHFFFAOYSA-N', 'ODADKLYLWWCHNB-UHFFFAOYSA-N', 'XEVJUIZOZCFECP-UHFFFAOYSA-N', 'UHAXDAKQGVISBZ-UHFFFAOYSA-N', 'XAPVAKKLQGLNOY-UHFFFAOYSA-N', 'XIXXNJFWPAVKFR-UHFFFAOYSA-N'}
# Count the unique InChIKeys in JUMP-MOA that are missing from the JUMP compounds
print(
    len(
        set(compound_moa.InChIKey.unique()).difference(
            set(metadata.Metadata_InChIKey.unique())
        )
    )
) # = 8
# Count the unique InChIKeys in JUMP-Target-2 that are missing from the JUMP compounds
print(
    len(
        set(compound_target2.InChIKey.unique()).difference(
            set(metadata.Metadata_InChIKey.unique())
        )
    )
) # = 0
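One way to look into the eight JUMP-MOA keys that have no exact match is to compare only the first 14 characters of each InChIKey. As far as I know, that first block hashes the connectivity, while the later blocks cover stereochemistry and the final protonation character (one of the eight keys above ends in `-O` rather than `-N`). The sketch below uses made-up keys and a hypothetical helper, `connectivity_block`; whether any of the eight keys actually has such a near match in the JUMP compound table is not something this thread establishes.

```python
import pandas as pd


def connectivity_block(keys: pd.Series) -> pd.Series:
    """Return the first 14 characters of each InChIKey, skipping missing values."""
    return keys.dropna().astype(str).str.strip().str[:14]


# Toy data: made-up keys that differ only in the final (protonation) character
toy_missing = pd.Series(["AAAAAAAAAAAAAA-UHFFFAOYSA-O"])
toy_reference = pd.Series(
    ["AAAAAAAAAAAAAA-UHFFFAOYSA-N", "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N"]
)

# Exact matching finds nothing; first-block matching finds the near match
full_matches = set(toy_missing) & set(toy_reference)
block_matches = set(connectivity_block(toy_missing)) & set(
    connectivity_block(toy_reference)
)
print(len(full_matches), len(block_matches))
```

A first-block match is looser than an exact match, since it can group stereoisomers or differently protonated forms together, so it is better suited to diagnosing mismatches than to serving as a merge key.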