jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium

Merging on compounds InChIKey

cuongqn opened this issue

Hello,

We’re trying to match compound InChIKeys across three JUMP metadata tables (JUMP-MOA compound metadata, JUMP-Target-2 compound metadata, and the full JUMP compound metadata in this repo) and observed the following overlaps:

  • JUMP-Target-2 and full JUMP, after selecting only TARGET2 plates, share around 180 InChIKeys (instead of the expected 300 or so)
  • JUMP-Target-2 and JUMP-MOA share 18 InChIKeys

Is this behavior expected when merging between the above metadata tables?

Reproducing the behavior

import pandas as pd

# Load JUMP metadata files
well = pd.read_csv("https://github.com/jump-cellpainting/datasets/raw/main/metadata/well.csv.gz")
compound = pd.read_csv("https://github.com/jump-cellpainting/datasets/raw/main/metadata/compound.csv.gz")
plate = pd.read_csv("https://github.com/jump-cellpainting/datasets/raw/main/metadata/plate.csv.gz")

# Load JUMP-MOA and JUMP-Target-2 compound metadata files
compound_moa = pd.read_csv("https://raw.githubusercontent.com/jump-cellpainting/JUMP-MOA/master/JUMP-MOA_compound_metadata.tsv", sep="\t")
compound_target2 = pd.read_csv("https://raw.githubusercontent.com/jump-cellpainting/JUMP-Target/master/JUMP-Target-2_compound_metadata.tsv", sep="\t")

# Merge metadata
metadata = well.merge(compound, on='Metadata_JCP2022', how="left")
metadata = metadata.merge(plate, on=['Metadata_Source', 'Metadata_Plate'])
metadata = metadata[metadata.Metadata_PlateType=="TARGET2"]

# Count unique InChIKeys shared by JUMP-Target-2 and JUMP compounds on TARGET2 plates
print(len(set(compound_target2.InChIKey.unique()).intersection(set(metadata.Metadata_InChIKey.unique()))))  # = 182

# Count unique InChIKeys shared by JUMP-Target-2 and JUMP-MOA
print(len(set(compound_target2.InChIKey.unique()).intersection(set(compound_moa.InChIKey.unique()))))  # = 18
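
When exact InChIKey matches come up short like this, one possible diagnostic is to also compare only the first 14-character block of each key, which encodes the molecular connectivity without the stereochemistry and protonation layers. If the block-level overlap is much larger than the exact overlap, the tables probably list the same skeletons under different variants; if both are low, the mapping itself is more likely incomplete. The sketch below is an illustration only: `connectivity_block` and `overlap_counts` are hypothetical helper names, and the keys are made-up placeholders rather than real compounds.

```python
def connectivity_block(inchikey: str) -> str:
    """Return the first hyphen-separated block of an InChIKey (hypothetical helper)."""
    return inchikey.strip().upper().split("-")[0]


def overlap_counts(keys_a, keys_b):
    """Return (exact overlap, connectivity-block overlap) for two key collections."""
    # Skip non-string entries such as NaN left behind by a left merge.
    a = {k.strip().upper() for k in keys_a if isinstance(k, str)}
    b = {k.strip().upper() for k in keys_b if isinstance(k, str)}
    exact = len(a & b)
    blocks_a = {connectivity_block(k) for k in a}
    blocks_b = {connectivity_block(k) for k in b}
    return exact, len(blocks_a & blocks_b)


# Made-up placeholder keys: the second pair differs only in the second block.
keys_a = ["AAAAAAAAAAAAAA-UHFFFAOYSA-N", "BBBBBBBBBBBBBB-UHFFFAOYSA-N"]
keys_b = ["AAAAAAAAAAAAAA-UHFFFAOYSA-N", "BBBBBBBBBBBBBB-XXXXXXXXSA-N"]

exact, by_block = overlap_counts(keys_a, keys_b)
print(exact, by_block)
```

With the real tables, the same call could be made as `overlap_counts(compound_target2.InChIKey, metadata.Metadata_InChIKey)`.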

Thanks for reporting

This will be fixed once we have released the updated map for Target2 via #80 and #86

I believe this should be fixed now, but please report back if not, @cuongqn

import pandas as pd

# Load JUMP metadata files
well = pd.read_csv(
    "https://github.com/jump-cellpainting/datasets/raw/main/metadata/well.csv.gz"
)
compound = pd.read_csv(
    "https://github.com/jump-cellpainting/datasets/raw/main/metadata/compound.csv.gz"
)
plate = pd.read_csv(
    "https://github.com/jump-cellpainting/datasets/raw/main/metadata/plate.csv.gz"
)

# Load JUMP-MOA and JUMP-Target-2 compound metadata files
compound_moa = pd.read_csv(
    "https://raw.githubusercontent.com/jump-cellpainting/JUMP-MOA/master/JUMP-MOA_compound_metadata.tsv",
    sep="\t",
)
compound_target2 = pd.read_csv(
    "https://raw.githubusercontent.com/jump-cellpainting/JUMP-Target/master/JUMP-Target-2_compound_metadata.tsv",
    sep="\t",
)

# Merge metadata
metadata = well.merge(compound, on="Metadata_JCP2022", how="left")
metadata = metadata.merge(plate, on=["Metadata_Source", "Metadata_Plate"])
metadata_target2 = metadata[metadata.Metadata_PlateType == "TARGET2"]

# Get intersection of unique InChIKey between JUMP-Target-2 and JUMP compounds after selecting TARGET2 plates
print(
    len(
        set(compound_target2.InChIKey.unique()).intersection(
            set(metadata_target2.Metadata_InChIKey.unique())
        )
    )
)  # = 302

# Get intersection of unique InChIKey between JUMP-Target-2 and JUMP-MOA
print(
    len(
        set(compound_target2.InChIKey.unique()).intersection(
            set(compound_moa.InChIKey.unique())
        )
    )
)  # = 18

# Get intersection of unique InChIKey between JUMP-MOA and JUMP compounds
print(
    len(
        set(compound_moa.InChIKey.unique()).intersection(
            set(metadata.Metadata_InChIKey.unique())
        )
    )
)  # = 76

# Get set diff of unique InChIKey between JUMP-MOA and JUMP compounds
print(
    set(compound_moa.InChIKey.unique()).difference(
        set(metadata.Metadata_InChIKey.unique())
    )
)
# {'GCWIQUVXWZWCLE-UHFFFAOYSA-N', 'XSIOKTWDEOJMGG-UHFFFAOYSA-O', 'AOJQBABIGYNZOY-UHFFFAOYSA-N', 'ODADKLYLWWCHNB-UHFFFAOYSA-N', 'XEVJUIZOZCFECP-UHFFFAOYSA-N', 'UHAXDAKQGVISBZ-UHFFFAOYSA-N', 'XAPVAKKLQGLNOY-UHFFFAOYSA-N', 'XIXXNJFWPAVKFR-UHFFFAOYSA-N'}

# Count unique InChIKeys in JUMP-MOA that are missing from JUMP compounds
print(
    len(
        set(compound_moa.InChIKey.unique()).difference(
            set(metadata.Metadata_InChIKey.unique())
        )
    )
)  # = 8

# Count unique InChIKeys in JUMP-Target-2 that are missing from JUMP compounds
print(
    len(
        set(compound_target2.InChIKey.unique()).difference(
            set(metadata.Metadata_InChIKey.unique())
        )
    )
)  # = 0
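
The same overlap and set-difference checks can also be done in a single pass with an outer merge and `indicator=True`, which labels each key as `both`, `left_only`, or `right_only`. This is only a sketch on toy stand-in tables (the keys below are made up); the column names `InChIKey` and `Metadata_InChIKey` follow the snippets above.

```python
import pandas as pd

# Toy stand-ins for the real tables; the keys are made up for illustration.
compound_moa = pd.DataFrame({"InChIKey": ["AAA", "BBB", "CCC", "CCC"]})
metadata = pd.DataFrame({"Metadata_InChIKey": ["BBB", "CCC", "DDD"]})

# Deduplicate first so each key is counted once.
left = compound_moa[["InChIKey"]].drop_duplicates()
right = metadata[["Metadata_InChIKey"]].drop_duplicates()

# The outer merge keeps unmatched keys from both sides;
# the "_merge" column records where each key was found.
merged = left.merge(
    right,
    left_on="InChIKey",
    right_on="Metadata_InChIKey",
    how="outer",
    indicator=True,
)

counts = merged["_merge"].value_counts()
missing = merged.loc[merged["_merge"] == "left_only", "InChIKey"].tolist()

print(counts.to_dict())
print(missing)  # keys present in the left table only
```

Here `counts["both"]` corresponds to the intersection size and `missing` to the set difference printed above.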