jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium

Resolve inconsistencies in Target2 Compound InChIKeys

FrenkT opened this issue

Hi all,

As a follow-up to #77, I have been trying to map the compound identifiers in the Target-2 plate map and metadata to the compound identifiers provided for Target-2 plates in the JUMP metadata files.
As a result, I found 36 (out of 384) wells for which the compound in the JUMP metadata doesn't match the Target-2 metadata:

Well InChIKey (expected) InChIKey (found)
A03 KRGQEOSDQHTZMX-IGCYCDGOSA-N LPYXWGMUVRGUOY-UHFFFAOYSA-N
A06 ODHCTXKNWHHXJC-VKHMYHEASA-N GUUGZPSUOTWOMD-UHFFFAOYSA-N
A12 NSFFHOGKXHRQEW-DVRIZHICSA-N UTBOEBCWXGDOGI-UHFFFAOYSA-N
B01 LLPBUXODFQZPFH-UHFFFAOYSA-N AJVXVYTVAAWZAP-UHFFFAOYSA-N
B05 CVOUSAVHMDXCKG-UHFFFAOYSA-N ROBYKNONIPZMTK-UHFFFAOYSA-N
B24 QTQAWLPCGQOSGP-PHLMVCJGSA-N HGMSUJCQIUFZBJ-UHFFFAOYSA-N
C13 CXJCGSPAPOTTSF-VURMDHGXSA-N DXZRBHUCOHBAHP-UHFFFAOYSA-N
C24 HTIQEAQVCYTUBX-UHFFFAOYSA-N YMDXSGBNCBQYGC-UHFFFAOYSA-N
D02 LXENKEWVEVKKGV-BQYQJAHWSA-N VSVFLGPUZJTBSD-UHFFFAOYSA-N
D08 BMKPVDQDJQWBPD-UHFFFAOYSA-N DUKQPWDVIZDABV-UHFFFAOYSA-N
E04 RZTAMFZIAATZDJ-UHFFFAOYSA-N HAQDEJPEAKWAAM-UHFFFAOYSA-N
F18 KOCVKGYKBLJEPK-LYBHJNIJSA-N WRLVHADVOGFZOZ-UHFFFAOYSA-N
F23 KVWDHTXUZHCGIO-UHFFFAOYSA-N WXPNDRBBWZMPQG-UHFFFAOYSA-N
G05 WBGKWQHBNHJJPZ-LECWWXJVSA-N ZMUSCGJNJYXJBP-UHFFFAOYSA-N
G06 POJZIZBONPAWIV-UHFFFAOYSA-N GQXSULRYFDAMOO-UHFFFAOYSA-N
G15 VYMDGNCVAMGZFE-UHFFFAOYSA-N PKYKNPLSFOKASK-UHFFFAOYSA-N
H24 UREBDLICKHMUKA-QCYOSJOCSA-N GJFCONYVAUNLKB-UHFFFAOYSA-N
I06 KGPGQDLTDHGEGT-SZUNQUCBSA-N LQERMDXPGNOJCT-UHFFFAOYSA-N
I18 VDJHFHXMUKFKET-WDUFCVPESA-N HULPONUAINYLQQ-UHFFFAOYSA-N
J02 NSFFHOGKXHRQEW-AIHSUZKVSA-N UTBOEBCWXGDOGI-UHFFFAOYSA-N
J07 XKFTZKGMDDZMJI-HSZRJFAPSA-N KRBSMMVJJVHVCB-UHFFFAOYSA-N
J14 XKFTZKGMDDZMJI-HSZRJFAPSA-N KRBSMMVJJVHVCB-UHFFFAOYSA-N
K02 UREBDLICKHMUKA-CXSFZGCWSA-N GJFCONYVAUNLKB-UHFFFAOYSA-N
K05 GIUYCYHIANZCFB-FJFJXFQQSA-N CAOWNCTTWGSKDO-UHFFFAOYSA-N
K13 SJFBTAPEPRWNKH-CCKFTAQKSA-N XUZQTIZWMHMWOC-UHFFFAOYSA-N
L06 OHRURASPPZQGQM-GCCNXGTGSA-N SOOPLNPQGWJZHY-UHFFFAOYSA-N
L11 NHFDRBXTEDBWCZ-ZROIWOOFSA-N GMROZDPZEUVIGD-UHFFFAOYSA-N
N09 MBGGBVCUIVRRBF-UHFFFAOYSA-N AUMHDRMJJNZTPB-UHFFFAOYSA-N
N14 HTSLEZOTMYUPLU-UHFFFAOYSA-N AGNWVEJTZJIJIM-UHFFFAOYSA-N
O10 DEQANNDTNATYII-UHFFFAOYSA-N JDKKNQACNITFEA-UHFFFAOYSA-N
O14 FAIIFDPAEUKBEP-UHFFFAOYSA-N KJWGEXJCWCYEMI-UHFFFAOYSA-N
P01 AOZPVMOOEJAZGK-UHFFFAOYSA-N UXUQIRNFBFRPAC-UHFFFAOYSA-N
P03 HFPLHASLIOXVGS-UHFFFAOYSA-N CANBMWXJDLUDFF-UHFFFAOYSA-N
P12 UIAGMCDKSXEBJQ-UHFFFAOYSA-N SVMHYHIZWOJKDL-UHFFFAOYSA-N
P18 ZDXUKAKRHYTAKV-UHFFFAOYSA-N PHOGQKDIVUJGMJ-UHFFFAOYSA-N
P23 YYDUWLSETXNJJT-MTJSOVHGSA-N LNFZRMDSZJCZTG-UHFFFAOYSA-N

As you can see, the first block of the InChIKey (the 14 characters before the first hyphen) is different, so the mismatch shouldn't be due to just missing stereochemical information.
Note that each row of the table applies to all of the TARGET2 plates described in the metadata files, with two exceptions: the plates from source_9 (I excluded these from my analysis code because they have a 1536-well layout and I wanted to keep things simple, see #77) and plate CP1-SC2-25 from source_7 (excluded because the plate seems to have a mirrored layout, see #77). So I ran this check on 131 plates, and for all of them I find the differences described in the table above.
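
The block-level comparison described above can be sketched in a few lines of plain Python. This is only an illustration, not the analysis code used for the report: the function names `split_inchikey` and `classify_mismatch` are hypothetical, and the three pairs are copied from the table. It relies on the standard InChIKey layout of a 14-character first block, a 10-character second block, and a single trailing character.

```python
def split_inchikey(key):
    """Split a standard InChIKey into its three hyphen-separated blocks."""
    blocks = key.strip().split("-")
    if [len(block) for block in blocks] != [14, 10, 1]:
        raise ValueError(f"not a standard InChIKey: {key!r}")
    return blocks


def classify_mismatch(expected, found):
    """Report whether two InChIKeys agree and, if not, whether the first block differs."""
    expected_blocks = split_inchikey(expected)
    found_blocks = split_inchikey(found)
    if expected_blocks == found_blocks:
        return "identical"
    if expected_blocks[0] == found_blocks[0]:
        # Same first block: the keys differ only in the later blocks.
        return "same_first_block"
    return "different_first_block"


# Three (well, expected, found) rows copied from the table above.
rows = [
    ("A03", "KRGQEOSDQHTZMX-IGCYCDGOSA-N", "LPYXWGMUVRGUOY-UHFFFAOYSA-N"),
    ("A06", "ODHCTXKNWHHXJC-VKHMYHEASA-N", "GUUGZPSUOTWOMD-UHFFFAOYSA-N"),
    ("B01", "LLPBUXODFQZPFH-UHFFFAOYSA-N", "AJVXVYTVAAWZAP-UHFFFAOYSA-N"),
]
report = {well: classify_mismatch(expected, found) for well, expected, found in rows}
print(report)
```

A pair that only lost stereochemical information would be expected to land in `same_first_block`; all three sample rows land in `different_first_block`, which is the observation that prompted the question below.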

Any idea on whether the compounds used in the experiments are actually different, as suggested by the InChIKeys, or whether there is some issue in the metadata files provided in this repo?

Hi @FrenkT,

Thank you for the detailed report and for bringing this to our attention. I can confirm that the compounds are not different, which would mean that the metadata provided is not correct. I will look into this, talk to others who know more about how the metadata was generated, and get back to you.

@srijitseal will post more insights later, but for now, it appears that running StandardizeMolecule.py on JUMP-Target-2_compound_metadata.tsv resolves the discrepancies.

However, it does add a new discrepancy – we now have 301 unique entries, not 303, in the Target2 set.

That's because, in addition to the "duplicates" of BVT-948, dexamethasone, and thiostrepton (noted in jump-cellpainting/JUMP-Target#9 (comment)), we also note "duplicates" of ME-0328 and quinidine/quinine:

https://chat.openai.com/share/fa71ea33-0bc7-4699-b03a-a0ba41353164

                                   SMILES_standardized     pert_iname  
22                CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21        BVT-948  
266               CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21        BVT-948  
85           CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1        ME-0328  
146          CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1        ME-0328  
150  CC1CC2C3CC=C4CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O...  dexamethasone  
191  CC1CC2C3CC=C4CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O...  dexamethasone  
143              C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12      quinidine  
1                C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12        quinine  
10   C=C(NC(=O)C(=C)NC(=O)c1csc(C2=NC3c4csc(n4)C4NC...   thiostrepton  
171  C=C(NC(=O)C(=C)NC(=O)c1csc(C2=NC3c4csc(n4)C4NC...   thiostrepton 
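
One way to produce a listing like the one above is to group rows by their standardized SMILES and keep the groups with more than one member. The following is a minimal sketch using only the rows whose SMILES are not truncated in the output; the helper name `find_shared_structures` is hypothetical, and the actual listing was presumably produced with pandas rather than this code.

```python
from collections import defaultdict

# (row index, SMILES_standardized, pert_iname) copied from the listing above;
# rows whose SMILES are truncated with "..." are deliberately left out.
records = [
    (22, "CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21", "BVT-948"),
    (266, "CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21", "BVT-948"),
    (85, "CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1", "ME-0328"),
    (146, "CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1", "ME-0328"),
    (143, "C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12", "quinidine"),
    (1, "C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12", "quinine"),
]


def find_shared_structures(rows):
    """Map each standardized SMILES that occurs more than once to its rows."""
    groups = defaultdict(list)
    for index, smiles, name in rows:
        groups[smiles].append((index, name))
    return {smiles: members for smiles, members in groups.items() if len(members) > 1}


shared = find_shared_structures(records)
for smiles, members in shared.items():
    names = sorted({name for _, name in members})
    # One name: a repeated entry. Several names: distinct compounds that
    # collapse to the same string after standardization.
    kind = "repeated entry" if len(names) == 1 else "name collision"
    print(f"{kind}: {', '.join(names)} -> {smiles}")
```

Separating repeated entries from name collisions matters here because the quinidine/quinine pair is the only group in which two different names share one standardized string.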

In any case, this should keep you going for now @FrenkT


Updates

Per @srijitseal,

  • Quinine and quinidine are stereoisomers; they may thus have different biological activity, so it is wise to keep them separate given that we have a phenotypic readout available (vs. just structure alone, in which case it may make sense to ignore stereochemistry). Conclusion: We should treat quinidine and quinine as separate entities in this dataset
  • The two occurrences of thiostrepton: TBD

Thank you for looking into this @shntnu. Running the molecule standardisation script is definitely helpful.

When addressing the inconsistencies between SMILES representations from the two sources, we found that the discrepancies can be resolved by considering tautomerization and enantiomerization. Accounting for these makes it possible to reconcile the differences in SMILES strings between the Target2 metadata (which did not include this step) and JUMPCP (which tried to show the lowest-energy tautomer).

There is a new inconsistency, however.
This compound is geldanamycin: https://en.wikipedia.org/wiki/Geldanamycin

Converting SMILES using cheminformatics tools like RDKit can sometimes produce forms that differ from those listed on resources like Wikipedia. This can happen for various reasons, including tautomerization, enantiomerization, loss of stereochemical (E/Z) information, or other subtleties in how different software packages handle chemical structure standardization and normalization.

Image

We also found that there are 5 compounds with duplicate entries, hence 301 and not 306 unique compounds. After standardization, we don't see this problem anymore. In the original Target2 dataset, they seemed different for the following reasons.

1.

Image

The pair of compounds appears to be tautomers.

2.

Image

This one differs at one stereocenter (stated vs. not shown).

3.

Image

ME-0328 does indeed seem to be a duplicate, where one of the repeats doesn't have stereochemistry information.

4.

Image

Quinine and quinidine are stereoisomers, so we should keep both as "unique". They have different biological signals, so there is no data leak when using Cell Painting, but caution is advised when using ECFP, which would leak data because it cannot differentiate stereochemistry.

5.

Image

For thiostrepton, the connectivity is the same; it seems the stereochemistry is different.
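
The leakage caution in item 4 can be illustrated with a group-aware split: compounds that a stereo-blind representation cannot tell apart are kept on the same side of a train/test split. This is a sketch under stated assumptions, not something the thread prescribes. The function `grouped_split` and its parameters are hypothetical, and the grouping key is the stereo-free standardized SMILES from the listing earlier in the thread.

```python
import random
from collections import defaultdict

# (pert_iname, stereo-free standardized SMILES) taken from the listing above.
compounds = [
    ("quinine", "C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12"),
    ("quinidine", "C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12"),
    ("BVT-948", "CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21"),
    ("ME-0328", "CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1"),
]


def grouped_split(items, test_fraction=0.5, seed=0):
    """Split names into train/test so that items sharing a key stay together."""
    groups = defaultdict(list)
    for name, key in items:
        groups[key].append(name)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    # Keep at least one group on each side whenever there are two or more groups.
    n_test = min(len(keys) - 1, max(1, round(len(keys) * test_fraction)))
    test_keys = set(keys[:n_test])
    train = [name for key in keys if key not in test_keys for name in groups[key]]
    test = [name for key in keys if key in test_keys for name in groups[key]]
    return train, test


train, test = grouped_split(compounds)
print("train:", train)
print("test:", test)
```

With Cell Painting profiles as features the two stereoisomers can reasonably be treated as separate samples, as concluded above; the grouping only becomes relevant when the features are computed from a representation that ignores stereochemistry.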

@FrenkT I believe we have finally resolved everything here

Please see jump-cellpainting/JUMP-Target#32