Resolve inconsistencies in Target2 Compound InChIKeys

Question

FrenkT opened this issue a year ago

Hi all,

As a follow-up to #77, I have been trying to map the compound identifiers in the Target-2 plate map and metadata to the compound identifiers provided for Target-2 plates in the JUMP metadata files.
I found 36 (out of 384) wells for which the compound in the JUMP metadata doesn't match the Target-2 metadata:

Well	InChIKey Expected	InChIKey Found
A03	KRGQEOSDQHTZMX-IGCYCDGOSA-N	LPYXWGMUVRGUOY-UHFFFAOYSA-N
A06	ODHCTXKNWHHXJC-VKHMYHEASA-N	GUUGZPSUOTWOMD-UHFFFAOYSA-N
A12	NSFFHOGKXHRQEW-DVRIZHICSA-N	UTBOEBCWXGDOGI-UHFFFAOYSA-N
B01	LLPBUXODFQZPFH-UHFFFAOYSA-N	AJVXVYTVAAWZAP-UHFFFAOYSA-N
B05	CVOUSAVHMDXCKG-UHFFFAOYSA-N	ROBYKNONIPZMTK-UHFFFAOYSA-N
B24	QTQAWLPCGQOSGP-PHLMVCJGSA-N	HGMSUJCQIUFZBJ-UHFFFAOYSA-N
C13	CXJCGSPAPOTTSF-VURMDHGXSA-N	DXZRBHUCOHBAHP-UHFFFAOYSA-N
C24	HTIQEAQVCYTUBX-UHFFFAOYSA-N	YMDXSGBNCBQYGC-UHFFFAOYSA-N
D02	LXENKEWVEVKKGV-BQYQJAHWSA-N	VSVFLGPUZJTBSD-UHFFFAOYSA-N
D08	BMKPVDQDJQWBPD-UHFFFAOYSA-N	DUKQPWDVIZDABV-UHFFFAOYSA-N
E04	RZTAMFZIAATZDJ-UHFFFAOYSA-N	HAQDEJPEAKWAAM-UHFFFAOYSA-N
F18	KOCVKGYKBLJEPK-LYBHJNIJSA-N	WRLVHADVOGFZOZ-UHFFFAOYSA-N
F23	KVWDHTXUZHCGIO-UHFFFAOYSA-N	WXPNDRBBWZMPQG-UHFFFAOYSA-N
G05	WBGKWQHBNHJJPZ-LECWWXJVSA-N	ZMUSCGJNJYXJBP-UHFFFAOYSA-N
G06	POJZIZBONPAWIV-UHFFFAOYSA-N	GQXSULRYFDAMOO-UHFFFAOYSA-N
G15	VYMDGNCVAMGZFE-UHFFFAOYSA-N	PKYKNPLSFOKASK-UHFFFAOYSA-N
H24	UREBDLICKHMUKA-QCYOSJOCSA-N	GJFCONYVAUNLKB-UHFFFAOYSA-N
I06	KGPGQDLTDHGEGT-SZUNQUCBSA-N	LQERMDXPGNOJCT-UHFFFAOYSA-N
I18	VDJHFHXMUKFKET-WDUFCVPESA-N	HULPONUAINYLQQ-UHFFFAOYSA-N
J02	NSFFHOGKXHRQEW-AIHSUZKVSA-N	UTBOEBCWXGDOGI-UHFFFAOYSA-N
J07	XKFTZKGMDDZMJI-HSZRJFAPSA-N	KRBSMMVJJVHVCB-UHFFFAOYSA-N
J14	XKFTZKGMDDZMJI-HSZRJFAPSA-N	KRBSMMVJJVHVCB-UHFFFAOYSA-N
K02	UREBDLICKHMUKA-CXSFZGCWSA-N	GJFCONYVAUNLKB-UHFFFAOYSA-N
K05	GIUYCYHIANZCFB-FJFJXFQQSA-N	CAOWNCTTWGSKDO-UHFFFAOYSA-N
K13	SJFBTAPEPRWNKH-CCKFTAQKSA-N	XUZQTIZWMHMWOC-UHFFFAOYSA-N
L06	OHRURASPPZQGQM-GCCNXGTGSA-N	SOOPLNPQGWJZHY-UHFFFAOYSA-N
L11	NHFDRBXTEDBWCZ-ZROIWOOFSA-N	GMROZDPZEUVIGD-UHFFFAOYSA-N
N09	MBGGBVCUIVRRBF-UHFFFAOYSA-N	AUMHDRMJJNZTPB-UHFFFAOYSA-N
N14	HTSLEZOTMYUPLU-UHFFFAOYSA-N	AGNWVEJTZJIJIM-UHFFFAOYSA-N
O10	DEQANNDTNATYII-UHFFFAOYSA-N	JDKKNQACNITFEA-UHFFFAOYSA-N
O14	FAIIFDPAEUKBEP-UHFFFAOYSA-N	KJWGEXJCWCYEMI-UHFFFAOYSA-N
P01	AOZPVMOOEJAZGK-UHFFFAOYSA-N	UXUQIRNFBFRPAC-UHFFFAOYSA-N
P03	HFPLHASLIOXVGS-UHFFFAOYSA-N	CANBMWXJDLUDFF-UHFFFAOYSA-N
P12	UIAGMCDKSXEBJQ-UHFFFAOYSA-N	SVMHYHIZWOJKDL-UHFFFAOYSA-N
P18	ZDXUKAKRHYTAKV-UHFFFAOYSA-N	PHOGQKDIVUJGMJ-UHFFFAOYSA-N
P23	YYDUWLSETXNJJT-MTJSOVHGSA-N	LNFZRMDSZJCZTG-UHFFFAOYSA-N

As you can see, the first block of the InChIKey (the 14-character hash of the molecular skeleton) differs in every row, so the mismatch cannot be explained by missing stereochemical information alone.
Note that each row of the table applies to all of the TARGET2 plates described in the metadata files, with two exceptions: plates from source_9, which I excluded from my analysis code because they have a 1536-well layout and I wanted to keep things simple (see #77), and plate CP1-SC2-25 from source_7, which seems to have a mirrored layout (see #77). So I ran this check on 131 plates, and all of them show the differences listed in the table above.
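To make the block comparison concrete, here is a minimal sketch that splits an InChIKey into its three hyphen-separated blocks and reports where two keys first diverge. The helper names `split_inchikey` and `classify_mismatch` are hypothetical and are not part of the analysis code mentioned above; the keys are copied from rows A03, B01, A12, and J02 of the table.

```python
def split_inchikey(key: str) -> tuple[str, str, str]:
    """Split an InChIKey into its three hyphen-separated blocks.

    Block 1 (14 chars) hashes the molecular skeleton, block 2 (10 chars)
    covers the remaining layers such as stereochemistry, and block 3
    (1 char) is the protonation flag.
    """
    blocks = key.strip().split("-")
    if [len(b) for b in blocks] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return blocks[0], blocks[1], blocks[2]


def classify_mismatch(expected: str, found: str) -> str:
    """Report whether two InChIKeys match and, if not, where they diverge."""
    if expected == found:
        return "match"
    expected_skeleton, _, _ = split_inchikey(expected)
    found_skeleton, _, _ = split_inchikey(found)
    if expected_skeleton != found_skeleton:
        return "first-block mismatch"
    return "later-block mismatch"


# Rows A03 and B01 from the table: expected vs. found key.
print(classify_mismatch("KRGQEOSDQHTZMX-IGCYCDGOSA-N", "LPYXWGMUVRGUOY-UHFFFAOYSA-N"))
print(classify_mismatch("LLPBUXODFQZPFH-UHFFFAOYSA-N", "AJVXVYTVAAWZAP-UHFFFAOYSA-N"))
# The expected keys of rows A12 and J02 share a first block and differ only later.
print(classify_mismatch("NSFFHOGKXHRQEW-DVRIZHICSA-N", "NSFFHOGKXHRQEW-AIHSUZKVSA-N"))
```

A first-block mismatch rules out a purely stereochemical difference, but it does not by itself prove that the compounds differ: two tautomeric drawings of one compound can also produce different first blocks.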

Any idea on whether the compounds used in the experiments are actually different, as suggested by the InChIKeys, or whether there is some issue in the metadata files provided in this repo?

Niranj Chandrasekaran · Answer 1 · Fri Aug 25 2023 03:29:55 GMT+0800 (China Standard Time)

Hi @FrenkT,

Thank you for the detailed report and bringing this to our attention. I can confirm that the compounds are not different, which would mean that the metadata provided is not correct. I will look into this and talk to others who know more about how the metadata was generated and get back to you.

Shantanu Singh · Answer 2 · Wed Nov 01 2023 03:41:00 GMT+0800 (China Standard Time)

@srijitseal will post more insights later, but for now, it appears that running StandardizeMolecule.py on JUMP-Target-2_compound_metadata.tsv resolves the discrepancies.

However, it does add a new discrepancy – we now have 301 unique entries, not 303, in the Target2 set.

That's because, in addition to the "duplicates" of BVT-948, dexamethasone, and thiostrepton (noted in jump-cellpainting/JUMP-Target#9 (comment)), we also note "duplicates" of ME-0328 and quinidine/quinine.

https://chat.openai.com/share/fa71ea33-0bc7-4699-b03a-a0ba41353164

                                   SMILES_standardized     pert_iname  
22                CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21        BVT-948  
266               CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21        BVT-948  
85           CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1        ME-0328  
146          CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1        ME-0328  
150  CC1CC2C3CC=C4CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O...  dexamethasone  
191  CC1CC2C3CC=C4CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O...  dexamethasone  
143              C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12      quinidine  
1                C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12        quinine  
10   C=C(NC(=O)C(=C)NC(=O)c1csc(C2=NC3c4csc(n4)C4NC...   thiostrepton  
171  C=C(NC(=O)C(=C)NC(=O)c1csc(C2=NC3c4csc(n4)C4NC...   thiostrepton
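A listing like the one above can be reproduced by grouping rows on their standardized SMILES and keeping the groups with more than one member. The following is a minimal standard-library sketch, not the code that produced the listing; `find_duplicate_structures` is a hypothetical helper. The rows are copied from the listing, and the dexamethasone and thiostrepton rows are left out because their SMILES are truncated there.

```python
from collections import defaultdict

# (row index, SMILES_standardized, pert_iname), copied from the listing above.
# Rows whose SMILES is truncated with "..." are deliberately omitted.
rows = [
    (22, "CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21", "BVT-948"),
    (266, "CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21", "BVT-948"),
    (85, "CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1", "ME-0328"),
    (146, "CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1", "ME-0328"),
    (143, "C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12", "quinidine"),
    (1, "C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12", "quinine"),
]


def find_duplicate_structures(rows):
    """Map each standardized SMILES that occurs more than once to its rows."""
    by_smiles = defaultdict(list)
    for index, smiles, name in rows:
        by_smiles[smiles].append((index, name))
    return {s: entries for s, entries in by_smiles.items() if len(entries) > 1}


duplicates = find_duplicate_structures(rows)
for smiles, entries in duplicates.items():
    names = sorted({name for _, name in entries})
    kind = "same name" if len(names) == 1 else "different names"
    print(f"{smiles}: rows {[i for i, _ in entries]} ({kind}: {', '.join(names)})")
```

Separating "same name" from "different names" is useful here because the quinidine/quinine pair collapses only after stereochemistry is removed, whereas the BVT-948 and ME-0328 pairs already share a name.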

In any case, this should keep you going for now, @FrenkT.

Updates

Per @srijitseal,

Quinine and quinidine are stereoisomers, so they may have different biological activity. Because we have a phenotypic readout available, it is wise to keep them separate (with structure alone, it might make sense to ignore stereochemistry). Conclusion: we should treat quinidine and quinine as separate entities in this dataset.
The two occurrences of thiostrepton: TBD

Francesco Tuveri · Answer 3 · Wed Feb 14 2024 19:20:26 GMT+0800 (China Standard Time)

Thank you for looking into this @shntnu. Running the molecule standardisation script is definitely helpful.

Srijit Seal · Answer 4 · Mon Mar 18 2024 12:18:37 GMT+0800 (China Standard Time)

When comparing the SMILES representations from the two sources, we found that the discrepancies can be resolved by accounting for tautomerization and enantiomerization. Doing so reconciles the SMILES strings in the Target2 metadata (which did not apply this step) with those in JUMPCP (which tried to show the lowest-energy tautomer).

Srijit Seal · Answer 5 · Mon Mar 18 2024 12:23:39 GMT+0800 (China Standard Time)

There is a new inconsistency, however.
The compound in question is geldanamycin: https://en.wikipedia.org/wiki/Geldanamycin

Converting SMILES with cheminformatics tools such as RDKit can sometimes yield forms that differ from those listed on resources like Wikipedia. Possible reasons include tautomerization, enantiomerization, loss of stereochemical (E/Z) information, or other subtleties in how different software packages standardize and normalize chemical structures.

Srijit Seal · Answer 6 · Mon Mar 18 2024 12:26:18 GMT+0800 (China Standard Time)

We also found there are 5 compounds with duplicate entries, hence 301 and not 306 unique compounds. After standardization, we don't see this problem anymore. In the original Target2 dataset, they seemed different for the following reasons.

1. It seems like the pair of compounds are tautomers.

2. This one has a difference at one stereocenter (stated vs. not shown).

3. ME-0328 does seem to be a duplicate, where one of the repeats lacks stereochemistry information.

4. Quinine and quinidine are isomers, so we should keep both as "unique". They have different biological signals, so there is no data leak when using Cell Painting, but caution is advised when using ECFP: it would leak data because it can't differentiate stereochemistry.

5. For thiostrepton, the connectivity is the same; it seems the stereochemistry is different.
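One way to act on the quinine/quinidine caution, offered as a sketch rather than anything the thread prescribes, is to assign train/test folds by a stereo-free structure key so that stereoisomers always land in the same fold. The function `assign_fold` and the five-fold setup are hypothetical; the group key is the standardized SMILES from the listing in Answer 2, which is identical for quinine and quinidine.

```python
import hashlib


def assign_fold(group_key: str, n_folds: int = 5) -> int:
    """Deterministically map a group key to a fold in [0, n_folds)."""
    digest = hashlib.sha256(group_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


# Stereo-free standardized SMILES copied from the listing in Answer 2.
compounds = {
    "quinine": "C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12",
    "quinidine": "C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12",
    "ME-0328": "CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1",
    "BVT-948": "CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21",
}

# Compounds that share a key share a fold, so a stereo-blind fingerprint
# never sees one isomer in training and the other in testing.
folds = {name: assign_fold(smiles) for name, smiles in compounds.items()}
for name, fold in folds.items():
    print(f"{name}: fold {fold}")
```

Hashing the key rather than shuffling keeps the assignment reproducible across runs and machines; the compounds themselves stay separate entities, and only the split treats them as one group.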

Shantanu Singh · Answer 7 · Thu Apr 04 2024 08:20:39 GMT+0800 (China Standard Time)

@FrenkT I believe we have finally resolved everything here.

Please see jump-cellpainting/JUMP-Target#32