Resolve inconsistencies in Target2 Compound InChIKeys
FrenkT opened this issue · comments
Hi all,
As a follow-up to #77, I have been trying to map the compound identifiers in the Target-2 plate map and metadata to the compound identifiers provided for Target-2 plates in the JUMP metadata files.
As a result, I found 36 (out of 384) wells for which the compound in the JUMP metadata doesn't match the Target-2 metadata:
Well | InChIKey Expected | InChIKey Found |
---|---|---|
A03 | KRGQEOSDQHTZMX-IGCYCDGOSA-N | LPYXWGMUVRGUOY-UHFFFAOYSA-N |
A06 | ODHCTXKNWHHXJC-VKHMYHEASA-N | GUUGZPSUOTWOMD-UHFFFAOYSA-N |
A12 | NSFFHOGKXHRQEW-DVRIZHICSA-N | UTBOEBCWXGDOGI-UHFFFAOYSA-N |
B01 | LLPBUXODFQZPFH-UHFFFAOYSA-N | AJVXVYTVAAWZAP-UHFFFAOYSA-N |
B05 | CVOUSAVHMDXCKG-UHFFFAOYSA-N | ROBYKNONIPZMTK-UHFFFAOYSA-N |
B24 | QTQAWLPCGQOSGP-PHLMVCJGSA-N | HGMSUJCQIUFZBJ-UHFFFAOYSA-N |
C13 | CXJCGSPAPOTTSF-VURMDHGXSA-N | DXZRBHUCOHBAHP-UHFFFAOYSA-N |
C24 | HTIQEAQVCYTUBX-UHFFFAOYSA-N | YMDXSGBNCBQYGC-UHFFFAOYSA-N |
D02 | LXENKEWVEVKKGV-BQYQJAHWSA-N | VSVFLGPUZJTBSD-UHFFFAOYSA-N |
D08 | BMKPVDQDJQWBPD-UHFFFAOYSA-N | DUKQPWDVIZDABV-UHFFFAOYSA-N |
E04 | RZTAMFZIAATZDJ-UHFFFAOYSA-N | HAQDEJPEAKWAAM-UHFFFAOYSA-N |
F18 | KOCVKGYKBLJEPK-LYBHJNIJSA-N | WRLVHADVOGFZOZ-UHFFFAOYSA-N |
F23 | KVWDHTXUZHCGIO-UHFFFAOYSA-N | WXPNDRBBWZMPQG-UHFFFAOYSA-N |
G05 | WBGKWQHBNHJJPZ-LECWWXJVSA-N | ZMUSCGJNJYXJBP-UHFFFAOYSA-N |
G06 | POJZIZBONPAWIV-UHFFFAOYSA-N | GQXSULRYFDAMOO-UHFFFAOYSA-N |
G15 | VYMDGNCVAMGZFE-UHFFFAOYSA-N | PKYKNPLSFOKASK-UHFFFAOYSA-N |
H24 | UREBDLICKHMUKA-QCYOSJOCSA-N | GJFCONYVAUNLKB-UHFFFAOYSA-N |
I06 | KGPGQDLTDHGEGT-SZUNQUCBSA-N | LQERMDXPGNOJCT-UHFFFAOYSA-N |
I18 | VDJHFHXMUKFKET-WDUFCVPESA-N | HULPONUAINYLQQ-UHFFFAOYSA-N |
J02 | NSFFHOGKXHRQEW-AIHSUZKVSA-N | UTBOEBCWXGDOGI-UHFFFAOYSA-N |
J07 | XKFTZKGMDDZMJI-HSZRJFAPSA-N | KRBSMMVJJVHVCB-UHFFFAOYSA-N |
J14 | XKFTZKGMDDZMJI-HSZRJFAPSA-N | KRBSMMVJJVHVCB-UHFFFAOYSA-N |
K02 | UREBDLICKHMUKA-CXSFZGCWSA-N | GJFCONYVAUNLKB-UHFFFAOYSA-N |
K05 | GIUYCYHIANZCFB-FJFJXFQQSA-N | CAOWNCTTWGSKDO-UHFFFAOYSA-N |
K13 | SJFBTAPEPRWNKH-CCKFTAQKSA-N | XUZQTIZWMHMWOC-UHFFFAOYSA-N |
L06 | OHRURASPPZQGQM-GCCNXGTGSA-N | SOOPLNPQGWJZHY-UHFFFAOYSA-N |
L11 | NHFDRBXTEDBWCZ-ZROIWOOFSA-N | GMROZDPZEUVIGD-UHFFFAOYSA-N |
N09 | MBGGBVCUIVRRBF-UHFFFAOYSA-N | AUMHDRMJJNZTPB-UHFFFAOYSA-N |
N14 | HTSLEZOTMYUPLU-UHFFFAOYSA-N | AGNWVEJTZJIJIM-UHFFFAOYSA-N |
O10 | DEQANNDTNATYII-UHFFFAOYSA-N | JDKKNQACNITFEA-UHFFFAOYSA-N |
O14 | FAIIFDPAEUKBEP-UHFFFAOYSA-N | KJWGEXJCWCYEMI-UHFFFAOYSA-N |
P01 | AOZPVMOOEJAZGK-UHFFFAOYSA-N | UXUQIRNFBFRPAC-UHFFFAOYSA-N |
P03 | HFPLHASLIOXVGS-UHFFFAOYSA-N | CANBMWXJDLUDFF-UHFFFAOYSA-N |
P12 | UIAGMCDKSXEBJQ-UHFFFAOYSA-N | SVMHYHIZWOJKDL-UHFFFAOYSA-N |
P18 | ZDXUKAKRHYTAKV-UHFFFAOYSA-N | PHOGQKDIVUJGMJ-UHFFFAOYSA-N |
P23 | YYDUWLSETXNJJT-MTJSOVHGSA-N | LNFZRMDSZJCZTG-UHFFFAOYSA-N |
As you can see, the first block of the InChIKey (the part that encodes connectivity) is different in every row, so the mismatch can't be explained by missing stereochemical information alone.
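To make that check concrete, here is a minimal sketch of how the blocks of two InChIKeys can be compared. It assumes standard 27-character InChIKeys (14-character connectivity block, 10-character second block, 1-character protonation flag); the function names are hypothetical and this is not the analysis code used for the table.

```python
def split_inchikey(key):
    """Split a standard InChIKey into its three hyphen-separated blocks."""
    blocks = key.strip().split("-")
    # Standard InChIKeys have blocks of 14, 10 and 1 characters.
    if [len(b) for b in blocks] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return tuple(blocks)


def classify_mismatch(expected, found):
    """Report whether two keys differ in the first block or only later."""
    exp_blocks = split_inchikey(expected)
    found_blocks = split_inchikey(found)
    if exp_blocks == found_blocks:
        return "identical"
    if exp_blocks[0] == found_blocks[0]:
        return "same connectivity block"
    return "different connectivity block"


# Well A03 from the table: the first blocks differ.
print(classify_mismatch("KRGQEOSDQHTZMX-IGCYCDGOSA-N",
                        "LPYXWGMUVRGUOY-UHFFFAOYSA-N"))

# Expected keys of wells A12 and J02: same first block, different second block.
print(classify_mismatch("NSFFHOGKXHRQEW-DVRIZHICSA-N",
                        "NSFFHOGKXHRQEW-AIHSUZKVSA-N"))
```

A result of "same connectivity block" would point to a stereochemistry-only difference; every row in the table falls into the "different connectivity block" case.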
Note that each row of the table applies to all of the TARGET2 plates described in the metadata files, except for those coming from source_9 (I excluded these from my analysis code because they have a 1536-well layout and I wanted to keep things simple, see #77) and plate CP1-SC2-25 from source_7 (similarly, because it seems like the plate has a mirrored layout, see #77). So I ran this check on 131 plates, and for all of them I can find the differences described in the table above.
Any idea on whether the compounds used in the experiments are actually different, as suggested by the InChIKeys, or whether there is some issue in the metadata files provided in this repo?
Hi @FrenkT,
Thank you for the detailed report and bringing this to our attention. I can confirm that the compounds are not different, which would mean that the metadata provided is not correct. I will look into this and talk to others who know more about how the metadata was generated and get back to you.
@srijitseal will post more insights later, but for now, it appears that running StandardizeMolecule.py on JUMP-Target-2_compound_metadata.tsv resolves the discrepancies.
However, it does add a new discrepancy: we now have 301 unique entries, not 303, in the Target2 set.
That's because, in addition to the "duplicates" of BVT-948, dexamethasone, and thiostrepton (noted in jump-cellpainting/JUMP-Target#9 (comment)), we also note "duplicates" of ME-0328 and quinidine/quinine:
https://chat.openai.com/share/fa71ea33-0bc7-4699-b03a-a0ba41353164
SMILES_standardized pert_iname
22 CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21 BVT-948
266 CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21 BVT-948
85 CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1 ME-0328
146 CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1 ME-0328
150 CC1CC2C3CC=C4CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O... dexamethasone
191 CC1CC2C3CC=C4CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O... dexamethasone
143 C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12 quinidine
1 C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12 quinine
10 C=C(NC(=O)C(=C)NC(=O)c1csc(C2=NC3c4csc(n4)C4NC... thiostrepton
171 C=C(NC(=O)C(=C)NC(=O)c1csc(C2=NC3c4csc(n4)C4NC... thiostrepton
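For reference, a listing like the one above can be produced by grouping rows on the standardized SMILES. The following is a minimal standard-library sketch, not the code actually used; `find_shared_smiles` and `has_conflicting_names` are hypothetical helper names, and the dexamethasone and thiostrepton rows are left out because their SMILES are truncated in the listing.

```python
from collections import defaultdict

# (row index, SMILES_standardized, pert_iname), copied from the listing above.
ROWS = [
    (22, "CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21", "BVT-948"),
    (266, "CC1(C)C(=O)N=C2c3ccccc3C(=O)C(=O)C21", "BVT-948"),
    (85, "CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1", "ME-0328"),
    (146, "CC(NC(=O)CCc1nc(=O)c2ccccc2[nH]1)c1ccccc1", "ME-0328"),
    (143, "C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12", "quinidine"),
    (1, "C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12", "quinine"),
]


def find_shared_smiles(rows):
    """Return {smiles: [(index, name), ...]} for SMILES seen more than once."""
    groups = defaultdict(list)
    for index, smiles, name in rows:
        groups[smiles].append((index, name))
    return {s: members for s, members in groups.items() if len(members) > 1}


def has_conflicting_names(members):
    """True when rows sharing one SMILES carry more than one compound name."""
    return len({name for _, name in members}) > 1


for smiles, members in find_shared_smiles(ROWS).items():
    label = "different names" if has_conflicting_names(members) else "same name"
    print(smiles, members, label)
```

Groups flagged "different names" (here quinidine/quinine) are the ones where standardization has merged two named compounds, so they deserve a manual look before being treated as duplicates.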
In any case, this should keep you going for now @FrenkT
Updates
Per @srijitseal,
- Quinine and quinidine are stereoisomers; they may thus have different biological activity, so it is wise to keep them separate given that we have a phenotypic readout available (vs. just structure alone, in which case it may make sense to ignore stereochemistry). Conclusion: We should treat quinidine and quinine as separate entities in this dataset
- The two occurrences of thiostrepton: TBD
Thank you for looking into this @shntnu. Running the molecule standardisation script is definitely helpful.
When addressing the inconsistencies between SMILES representations from the two sources, we found that the discrepancies can be resolved by accounting for tautomerization and enantiomerization. Doing so reconciles the SMILES strings in the Target2 metadata (which did not do this step) with those in JUMPCP (which tried to show the lowest-energy tautomer).
There is a new inconsistency, however:
This compound is geldanamycin: https://en.wikipedia.org/wiki/Geldanamycin
Converting SMILES with cheminformatics tools like RDKit can sometimes produce forms that differ from those listed on resources like Wikipedia. This can be due to various reasons, including tautomerization, enantiomerization, loss of stereochemical (E/Z) information, or other subtleties in how different software handles chemical structure standardization and normalization.
We also found that there are 5 compounds with duplicate entries, hence 301 and not 306 unique compounds. After standardization, we don't see this problem anymore. In the original Target2 dataset, they seemed different for the following reasons.
1. The two compounds in this pair appear to be tautomers of each other.
2. This one differs at one stereocenter (stated vs. not shown).
3. ME-0328 does indeed seem to be a duplicate, where one of the repeats doesn't have stereochemistry information.
4. Quinine and quinidine are stereoisomers, so we should keep both as "unique". They have different biological signals, so there is no data leak when using Cell Painting, but caution is advised when using ECFP, which would leak data because it can't differentiate stereochemistry.
5. For thiostrepton, the connectivity is the same; it seems the stereochemistry is different.
@FrenkT I believe we have finally resolved everything here
Please see jump-cellpainting/JUMP-Target#32