rdkit / mmpdb

A package to identify matched molecular pairs and use them to predict property changes.

mmpdb transform behaves unexpectedly

mu-wang opened this issue · comments

The transform rules in mmpdblib appear to miss some obvious cases.

A test case with the following structures:

OC(c(cccc1)c1O)=O	mol1
CCCCCCCC(c(cc1)cc(C(O)=O)c1O)=O	mol2
CCCCCC(c(cc1)cc(C(O)=O)c1O)=O	mol3

with some properties:

ID	prop
mol1	0.0
mol2	1.0
mol3	1.5
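
For anyone reproducing this, here is a minimal Python sketch that writes the two input files. The file names test_struct.tsv and test_prop.tsv are taken from the commands in this report; the tab-separated layout (SMILES then ID with no header row for structures, an "ID" header row for properties) is assumed from the listings above, and the helper name write_inputs is hypothetical.

```python
from pathlib import Path

# Structures exactly as listed in the report: SMILES first, then ID.
STRUCTURES = [
    ("OC(c(cccc1)c1O)=O", "mol1"),
    ("CCCCCCCC(c(cc1)cc(C(O)=O)c1O)=O", "mol2"),
    ("CCCCCC(c(cc1)cc(C(O)=O)c1O)=O", "mol3"),
]

# Property values exactly as listed in the report.
PROPERTIES = [("mol1", 0.0), ("mol2", 1.0), ("mol3", 1.5)]


def write_inputs(directory="."):
    """Write test_struct.tsv and test_prop.tsv into `directory` (assumed layout)."""
    out = Path(directory)
    struct_path = out / "test_struct.tsv"
    prop_path = out / "test_prop.tsv"

    # One "SMILES<TAB>ID" record per line, no header row.
    struct_path.write_text(
        "".join(f"{smiles}\t{mol_id}\n" for smiles, mol_id in STRUCTURES)
    )

    # Header row followed by one "ID<TAB>value" record per line.
    prop_lines = ["ID\tprop\n"]
    prop_lines += [f"{mol_id}\t{value}\n" for mol_id, value in PROPERTIES]
    prop_path.write_text("".join(prop_lines))

    return struct_path, prop_path


if __name__ == "__main__":
    write_inputs()
```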

I performed the fragmentation, indexing, and property loading as instructed:

python -m mmpdblib fragment test_struct.tsv --max-rotatable-bonds 20 --num-cuts 3 -o test.fragments
python -m mmpdblib index test.fragments -o test.mmpdb
python -m mmpdblib loadprops --properties test_prop.tsv test.mmpdb

The indexed pairs make sense.

However, when I run:

python -m mmpdblib transform --smiles 'OC(c(cccc1)c1O)=O' test.mmpdb --explain

I noticed that I cannot get mol2 or mol3, even though the rules mol1->mol2 and mol1->mol3 are included in the index step. Did I miss something here? Thank you for your help.

Here's the explanation output:

WARNING: APSW not installed. Falling back to Python's sqlite3 module.
Processing fragment Fragmentation(1, 'N', 7, '1', '*c1ccccc1O', '0', 3, '1', '*C(=O)O', 'O=CO')
  variable '*c1ccccc1O' not found as SMILES '[*:1]c1ccccc1O'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(1, 'N', 3, '1', '*C(=O)O', '0', 7, '1', '*c1ccccc1O', 'Oc1ccccc1')
  variable '*C(=O)O' not found as SMILES '[*:1]C(=O)O'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(2, 'N', 6, '11', '*c1ccccc1*', '01', 4, '12', '*C(=O)O.*O', None)
  variable '*c1ccccc1*' not found as SMILES '[*:1]c1ccccc1[*:2]'
  variable '*c1ccccc1*' not found as SMILES '[*:2]c1ccccc1[*:1]'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(1, 'N', 1, '1', '*O', '0', 9, '1', '*c1ccccc1C(=O)O', 'O=C(O)c1ccccc1')
  variable '*O' not found as SMILES '[*:1]O'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(1, 'N', 9, '1', '*c1ccccc1C(=O)O', '0', 1, '1', '*O', 'O')
  variable '*c1ccccc1C(=O)O' not found as SMILES '[*:1]c1ccccc1C(=O)O'
  No matching rule SMILES found. Skipping fragment.
Processing fragment Fragmentation(2, 'N', 6, '11', '*c1ccccc1*', '01', 4, '12', '*O.*C(=O)O', None)
  variable '*c1ccccc1*' not found as SMILES '[*:1]c1ccccc1[*:2]'
  variable '*c1ccccc1*' not found as SMILES '[*:2]c1ccccc1[*:1]'
  No matching rule SMILES found. Skipping fragment.
== Product SMILES in database: 0 ==
ID	SMILES	prop_from_smiles	prop_to_smiles	prop_radius	prop_fingerprint	prop_rule_environment_id	prop_count	prop_avg	prop_std	prop_kurtosis	prop_skewness	prop_min	prop_q1	prop_median	prop_q3	prop_max	prop_paired_t	prop_p_value

I believe what's happening is that transform works on the variable part, but hydrogens aren't treated as the variable *[H] but instead are treated as a special case.

If so, I don't remember if transformation from a hydrogen was deliberately not included in the "transform" operation, or if it was an oversight.
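
One way to check this reading against the --explain output is to list the variable fragments that transform actually tried. The sketch below is plain Python with no mmpdb dependency; it assumes the log line format shown in the output above (which may differ between versions), and the helper name attempted_variables is hypothetical.

```python
import re

# Matches log lines such as:
#   variable '*O' not found as SMILES '[*:1]O'
VARIABLE_LINE = re.compile(r"variable '([^']+)' not found as SMILES")


def attempted_variables(explain_text):
    """Return the distinct variable fragments, in order of first appearance."""
    seen = []
    for line in explain_text.splitlines():
        match = VARIABLE_LINE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# The "variable" lines copied from the --explain output in this report.
SAMPLE = """\
  variable '*c1ccccc1O' not found as SMILES '[*:1]c1ccccc1O'
  variable '*C(=O)O' not found as SMILES '[*:1]C(=O)O'
  variable '*c1ccccc1*' not found as SMILES '[*:1]c1ccccc1[*:2]'
  variable '*c1ccccc1*' not found as SMILES '[*:2]c1ccccc1[*:1]'
  variable '*O' not found as SMILES '[*:1]O'
  variable '*c1ccccc1C(=O)O' not found as SMILES '[*:1]c1ccccc1C(=O)O'
"""

if __name__ == "__main__":
    variables = attempted_variables(SAMPLE)
    for fragment in variables:
        print(fragment)
    # No hydrogen fragment was ever tried as the variable part.
    print("hydrogen fragment tried:", "*[H]" in variables)
```

Every fragment tried contains at least one heavy atom, which is consistent with hydrogens not being considered as a variable part here.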

As Jérôme and Christian point out, hydrogen transformations were explicitly not included as there would be too many.

The transform option lets you specify a specific hydrogen to consider, by denoting it with an explicit [H] in the SMILES string.

However, that code path has not been used for years and it does not work in the main mmpdb release. (RDKit changed its wildcard representation from [*] to * about five years ago, and mmpdb used a hard-coded [*][H] to recognize the cut hydrogen SMILES fragment.)
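
To illustrate the kind of failure described above, here is a small Python sketch. It is not mmpdb's actual code: both function names are hypothetical, and the only facts taken from this thread are the two wildcard spellings, [*] and *.

```python
def is_cut_hydrogen_hardcoded(fragment_smiles):
    """Illustrative only: recognizes just the older "[*]" wildcard spelling."""
    return fragment_smiles == "[*][H]"


def is_cut_hydrogen_tolerant(fragment_smiles):
    """Illustrative only: accepts either spelling of the wildcard atom."""
    return fragment_smiles in {"[*][H]", "*[H]"}


if __name__ == "__main__":
    for smiles in ("[*][H]", "*[H]"):
        print(
            smiles,
            "hardcoded:", is_cut_hydrogen_hardcoded(smiles),
            "tolerant:", is_cut_hydrogen_tolerant(smiles),
        )
```

With the newer spelling, the hard-coded comparison silently returns False, so the hydrogen fragment is never recognized and no error is raised.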

The fix is in the v3 development version, available from https://github.com/adalke/mmpdb/tree/v3-dev .

Hi @adalke, thank you for your help. I will try the v3-dev version of mmpdb.

Hello. I am restarting this thread as I have a follow-up question. I am able to specify a hydrogen to consider by denoting it with an explicit [H] in the SMILES string. I also note that one can specify multiple hydrogens this way and vary them all.

However, it appears that if you specify one or more hydrogens then only the specified hydrogen position(s) are modified, with the rest of the molecule remaining unchanged. Is there a way to vary the hydrogen(s) and the rest of the molecule? Or a flag to vary all hydrogens? I understand this may generate large numbers of compounds.

Thanks.

Hi DJ,

Back when we first designed mmpdb, we found that the number of compounds generated can become extremely large if you allow both hydrogen and fragment exchanges. In fact, depending on your database, the number of compounds generated can already become very large if you allow replacing all hydrogens. We therefore decided to allow either exchange of explicit hydrogens or exchange of fragments that include at least one heavy atom, but not both.
You can change this behaviour, but you'd have to hack the code a bit. Before doing that, I'd recommend you test whether the output is still manageable for what you want to do with it: just make all hydrogens explicit in the input molecule and see what happens.

Bests,
Christian
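
A rough back-of-the-envelope sketch of the growth described above, in Python. It assumes, hypothetically, that every hydrogen position has the same number of applicable replacements and that positions vary independently; symmetry-equivalent products are not merged, so the figures are upper bounds. Real databases will not be this uniform, and the function names are made up for illustration.

```python
def single_site_products(n_hydrogens, rules_per_site):
    """Products when exactly one hydrogen position is replaced at a time."""
    return n_hydrogens * rules_per_site


def multi_site_products(n_hydrogens, rules_per_site):
    """Products when any subset of hydrogen positions may be replaced.

    Each position either keeps its hydrogen or takes one of the
    replacements; the unchanged molecule is subtracted at the end.
    """
    return (rules_per_site + 1) ** n_hydrogens - 1


if __name__ == "__main__":
    # mol1 (salicylic acid, C7H6O3) has six hydrogens; ten replacements
    # per position is an arbitrary illustrative figure.
    print(single_site_products(6, 10))
    print(multi_site_products(6, 10))
```

Under these assumptions the single-site count grows linearly with the number of hydrogens, while the multi-site count grows exponentially, before any heavy-atom fragment exchanges are combined on top.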