How to correctly parse the SMILES of the PubChem dataset?

Question

raimis opened this issue 3 years ago · comments

Raimondas Galvelis commented 3 years ago

The SMILES of the PubChem dataset are generated with OpenFF-Toolkit (https://github.com/openmm/spice-dataset/blob/main/pubchem/createPubchem.py). So, Molecule from OpenFF-Toolkit should be able to read them correctly, but this isn't the case.

Get a SMILES:

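import h5py

# Open the dataset file read-only and take the first SMILES of one record.
h5 = h5py.File('pubchem/pubchem-1-2500.hdf5', 'r')
smiles = h5['103914790']['smiles'][0]  # stored as bytes, hence the b'...' output
print(smiles)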

b'[N:1]1=[C:2]2[N:3]([C:5]([H:17])([H:18])[C:4]1([H:15])[H:16])[C:12]1([H:30])[C:8]([H:23])([H:24])[C:13]3([H:31])[C:6]([H:19])([H:20])[C:11]2([H:29])[C:7]([H:21])([H:22])[C:14]([H:32])([C:9]1([H:25])[H:26])[C:10]3([H:27])[H:28]'

Parse the SMILES and print elements:

from openff.toolkit.topology import Molecule

mol = Molecule.from_smiles(smiles, hydrogens_are_explicit=True, allow_undefined_stereo=True)
print([atom.element.symbol for atom in mol.atoms])

['N', 'C', 'N', 'C', 'H', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'H']

Although the SMILES contains explicit hydrogens and atom map indices, the atom order doesn't match: for example, the atom with map index 5 in the SMILES is C, but the 5th atom in the parsed molecule is H.
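The mismatch can be reproduced without any chemistry toolkit. The sketch below is only an illustration and is not part of OpenFF-Toolkit: it uses a simple regular expression (an assumption that holds for this SMILES, where every atom is written as [Element:index] with no charges, isotopes, or stereo marks) to compare the order in which atoms appear in the string with the order implied by their map indices. The names ATOM, appearance_order, and mapped_order are hypothetical.

```python
import re

# Mapped SMILES copied from the output above, decoded from bytes to str.
smiles = (
    "[N:1]1=[C:2]2[N:3]([C:5]([H:17])([H:18])[C:4]1([H:15])[H:16])"
    "[C:12]1([H:30])[C:8]([H:23])([H:24])[C:13]3([H:31])[C:6]([H:19])([H:20])"
    "[C:11]2([H:29])[C:7]([H:21])([H:22])[C:14]([H:32])([C:9]1([H:25])[H:26])"
    "[C:10]3([H:27])[H:28]"
)

# Assumption: each atom looks like [Element:map_index] and nothing else.
ATOM = re.compile(r"\[([A-Z][a-z]?):(\d+)\]")
atoms = [(symbol, int(index)) for symbol, index in ATOM.findall(smiles)]

# Order in which the atoms are written in the string.
appearance_order = [symbol for symbol, _ in atoms]

# Order implied by the map indices (1, 2, 3, ...).
mapped_order = [symbol for symbol, _ in sorted(atoms, key=lambda a: a[1])]

print(appearance_order[4])  # H
print(mapped_order[4])      # C
```

The appearance order is the same as the element list printed above, which suggests that the map indices were not used to order the atoms during parsing.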

Pavan Behara · Answer 1 · Wed Feb 09 2022 04:54:17 GMT+0800 (China Standard Time)

Hi @raimis, please use Molecule.from_mapped_smiles(), which retains the atom mapping.

mol = Molecule.from_mapped_smiles(smiles, allow_undefined_stereo=True)

Raimondas Galvelis · Answer 2 · Wed Feb 09 2022 22:15:45 GMT+0800 (China Standard Time)

@pavankum thanks! I hadn't noticed that in the documentation.

Pavan Behara · Answer 3 · Wed Feb 09 2022 23:10:00 GMT+0800 (China Standard Time)

Thank you for the feedback. We will make sure to update the documentation.