openmm / spice-dataset

A collection of QM data for training potential functions

How to correctly parse the SMILES of the PubChem dataset?

raimis opened this issue

The SMILES of the PubChem dataset are generated with OpenFF-Toolkit (https://github.com/openmm/spice-dataset/blob/main/pubchem/createPubchem.py). So Molecule from OpenFF-Toolkit should be able to read them correctly, but this isn't the case.

Get a SMILES:

import h5py

h5 = h5py.File('pubchem/pubchem-1-2500.hdf5', 'r')  # open read-only
smiles = h5['103914790']['smiles'][0]
print(smiles)
b'[N:1]1=[C:2]2[N:3]([C:5]([H:17])([H:18])[C:4]1([H:15])[H:16])[C:12]1([H:30])[C:8]([H:23])([H:24])[C:13]3([H:31])[C:6]([H:19])([H:20])[C:11]2([H:29])[C:7]([H:21])([H:22])[C:14]([H:32])([C:9]1([H:25])[H:26])[C:10]3([H:27])[H:28]'

Parse the SMILES and print elements:

from openff.toolkit.topology import Molecule

mol = Molecule.from_smiles(smiles, hydrogens_are_explicit=True, allow_undefined_stereo=True)
print([atom.element.symbol for atom in mol.atoms])
['N', 'C', 'N', 'C', 'H', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'H']

Although the SMILES contains explicit hydrogens and atom map indices, the order of the atoms doesn't match: e.g. the atom with map index 5 in the SMILES is C, but the 5th atom in the molecule is H.

Hi @raimis, please use Molecule.from_mapped_smiles(), which retains the atom mapping.

mol = Molecule.from_mapped_smiles(smiles, allow_undefined_stereo=True)
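The two orderings can be compared without the toolkit. The sketch below is only an illustration: it uses a simple regular expression that assumes every atom is written as a plain bracket atom such as [C:5] (no charges, isotopes or chirality tags), which holds for this particular string but not for mapped SMILES in general. It lists the elements in the order they are written in the string, which matches the output of from_smiles() above, and in the order given by the map indices, which is the order the dataset expects.

```python
import re

# Mapped SMILES from the issue, decoded from bytes to str.
smiles = (
    b'[N:1]1=[C:2]2[N:3]([C:5]([H:17])([H:18])[C:4]1([H:15])[H:16])'
    b'[C:12]1([H:30])[C:8]([H:23])([H:24])[C:13]3([H:31])[C:6]([H:19])([H:20])'
    b'[C:11]2([H:29])[C:7]([H:21])([H:22])[C:14]([H:32])([C:9]1([H:25])[H:26])'
    b'[C:10]3([H:27])[H:28]'
).decode()

# Assumption: each atom looks like [Symbol:index]; this is not a general SMILES parser.
pairs = re.findall(r"\[([A-Z][a-z]?):(\d+)\]", smiles)

# Elements in the order the atoms are written in the string.
by_string_order = [symbol for symbol, _ in pairs]

# Elements sorted by map index, i.e. the ordering encoded by the mapping.
by_map_index = [symbol for symbol, idx in sorted(pairs, key=lambda p: int(p[1]))]

print(by_string_order[:5])
print(by_map_index[:5])
```

In the written order the 5th atom is H, while the atom with map index 5 is C, which is the mismatch reported above.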

@pavankum thanks! I hadn't noticed that in the documentation.

Thank you for the feedback; we will make sure to update the documentation.