How to correctly parse the SMILES of the PubChem dataset?
raimis opened this issue · comments
The SMILES of the PubChem dataset are generated with OpenFF-Toolkit (https://github.com/openmm/spice-dataset/blob/main/pubchem/createPubchem.py). So Molecule from OpenFF-Toolkit should be able to read them correctly, but this isn't the case.
Get a SMILES:
import h5py
h5 = h5py.File('pubchem/pubchem-1-2500.hdf5')
smiles = h5['103914790']['smiles'][0]
print(smiles)
b'[N:1]1=[C:2]2[N:3]([C:5]([H:17])([H:18])[C:4]1([H:15])[H:16])[C:12]1([H:30])[C:8]([H:23])([H:24])[C:13]3([H:31])[C:6]([H:19])([H:20])[C:11]2([H:29])[C:7]([H:21])([H:22])[C:14]([H:32])([C:9]1([H:25])[H:26])[C:10]3([H:27])[H:28]'
Parse the SMILES and print elements:
from openff.toolkit.topology import Molecule
mol = Molecule.from_smiles(smiles, hydrogens_are_explicit=True, allow_undefined_stereo=True)
print([atom.element.symbol for atom in mol.atoms])
['N', 'C', 'N', 'C', 'H', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'C', 'H', 'H', 'C', 'H', 'H']
Although the SMILES contains explicit hydrogens and atom map indices, the atom order in the parsed molecule doesn't follow those indices. For example, the atom with map index 5 in the SMILES is C, but the 5th atom in the molecule is H.
Hi @raimis, please use Molecule.from_mapped_smiles(), which retains the atom mapping.
mol = Molecule.from_mapped_smiles(smiles, allow_undefined_stereo=True)
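To see which element order the map indices imply, independently of any toolkit, the indices can be read directly from the SMILES string. The following is a minimal sketch using only the Python standard library. The helper name elements_by_map_index and the regular expression are illustrative assumptions, not part of OpenFF-Toolkit, and the pattern only covers bracket atoms that carry a map index, as in the SMILES above.

```python
import re

# Simplified pattern (an assumption for illustration): a bracket atom with an
# optional isotope, an element symbol, optional extras, and a ":<map index>".
MAPPED_ATOM = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]:]*:(\d+)\]")


def elements_by_map_index(mapped_smiles):
    """Return element symbols sorted by their atom map index (hypothetical helper)."""
    # h5py returns the stored SMILES as bytes, so decode it first if needed.
    if isinstance(mapped_smiles, bytes):
        mapped_smiles = mapped_smiles.decode()
    atoms = [(int(index), symbol.capitalize())
             for symbol, index in MAPPED_ATOM.findall(mapped_smiles)]
    return [symbol for _, symbol in sorted(atoms)]


smiles = (b'[N:1]1=[C:2]2[N:3]([C:5]([H:17])([H:18])[C:4]1([H:15])[H:16])'
          b'[C:12]1([H:30])[C:8]([H:23])([H:24])[C:13]3([H:31])[C:6]([H:19])([H:20])'
          b'[C:11]2([H:29])[C:7]([H:21])([H:22])[C:14]([H:32])([C:9]1([H:25])[H:26])'
          b'[C:10]3([H:27])[H:28]')

expected = elements_by_map_index(smiles)
print(expected)
```

If the atoms of the molecule are ordered by map index, the list of element symbols printed from the molecule should equal expected: the two nitrogens and all twelve carbons come first, followed by the eighteen hydrogens.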
@pavankum thanks! I hadn't noticed that in the documentation.
Thank you for the feedback, we will make sure to update the documentation.