openmm / spice-dataset

A collection of QM data for training potential functions

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

The instances of ('H', 0) shown in SPICE paper seem unusually low

AndChenCM opened this issue · comments

Hi,

I notice that there are about 1 million molecules in SPICE dataset, but only 1594 instances of ('H', 0) as shown in Table 2 in your published paper. The number '1594' seems unusually low to me. I wonder during DFT calculations, did you use explicit hydrogens for some molecules and implicit hydrogens for the rest? Or '1594' just means the explicit 'H's that are shown in smiles?

You're absolutely right. It's only counting hydrogens that are explicitly mentioned in the SMILES string. Here's the code I used to generate that table.

from rdkit import Chem
import h5py
from collections import defaultdict

types = defaultdict(int)
infile = h5py.File('SPICE.hdf5')
for group in infile:
    try:
        smiles = infile[group]['smiles'][0]
        mol = Chem.MolFromSmiles(smiles)
        num = infile[group]['conformations'].shape[0]
        for atom in mol.GetAtoms():
            key = (atom.GetAtomicNum(), atom.GetSymbol(), atom.GetFormalCharge())
            types[key] += num
    except:
        print(group)
for key in sorted(types):
    print(f'{key[1]}\t{key[2]}\t{types[key]}')

I just added the line

        mol = Chem.rdmolops.AddHs(mol)

to make it build implicit hydrogens, and it now gives a much more reasonable number: 15,207,949.

Sorry about that, and thanks for catching it!

Thanks for confirming that! This makes much more sense now.