The instances of ('H', 0) shown in SPICE paper seem unusually low

AndChenCM opened this issue a year ago

Hi,

I noticed that the SPICE dataset contains about 1 million molecules, but Table 2 of your published paper lists only 1594 instances of ('H', 0). That number seems unusually low to me. During the DFT calculations, did you use explicit hydrogens for some molecules and implicit hydrogens for the rest? Or does 1594 just count the explicit hydrogens that appear in the SMILES strings?

Peter Eastman · Answer 1 · Thu Apr 20 2023 07:22:41 GMT+0800 (China Standard Time)

You're absolutely right. It's only counting hydrogens that are explicitly mentioned in the SMILES string. Here's the code I used to generate that table.

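from collections import defaultdict

import h5py
from rdkit import Chem

# Total count of each (atomic number, symbol, formal charge) over all conformations.
types = defaultdict(int)
infile = h5py.File('SPICE.hdf5', 'r')
for group in infile:
    try:
        smiles = infile[group]['smiles'][0]
        # h5py 3.x returns string data as bytes, but RDKit expects str.
        if isinstance(smiles, bytes):
            smiles = smiles.decode()
        mol = Chem.MolFromSmiles(smiles)
        # Each conformation of the molecule contributes its atoms once.
        num = infile[group]['conformations'].shape[0]
        for atom in mol.GetAtoms():
            key = (atom.GetAtomicNum(), atom.GetSymbol(), atom.GetFormalCharge())
            types[key] += num
    except Exception:
        # Report any group that could not be processed (e.g. unparsable SMILES).
        print(group)
# Sorting the keys orders the table by atomic number, then formal charge.
for key in sorted(types):
    print(f'{key[1]}\t{key[2]}\t{types[key]}')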

I just added the line

        mol = Chem.rdmolops.AddHs(mol)

to turn the implicit hydrogens into explicit atoms, and the count for ('H', 0) is now a much more reasonable 15,207,949.
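To illustrate the effect on a single molecule, here is a minimal sketch using ethanol ("CCO"), which is only an example input and not taken from SPICE. The helper `count_atom_types` is a hypothetical name introduced for this illustration, and `Chem.AddHs` is used as the shorter spelling of `Chem.rdmolops.AddHs`.

```python
from collections import Counter

from rdkit import Chem


def count_atom_types(smiles, add_hs):
    """Tally (symbol, formal charge) pairs for one molecule (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    if add_hs:
        # Turn implicit hydrogens into real atoms so GetAtoms() visits them.
        mol = Chem.AddHs(mol)
    return Counter((a.GetSymbol(), a.GetFormalCharge()) for a in mol.GetAtoms())


without_hs = count_atom_types("CCO", add_hs=False)
with_hs = count_atom_types("CCO", add_hs=True)

print(without_hs[("H", 0)])  # 0: the hydrogens are implicit, so none are iterated
print(with_hs[("H", 0)])     # 6: ethanol is C2H6O
```

Without `AddHs`, iterating over `mol.GetAtoms()` only visits the heavy atoms, which is why the hydrogen tally comes out far too small.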

Sorry about that, and thanks for catching it!

Mingan Chen · Answer 2 · Thu Apr 20 2023 12:31:22 GMT+0800 (China Standard Time)

Thanks for confirming that! This makes much more sense now.