The instances of ('H', 0) shown in the SPICE paper seem unusually low
AndChenCM opened this issue · comments
Hi,
I notice that there are about 1 million molecules in the SPICE dataset, but only 1594 instances of ('H', 0), as shown in Table 2 of your published paper. The number 1594 seems unusually low to me. During the DFT calculations, did you use explicit hydrogens for some molecules and implicit hydrogens for the rest? Or does 1594 only count the explicit H atoms that appear in the SMILES strings?
You're absolutely right. It's only counting hydrogens that are explicitly mentioned in the SMILES string. Here's the code I used to generate that table.
from rdkit import Chem
import h5py
from collections import defaultdict

# Maps (atomic number, symbol, formal charge) to the number of instances
types = defaultdict(int)
infile = h5py.File('SPICE.hdf5')
for group in infile:
    try:
        smiles = infile[group]['smiles'][0]
        mol = Chem.MolFromSmiles(smiles)
        # Each atom is counted once per conformation of its molecule
        num = infile[group]['conformations'].shape[0]
        for atom in mol.GetAtoms():
            key = (atom.GetAtomicNum(), atom.GetSymbol(), atom.GetFormalCharge())
            types[key] += num
    except Exception:
        # Print the name of any group that could not be processed
        print(group)
for key in sorted(types):
    print(f'{key[1]}\t{key[2]}\t{types[key]}')
I just added the line
    mol = Chem.rdmolops.AddHs(mol)
right after the molecule is created, so that implicit hydrogens are added as explicit atoms, and it now gives a much more reasonable number: 15,207,949.
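For anyone who wants to see the effect on a single molecule, here is a minimal sketch. It is not the script used for the paper: the helper name `count_atom_types` and the ethanol SMILES `CCO` are chosen purely for illustration, and it counts atoms in one molecule without weighting by the number of conformations.

```python
from collections import defaultdict

from rdkit import Chem


def count_atom_types(smiles, add_hs):
    """Count (symbol, formal charge) pairs for one molecule.

    Hypothetical helper for illustration only. If add_hs is True,
    implicit hydrogens are turned into explicit atoms before counting.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    if add_hs:
        # Same operation as Chem.rdmolops.AddHs
        mol = Chem.AddHs(mol)
    counts = defaultdict(int)
    for atom in mol.GetAtoms():
        counts[(atom.GetSymbol(), atom.GetFormalCharge())] += 1
    return dict(counts)


# Ethanol: the SMILES lists only heavy atoms, so hydrogens are implicit.
without_hs = count_atom_types("CCO", add_hs=False)
with_hs = count_atom_types("CCO", add_hs=True)

print(without_hs)  # no ('H', 0) entry: only C and O are iterated
print(with_hs)     # ('H', 0) now appears, with a count of 6
```

Without `AddHs`, iterating over `mol.GetAtoms()` never visits the implicit hydrogens, which is why the original table undercounted them so heavily.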
Sorry about that, and thanks for catching it!
Thanks for confirming that! This makes much more sense now.