openmm / spice-dataset

A collection of QM data for training potential functions


Additions for version 2

peastman opened this issue · comments

I've been talking with lots of people about what would be most useful to add in version 2 of SPICE. I've gotten lots of great suggestions. Here are some ideas in no particular order.

  • Solvated PubChem Molecules. This would provide data on ligand-water interactions. They could be generated the same way we did the solvated amino acids: simulate the molecule in a box of water, periodically saving conformations along with ~20 of the nearest water molecules. These will be largish, and therefore expensive to compute. We'll probably need to limit the number of molecules we include and the number of conformations for each one. I think this is a case where a little data can go a long way, since each conformation has many water molecules. (A sketch of extracting such clusters appears after this list.)

  • Water Clusters. This would provide some conformations that are more representative of bulk water. How large should the clusters be? What's the best way to generate them? I've been advised that classical water models often don't produce very realistic water structure.

  • PubChem Molecules with Boron (and possibly Silicon). I excluded these elements from the original PubChem set. They only appear in a tiny fraction of molecules, so there wouldn't have been enough data on them to be useful. But if we specifically select molecules that include them, I think we could assemble a useful amount of data. There are around 1,500 molecules with B and around 1,900 with Si.

  • Amino Acid / Small Molecule Pairs. This would provide more data on protein-ligand interactions. One way of generating them would be to use PDB structures. We could identify which amino acids contact the ligand and generate conformations for them, each conformation including the ligand plus a single amino acid.

  • Transition States. If you're interested in rates, it's important to model transition states accurately. Because they're higher energy, they won't be very well sampled in a thermal distribution. Can we add conformations specifically to sample transition states? One suggestion was to scan dihedral rotations while minimizing all other degrees of freedom. What molecules would it be useful to do this with? Of course, not all transition states involve dihedral rotations. What else could we do?

  • Metals. The goal would be to allow simulating metalloproteins. These could also be generated from PDB structures, taking the amino acids that interact with the metal. Would it be sufficient to consider only one amino acid at a time, or do you really need to include the whole environment? If the latter, these could be quite expensive to calculate.

  • Active Learning. This is the idea discussed in #35, to add more conformations for existing molecules to improve the accuracy of trained models.

  • More PubChem Molecules. More chemical diversity is always good. The number included in version 1 was mostly arbitrary, just based on how long it took to do the calculations. It would be easy to add more.
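As a concrete illustration of the solvated-molecule idea above, here is a hedged sketch (not the actual generation script) of carving out a solute plus its ~20 nearest water molecules from a simulation snapshot, using MDTraj; the function name and inputs are assumptions for illustration:

import mdtraj
import numpy as np

def solute_plus_nearest_waters(traj, frame, n_waters=20):
    # Indices of solute atoms and of each water's oxygen.
    solute = traj.topology.select('not water')
    water_o = traj.topology.select('water and name O')
    # Distance from every water oxygen to every solute atom.
    pairs = np.array([(o, s) for o in water_o for s in solute])
    d = mdtraj.compute_distances(traj[frame], pairs).reshape(len(water_o), len(solute))
    # Keep the n_waters whose nearest solute atom is closest.
    nearest = water_o[np.argsort(d.min(axis=1))[:n_waters]]
    keep = set(solute)
    for o in nearest:
        keep.update(a.index for a in traj.topology.atom(int(o)).residue.atoms)
    return traj[frame].atom_slice(sorted(keep))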

These are good ideas!

Here are some other ideas, some of which have been mentioned before but are not listed above:

  • Enamine building blocks. These are chemical "building blocks" used in reactions at Enamine or elsewhere in synthetic organic chemistry for drug discovery. We can download the SMILES libraries, and if there is any concern about permission to use these compound libraries, we can ask for it (we work closely with Enamine). While PubChem provides coverage of molecules that have already been made, good coverage of building blocks provides coverage of molecules likely to be made in the future. If size is an issue, the set could be filtered by size and/or clustered by similarity.

  • Enamine REALSpace representative couplings. These would be a small subset of representative couplings between small building blocks to provide coverage of the chemistries joining synthetic building blocks together. These could be generated via common robust SMIRKS reaction transformations that couple together compatible SMILES for building blocks.

  • Common post-translational modifications of proteins. A large number of post-translational modifications are commonly found for proteins, so confining ourselves to natural amino acids is severely limiting for studying real biomolecular systems.

  • The PDB Chemical Component Dictionary. This is a catch-all of everything anyone has cared enough about to obtain a structure. There are statistics available for how many times each entity appears that can provide an easy way to capture the smallest number of most important residues.

  • Lipids. Lipids are important for membranes and membrane proteins, but there is poor coverage of these in quantum chemical datasets. There are databases of lipids and lipid structures that could be tapped. Lipid nanoparticles are also incredibly important right now.

  • Carbohydrates. Complex carbohydrates are attached to many eukaryotic proteins, and are composed of a relatively small number of building blocks and common linkages. A representative set can readily be generated if someone hasn't done this already.

  • Nucleic acids. Nucleic acid coverage is important for simulating RNA and DNA, which are both important for understanding biomolecular mechanisms and also targets for new technologies like RNA vaccines, CRISPR, and targeting RNA with small molecules.

How should we prioritize the various ideas? Is there a simple voting scheme that could be used to capture interest? Or is it better to focus on an MVP that delivers clear value to many users?

  • One compelling MVP is "being able to simulate most of the PDB". Since structural information is required for molecular simulations and the PDB is the primary source of structural data, tackling the most popular chemical entities from the PDB Chemical Component Dictionary (and perhaps common couplings they are involved in) would be extremely high value.

  • Much of the drug discovery world now uses the 38B+ compound Enamine REALSpace one-step synthetic library for early stage discovery, so delivering greater coverage and accuracy for these compounds would be very high value.

  • Nucleic acid, lipid, and carbohydrate force fields are critical for representing many biomolecular systems that are difficult to simulate right now. Providing greater coverage of these with targeted datasets would be highly valuable.

Back when we designed version 1, we discussed including Enamine molecules. You were going to talk to them and try to get permission, but you never succeeded. If you want to try again and see if they'll give us permission now to use their molecules, go for it!

Peptide macrocycles are another class of interesting biomolecules.

Are there any covalent or non-covalent interactions in cyclic peptides that are different from regular ones?

I don't think so; mostly hydrogen bonds. Capturing the ring puckering correctly was on my mind, since cyclic peptides are conformationally flexible.

Back when we designed version 1, we discussed including Enamine molecules. You were going to talk to them and try to get permission, but you never succeeded. If you want to try again and see if they'll give us permission now to use their molecules, go for it!

Thanks for the reminder! I've just followed up on this.

Are there any covalent or non-covalent interactions in cyclic peptides that are different from regular ones?

For cyclic peptides, there are a number of staple peptides that make use of nonstandard chemical linkages. I'm trying to find a database that contains many examples of these that we might be able to use (or fragment and use).

I also came across this paper on constructing virtual databases of cyclic peptides that doesn't seem to include hydrocarbon stapling but does include some non-natural amino acid diversity:
https://pubs.acs.org/doi/10.1021/ci100431r

We might also consider dyes that are frequently used as spectroscopic labels in biomolecules. There are some databases available.

These molecules (and likely the metals dataset) may have certain considerations in selecting an appropriate basis set or level of theory for quantum chemistry.

Thanks for the reminder! I've just followed up on this.

All we need from them is a list of SMILES strings. No other metadata of any sort is required. But the list of SMILES strings has to be in the public domain, because it will become part of the dataset. If they claim a copyright on it then we can't include it as part of a public domain dataset, even if they offer permissive licensing terms.

I'm looking into how to generate a set of amino acid / small molecule pairs. From Ligand Expo I can download SMILES strings for every registered residue (there are over 40,000), and a list of every PDB entry it appears in. From there I can find every amino acid that contacts it and extract them one at a time.

The question is how to select useful molecules. There are some obvious filters I can apply (sketched in code below):

  • Not a standard amino acid.
  • The number of atoms is in some reasonable range (maybe about 5-50).
  • It appears in some minimum number of PDB structures (though does this really matter?).
  • It can be parametrized with OpenFF.

Any other suggestions? How would you select useful molecules?
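A hedged sketch of how those filters might be applied; the (ligand_id, smiles, n_pdb_entries) tuples and the min_entries threshold are assumptions for illustration, not part of any existing script:

from rdkit import Chem
from openff.toolkit.topology import Molecule
from openff.toolkit.typing.engines.smirnoff import ForceField

STANDARD_AA = {"ALA","ARG","ASN","ASP","CYS","GLN","GLU","GLY","HIS","ILE",
               "LEU","LYS","MET","PHE","PRO","SER","THR","TRP","TYR","VAL"}
forcefield = ForceField("openff-2.0.0.offxml")

def keep(ligand_id, smiles, n_pdb_entries, min_entries=2):
    if ligand_id in STANDARD_AA:                # not a standard amino acid
        return False
    rdmol = Chem.MolFromSmiles(smiles)
    if rdmol is None:
        return False
    n_atoms = Chem.AddHs(rdmol).GetNumAtoms()
    if not 5 <= n_atoms <= 50:                  # reasonable size range
        return False
    if n_pdb_entries < min_entries:             # appears in enough structures
        return False
    try:                                        # parametrizable with OpenFF
        # Slow, since it generates AM1-BCC charges; label_molecules() would
        # be a lighter check of SMIRKS typing only.
        offmol = Molecule.from_smiles(smiles, allow_undefined_stereo=True)
        forcefield.create_openmm_system(offmol.to_topology())
    except Exception:
        return False
    return True

# `entries` is assumed to be a list of (ligand_id, smiles, n_pdb_entries)
# tuples built from the Ligand Expo downloads.
selected = [e for e in entries if keep(*e)]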

I'm looking into how to generate a set of amino acid / small molecule pairs.

I'd highly recommend tackling this dataset first, since it (1) is much more immediately valuable, and (2) is a subset of what would be needed for the amino acid / small molecule pairs dataset:

The PDB Chemical Component Dictionary. This is a catch-all of everything anyone has cared enough about to obtain a structure. There are statistics available for how many times each entity appears that can provide an easy way to capture the smallest number of most important residues.

From Ligand Expo I can download SMILES strings for every registered residue (there are over 40,000), and a list of every PDB entry it appears in. From there I can find every amino acid that contacts it and extract them one at a time.

Here's what I would do for prioritizing molecules from the Chemical Components Dictionary (a rough code sketch follows below):

  1. Sort the list by number of times each component appears in the PDB, prioritizing the ones that appear most frequently. If we have to truncate this list, it's best to exclude the very rare molecules. For example, if we needed to drop some molecules out, we would omit those that only occur once, since they are likely experimental ligands, and we're relying on other datasets to cover those.
  2. Apply a max heavy atom cutoff to keep QC calculation times practical.
  3. Use either a force field (e.g. OpenFF, RDKit's MMFF) or a charging model like AM1-BCC (via the OpenFF toolkit) as a filter to cull molecules with elements that are too exotic for a first pass (e.g. platinum).
  4. Expand protonation and tautomer states with RDKit. Many of these molecules are not going to appear in neutral form, or will have multiple tautomers, and I don't believe the CCD distinguishes between them because the focus of the PDB is on heavy atoms. This will add significantly to the number of molecules.

This is pretty similar to what you suggest above, but doesn't exclude standard amino acids (since folks might use this subset on its own and it isn't much work to include them) and adds expansion of protonation / tautomeric states.
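A rough sketch of steps 1 and 2 above; it assumes a hypothetical `ccd_counts` dict mapping each CCD component ID to a (smiles, n_pdb_occurrences) tuple, with steps 3 and 4 applied as further filters on the result:

from rdkit import Chem

MAX_HEAVY_ATOMS = 50  # step 2: keep QC calculation times practical

def heavy_atoms(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return mol.GetNumHeavyAtoms() if mol is not None else float("inf")

# Step 1: most frequently occurring components first; components that occur
# only once are dropped, since they are likely experimental ligands.
ranked = sorted(ccd_counts.items(), key=lambda kv: -kv[1][1])
selected = [(ccd_id, smi) for ccd_id, (smi, n) in ranked
            if n > 1 and heavy_atoms(smi) <= MAX_HEAVY_ATOMS]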

@peastman: For the more advanced small molecule / amino acid pairs dataset, is your goal to extract a whole cluster around each small molecule in the PDB, or just individual pairs of the small molecule directly interacting with a single residue at a time? Is it clear how we should truncate and cap the amino acids to ensure we do not introduce spurious interactions (e.g. by splitting a dipole or creating zwitterions)?

There is a significant literature on methods that extract interacting pairs in this manner and subject them to quantum chemical calculations in order to compute approximations of accurate quantum chemical energies of the entire system, and it may be worth our while to investigate this literature. The Fragment Molecular Orbital (FMO) approach and Molecular Fractionation with Conjugate Caps are examples, but there may be many more.

Just one amino acid at a time. Including the whole environment would produce clusters of hundreds of atoms, which would be too expensive for the level of theory we're using and the number of molecules I want to include.

The PDB Chemical Component Dictionary.

Those datasets serve different purposes. The primary goal of the amino acid / ligand dataset is to provide more information about noncovalent interactions between proteins and ligands. It also would coincidentally provide more sampling of small molecule internal energies, but that's a secondary goal.

Is it clear how we should truncate and cap the amino acids to ensure we do not introduce spurious interactions (e.g. by splitting a dipole or creating zwitterions)?

I plan to handle them with ACE and NME caps, the same as we did for the solvated amino acids.

Those datasets serve different purposes. The primary goal of the amino acid / ligand dataset is to provide more information about noncovalent interactions between proteins and ligands. It also would coincidentally provide more sampling of small molecule internal energies, but that's a secondary goal.

That won't be helpful unless we also have the monomer dataset. The monomer CCD dataset at the same level of theory must come before the dimer dataset.

I plan to handle them with ACE and NME caps, the same as we did for the solvated amino acids.

I'd recommend we take a deeper look at how we can appropriately minimize spurious effects rather than just adopt this strategy without further study. It may be the right approach, but it may not be. The Molecular Fractionation with Conjugate Caps approach looked promising, but I'm not certain there aren't better approaches out there.

That won't be helpful unless we also have the monomer dataset. The monomer CCD dataset at the same level of theory must come before the dimer dataset.

Why?

It's hard to separate the interaction energy from the intramolecular energy for a new species that does not appear in the training set, especially if there are rarer elements present. By having a number of monomer configurations in addition to dimer configurations, it's much easier to learn generalizable intermolecular interactions.

This is especially likely to be true when we have only a single conformation of the pair extracted from a PDB file; I'd expect that a number of monomer snapshots plus a single dimer snapshot would work well for this.

Or were you thinking of generating many snapshots of the pair, somehow restraining their separation to sample relevant interactions?

These molecules contain no new species or rare elements. Many of them are quite similar (or even identical) to things that appear in PubChem. The whole point of general purpose potentials is that the model learns the laws of physics, not the details of individual molecules.

Remember, this is for SPICE version 2. It will still include everything from version 1. We already have a lot of sampling for internal conformations of very diverse molecules. Version 2 will have even more, especially if we're able to include Enamine molecules. But I also want to add more sampling of noncovalent interactions. Only about 1/3 of the conformations in version 1 involve multiple molecules, so there's a need for more.

I have recently been looking into CREST, a tool that uses the xtb GFN methods, which could help with generating some of these datasets.

Remember, this is for SPICE version 2. It will still include everything from version 1. We already have a lot of sampling for internal conformations of very diverse molecules. Version 2 will have even more, especially if we're able to include Enamine molecules. But I also want to add more sampling of noncovalent interactions. Only about 1/3 of the conformations in version 1 involve multiple molecules, so there's a need for more.

Yes, but we're still missing representation of much of what is in the PDB. I'm just claiming we really want coverage of monomers of most of the interesting things in the PDB (and hence CCD) before we start considering dimers.

Since it's difficult to compile the thread and all the comments into a useful working list of datasets, I've started a wiki page here we can update.

Ultimately, once we settle on a format for the metadata we want to capture for each dataset, it would be fantastic to move that metadata into either a README.md, a computer-readable YAML file, or both.

The working document can form the basis of the next manuscript by capturing rationale, detailed methods, etc.

Thanks! What do you think would be a useful format for machine readable metadata?

Level(s) of theory used: mp2/cc-pv[tq]z + D:ccsd(t)/cc-pvdz (high level) ; ωB97M-D3BJ/def2-TZVPPD (SPICE default) ; B3LYP-D3BJ/DZVP (OpenFF default) ; GFN2-xTB (pretraining default)

We haven't actually run any of those, and some of them haven't even been discussed. Let's leave it blank, and fill in levels of theory once they're actually run. There's a need for the community to agree on standard levels of theory so different datasets can be meaningfully combined, but that's a discussion that needs to happen with a much larger group of people.

Thanks! What do you think would be a useful format for machine readable metadata?

A simple approach would be to have a metadata.yaml in each folder, containing a simple rendering of the metadata:

---
dataset:
  name: "water clusters at 300K 1atm"
  rationale: |
    To capture interactions representative of bulk water, we need samples of compact water clusters carved out of snapshots of bulk water at ambient temperature and pressure.
  composition: water molecules only
  strategy: |
    This collection contains 1000 conformations for a cluster of 30 water molecules extracted from bulk water simulations using the amoeba2018.xml force field at 300 K and 1 atm.
  qcfractal_dataset_type: SinglePointDataset
  theory:
  - spice1_default
  - openff_default
  properties:
  - energy
  - gradient
  - wavefunction_coefficients
  owner: peastman

If we standardize or index some items (such as levels of theory or generation strategies), we can just use a keyword to denote the choice we used.
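Tooling could then select datasets by reading the file; a minimal sketch with PyYAML, assuming the layout above:

import yaml

with open("metadata.yaml") as f:
    meta = yaml.safe_load(f)["dataset"]
print(meta["name"])    # "water clusters at 300K 1atm"
print(meta["theory"])  # ['spice1_default', 'openff_default']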

We haven't actually run any of those, and some of them haven't even been discussed. Let's leave it blank, and fill in levels of theory once they're actually run. There's a need for the community to agree on standard levels of theory so different datasets can be meaningfully combined, but that's a discussion that needs to happen with a much larger group of people.

Good points. How about this for a better approach?

First, we index the levels of theory we have in use at the moment and their Psi4 specifications, since these are candidates for choices to run future sets of calculations (entirely, or in part) to ensure we can combine datasets. To date, this includes:

  • SPICE 1.0: ωB97M-D3BJ/def2-TZVPPD
  • OpenFF default: B3LYP-D3BJ/DZVP (SPICE 1.0 was also run at this level of theory and stored in QCArchive)
  • GFN2-xTB: You've generated some pretraining datasets with this, though it may be fast enough that we could just provide a parallel implementation and a way to regenerate the data in-place and cache it instead of distributing it.

Second, we start a thread to discuss candidates for other levels of theory to consider.
We may especially need to involve other levels of theory for solvated ions (which may require much more diffuse basis sets) and metals.
I've also seen significant interest in running a small subset of the data at a high level, such as CCSD(T) with a triple-zeta basis set.

We may also want to associate keywords with each dataset to make it easier to automate selection of datasets, e.g.,

elements: [H, O]
keywords:
- water
- droplet
- interactions
- solvent

Since this repository only contains input configurations, nothing in it is really specific to any level of theory. Maybe it would be better to just give a link to QCArchive?

GFN2-xTB

I suggested it as a possibility, but I don't think we have agreement on whether it's actually useful.

Second, we start a thread to discuss candidates for other levels of theory to consider.

Definitely useful. We need to involve a lot of people in that discussion. The goal is to get the whole community to agree on standard levels of theory to improve consistency between datasets.

I've been looking more into CREST, mentioned earlier by @jthorton. It does seem to do a great deal of what we are interested in, and it uses GFN2-xTB, which may be more reliably close to QM than our MM force fields. Perhaps it's worth another look, even if it does require more filtering?

In https://doi.org/10.1021/acs.jpca.2c02439, OpenFF had roughly half the MAE of GFN2-xTB (see figure 10). That was on long alkanes. It might not do as well on molecules with more complex charge distributions. Still, I wouldn't assume a force field is necessarily less accurate than a semiempirical method.

I decided to try my own experiments. I loaded a few of the PubChem molecules from SPICE and recomputed the energies with OpenFF, GFN-FF, and GFN2-xTB. Here are the results. All values are in kJ/mol. Each row is a molecule and each column is a method.

[Image: 3x3 grid of scatter plots comparing OpenFF, GFN-FF, and GFN2-xTB energies against SPICE for three molecules]

All methods at least have a good correlation. The errors are small compared to the variation between conformations. On two of the three molecules OpenFF and GFN-FF had similar accuracy, while on one molecule GFN-FF was significantly less accurate. In all cases GFN2-xTB was the most accurate of the three methods.

These calculations used OpenFF 1.0.0. In the paper cited above they found it was more accurate than 2.0.0. I tested 1.0.0, 2.0.0, and 2.1.0, and my results agree with theirs. The differences weren't huge, but for every molecule 1.0.0 had the lowest MAE.

As a point of reference, the high energy conformations were generated by running MD at 500K. At that temperature, kT is 4.16 kJ/mol.

Here is the code, if you want to try it yourself.

from openmm import *
from openmm.unit import *
from openff.toolkit.topology import Molecule, Topology
from openff.toolkit.typing.engines.smirnoff import ForceField
from xtb.interface import Calculator, Param, XTBException
from xtb.libxtb import VERBOSITY_MUTED
import matplotlib.pyplot as plot
import numpy as np
import h5py

forcefield = ForceField('openff_unconstrained-1.0.0.offxml')
f = h5py.File('SPICE-1.1.3.hdf5')

def compute_openff_energy(smiles, conformations):
    # Parametrize with OpenFF, then evaluate each conformation (in nm) with
    # OpenMM's Reference platform.
    mol = Molecule.from_mapped_smiles(smiles, allow_undefined_stereo=True)
    fftop = Topology()
    fftop.add_molecule(mol)
    system = forcefield.create_openmm_system(fftop)
    integrator = VerletIntegrator(0.001*picosecond)
    context = Context(system, integrator, Platform.getPlatformByName('Reference'))
    energy = []
    for i in range(conformations.shape[0]):
        context.setPositions(conformations[i])
        energy.append(context.getState(getEnergy=True).getPotentialEnergy().value_in_unit(kilojoules_per_mole))
    return np.array(energy) - np.mean(energy)  # energies relative to the mean

def compute_xtb_energy(smiles, conformations, method):
    # Evaluate each conformation with xtb; xtb expects coordinates in Bohr
    # and returns energies in Hartree.
    mol = Molecule.from_mapped_smiles(smiles, allow_undefined_stereo=True)
    numbers = np.array([a.atomic_number for a in mol.atoms])
    energy = []
    for i in range(conformations.shape[0]):
        calc = Calculator(method, numbers, conformations[i], mol.total_charge.m)
        calc.set_verbosity(VERBOSITY_MUTED)
        res = calc.singlepoint()
        energy.append((res.get_energy()*hartree/item).value_in_unit(kilojoules_per_mole))
    return np.array(energy) - np.mean(energy)

def plot_energies(energy1, energy2, molecule, potential):
    plot.scatter(energy1, energy2)
    lower = min(np.min(energy1), np.min(energy2))
    upper = max(np.max(energy1), np.max(energy2))
    plot.plot([lower, upper], [lower, upper], 'k')
    mae = np.mean(np.abs(energy1-energy2))
    plot.xlabel(f'{potential} MAE={mae:.4}')
    plot.ylabel('SPICE')
    plot.title(molecule)

plot.figure(figsize=(10,10))
# SPICE stores conformations in Bohr and energies in Hartree.
for i, g in enumerate(list(f)[:3]):
    group = f[g]
    conf_in_bohr = np.array(group['conformations'])
    conf_in_nm = (conf_in_bohr*bohr).value_in_unit(nanometers)
    spice_energy = (np.array(group['formation_energy'])*hartree/item).value_in_unit(kilojoules_per_mole)
    spice_energy -= np.mean(spice_energy)
    smiles = group['smiles'].asstr()[0]
    openff_energy = compute_openff_energy(smiles, conf_in_nm)
    gfnff_energy = compute_xtb_energy(smiles, conf_in_bohr, Param.GFNFF)
    xtb_energy = compute_xtb_energy(smiles, conf_in_bohr, Param.GFN2xTB)
    plot.subplot(3, 3, 3*i+1)
    plot_energies(openff_energy, spice_energy, g, 'OpenFF')
    plot.subplot(3, 3, 3*i+2)
    plot_energies(gfnff_energy, spice_energy, g, 'GFN-FF')
    plot.subplot(3, 3, 3*i+3)
    plot_energies(xtb_energy, spice_energy, g, 'GFN2-xTB')
plot.tight_layout()
plot.show()

This is a very informative little experiment @peastman! It's interesting that although GFN2-xTB is slightly better, it doesn't appear to be by much. Most importantly, you're not seeing the behavior that @jchodera mentioned where SPICE conformations that are predicted to be relevant/low energy on the MM surface are high energy on the QM surface. A couple of questions:

  1. @jchodera, do you recall some specific examples of this bad MM behavior you were describing (ex. double bonds)? Is it possible that some of these cropped up in the solvated AA dataset because of the TIP3P water model?
  2. @peastman, how much more expensive are the GFN2-xTB calculations?

This is great! These data anecdotally suggest that using GFN2-xTB for minimization or dynamics (or possibly a couple of steps of minimization if we cannot afford to use it for dynamics) is our best shot at ensuring we get close to the QM energy surface.

If the RMSE is only ~ 1-2 kT as the MAE for these few molecules suggests, then this is really quite good.

Did you happen to compute the statistics over the whole dataset, rather than just a few examples? Our OpenFF 2.x RMSE was much higher over the PubChem dataset (RMSE 4.43 kcal/mol OpenFF 2.0; 4.71 kcal/mol OpenFF 2.1; 4.64 kcal/mol GAFF2.11).

In terms of understanding the differences with your numbers, there are a number of differences:

  • You computed statistics for a few molecules; we computed statistics for the whole dataset
  • You computed MAE; we computed RMSE (which will naturally be somewhat larger since it gives more weight to larger outliers)
  • You used the ωB97M-D3BJ/def2-TZVPPD data; we used B3LYP-D3BJ/DZVP
  • You reported statistics for OpenFF 1.x; we assessed the more recent OpenFF 2.x
  • Some other difference in how we're doing the calculations?

cc @kntkb @yuanqing-wang

@jchodera, do you recall some specific examples of this bad MM behavior you were describing (ex. double bonds)? Is it possible that some of these cropped up in the solvated AA dataset because of the TIP3P water model?

I'll see if we can pull out a sorted list of the largest outlier geometries.

Did you happen to compute the statistics over the whole dataset, rather than just a few examples?

Just a few examples.

how much more expensive are the GFN2-xTB calculations?

Compared to OpenFF, about 1000x slower. Computing forces for a small molecule takes a few tens of milliseconds. On the other hand, OpenFF requires an expensive parametrization (~10 seconds) before starting the simulation. For generating conformations, I think it's practical to use GFN2-xTB for minimizations but not for simulations.
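For the minimization use case, a short sketch of what this could look like, assuming xtb-python's ASE calculator and ASE's BFGS optimizer (`atoms` is an existing ase.Atoms object for the molecule):

from ase.optimize import BFGS
from xtb.ase.calculator import XTB

atoms.calc = XTB(method="GFN2-xTB")
BFGS(atoms).run(fmax=0.05, steps=20)  # a few steps rather than full convergence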

One other point to consider is that for peptides we can use Amber, which is much more accurate than OpenFF, and probably even more accurate than GFN2-xTB. This might be relevant to #72, which has amino acids interacting with ligands. The current code there uses Amber for the amino acid and OpenFF for the ligand.

Again, our data suggested that ff14SB is not any better than OpenFF for the SPICE dipeptide dataset in relation to QM (comparing to the B3LYP-D3BJ/DZVP level of theory):

  • ff14SB: 4.44 [4.29, 4.57] kcal/mol RMSE
  • GAFF 2.11: 4.28 [4.15, 4.42] kcal/mol RMSE
  • OpenFF 2.0.0: 4.15 [4.01, 4.29] kcal/mol RMSE
  • OpenFF 2.1.0: 4.35 [4.22, 4.50] kcal/mol RMSE

The numbers in brackets are 95% CIs from bootstrapping over the molecules, to account for dataset size limitations. These are mostly within statistical error of each other.

As I recall, your earlier experiment suggesting ff14SB was close to ωB97X-D3/def2-TZVP was (1) just alanine dipeptide, and (2) on energy-minimized conformations.

I'm looking into building a subset of molecules from Ligand Expo. Could you clarify this suggestion?

Expand protonation and tautomer states with RDKit.

I can use TautomerEnumerator to identify tautomers for a molecule, but I can't find any function to do the same for protonation. How would one do that?
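For reference, the tautomer side looks roughly like this with RDKit (the SMILES is just an illustration):

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

enumerator = rdMolStandardize.TautomerEnumerator()
mol = Chem.MolFromSmiles('Oc1ccccn1')  # 2-hydroxypyridine / 2-pyridone
tautomers = enumerator.Enumerate(mol)
print([Chem.MolToSmiles(t) for t in tautomers])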

Running TautomerEnumerator on all the molecules from Ligand Expo produces output like this:

[15:05:21] Tautomer enumeration stopped at 1000 tautomers: max tautomers reached
[15:05:21] Tautomer enumeration stopped at 1000 tautomers: max tautomers reached
[15:05:21] Tautomer enumeration stopped at 1000 tautomers: max tautomers reached
[15:05:22] Tautomer enumeration stopped at 522 tautomers: max transforms reached
[15:05:22] Tautomer enumeration stopped at 918 tautomers: max transforms reached

In other words, it produces truly vast numbers of tautomers. It wouldn't just increase the number of molecules. It would increase it by 2-3 orders of magnitude. And it would be even higher if RDKit weren't stopping when it hit a limit.

If we want to do this, we need to find some better way of selecting tautomers. Is there a way to identify just a few per molecule that in some way are most useful?

Here are some more precise numbers. If we include all molecules with up to 100 atoms that pass some basic filters, that gives us 38,289 molecules with a median size of 43 atoms. If we allow up to four tautomers for each one, that gives us 120,047 with a median size of 44 atoms. Even with only one conformation per molecule that will take a significant amount of computation. Including multiple protonation states, more than four tautomers, or multiple conformations per molecule could easily exceed what's practical to do.

I could just select four tautomers at random, but is there a better way?

@jchodera do you have any thoughts on the above questions?

@peastman: My apologies for the delay in response!

The OpenFF Toolkit Molecule object includes methods for enumerating protonation states and then subsequently enumerating tautomers that handle the appropriate calls to underlying toolkits (RDKit or OpenEye) for you. However, there is a bit of trickiness in its use---see openforcefield/openff-toolkit#1464.

Note that QCSubmit can do this for you automatically.

Here are some more precise numbers. If we include all molecules with up to 100 atoms that pass some basic filters, that gives us 38,289 molecules with a median size of 43 atoms. If we allow up to four tautomers for each one, that gives us 120,047 with a median size of 44 atoms. Even with only one conformation per molecule that will take a significant amount of computation. Including multiple protonation states, more than four tautomers, or multiple conformations per molecule could easily exceed what's practical to do.

We clearly don't need every combination of

[molecule] x [protonation state] x [tautomer] x [MD snapshots]

but it's similarly foolish to ignore the fact that protonation states exist and can considerably change the properties of a molecule.

What about a strategy of expanding protonation/tautomer states for a subset of molecules, with a highly reduced number of low-energy conformers from each set? Note that a protonation state change will change the energy surface substantially. In fact, we might be able to kill two birds with one stone by generating an additional protonation/tautomer set that only includes an OptimizationDataset of

[subset of molecules with multiple protonation states] x [protonation state] x [tautomer] x [up to 3 initial conformers]

This would allow us to control the subset size while still sampling multiple protonation states and low-lying energies of multiple conformers.

How do you suggest selecting the states to use? We could of course just pick a few at random, but is there a better way of selecting ones that are most useful/representative/realistic?

Enumerating protonation states appears to only be supported with OpenEye, not RDKit.

OK, new proposal: If you can get me an SDF with each molecule enumerated in a few conformations, I can run it through Epik to expand protonation/tautomer states predicted to have solution phase free energies up to a cutoff (say 6 kT). That will limit the protonation/tautomer states to only the most accessible. This use is covered by our institutional Schrödinger license.
For reference, the command is:

$SCHRODINGER/utilities/structconvert input.sdf input.mae
$SCHRODINGER/epik -WAIT -ms 16 -ph 7.3 -p {min_population} -imae input.mae -omae output.mae
$SCHRODINGER/utilities/structconvert output.mae output.sdf

which will limit us to 16 microstates and include states down to a minimum population of min_population (e.g. a 6 kT cutoff corresponds to min_population = exp(-6) ≈ 0.0025).

The protonation states will be enumerated in the same geometry as the original conformer, so we have the added benefit that we will have snapshots in the same conformation with various protonation states. Ideally, we can optimize them away from that for a few QM steps as well.

The protonation states will be enumerated in the same geometry as the original conformer, so we have the added benefit that we will have snapshots in the same conformation with various protonation states.

I'm not sure that's a benefit. If you take a realistic conformation for a molecule, then change the charges and rearrange a few pieces, you may end up with a very unrealistic conformation. It might be better to generate a new conformation for each one.

If I post a short list of SMILES strings, can you run each one through it and see how many states it would select? I want to see whether that's an effective way of filtering them.

I'm not sure that's a benefit. If you take a realistic conformation for a molecule, then change the charges and rearrange a few pieces, you may end up with a very unrealistic conformation. It might be better to generate a new conformation for each one.

Most conformer generators use rotatable torsions to generate plausible molecular conformations. The precise protonation state does not generally impact these conformations, but we can also filter out very high energy ones if this is a concern.

If I post a short list of SMILES strings, can you run each one through it and see how many states it would select? I want to see whether that's an effective way of filtering them.

Sure!

Here are the first 10. All of them are neutral. RDKit identifies varying numbers of tautomers for them, ranging from 1 to over 1000.

COC(=O)O
COc1cc(cc(c1OC)OC)C(C(=O)N2CCCCC2C(=O)OC(CCCc3ccccc3)CCCc4cccnc4)(F)F
CCC(C)C(C(=O)NC(CC(C)C)C(=O)O)NC(=O)C(Cc1ccccc1)CC(=O)NO
CC(C)CN1c2c(c(n(n2)Cc3cccc4c3cccc4)c5ccncc5)C(=O)N(C1=O)C
c1ccc(cc1)C(C(=O)O)N
c1ccc(cc1)CC(C(C(=O)O)O)N
Cc1cccc(c1OCC(=O)NC(Cc2ccccc2)C(C(=O)N3CSC(C3C(=O)NC4c5ccccc5CC4O)(C)C)O)C
c1ccc(cc1)C2(CCCC2)CN
c1ccc2c(c1)CC(N(C2)C(=O)CC(Cc3ccccc3F)N)C(=O)N
CN(C)c1cccc(c1)CNCC(C(Cc2ccccc2)NC(=O)C3(CN(C(=O)N3)Cc4ccccc4)Cc5ccccc5)O

And here are a few charged or polar ones.

C[n+]1ccc(cc1CNC(=O)C2CCCN2C(=O)C(Cc3ccccc3)N)Cl
C(CC(=O)C[N+]#N)C(C(=O)O)N
c1cc(c(cc1C(F)(F)F)[N+](=O)[O-])NCc2ccsc2
c1cc(c(cc1[N+](=O)[O-])[N+](=O)[O-])NCCC(C(=O)[O-])[NH3+]

I used a script that uses (for convenience) the OpenFF Toolkit, OpenEye Toolkit, and Schrödinger Suite to

  • Read the SMILES
  • Enumerate unspecified stereochemistry (Note 1)
  • Enumerate conformers
  • Enumerate protonation states for each conformer

Note 1: We will want to restrict this to unique sets of diastereomers---I haven't yet implemented this.

We will also want to cap the number of unique diastereomers we consider, as well as unique protonation states.
Currently, I limited this to 3 stereoisomers (though some may be enantiomers), 3 conformers (ensuring >1 Å RMSD between them), and 16 protonation states capping the predicted minimum population at exp(-6).

Here's an SDF file containing the output:
protonation-states.zip

I can regenerate with different limits, of course, but I think this should at least illustrate the concept I'm proposing. It would be useful to initiate a few steps of gradient descent from a consistent conformation where a small number of accessible protonation/tautomer states are enumerated.

Taking just a few steps of optimization from each conformer should now be possible thanks to this change.

We don't need to do this on all molecules---just a representative subset. And we can use whatever strategy you think is best to enumerate conformers---we just have to pick some representative conformers to enumerate protonation states. But I do think a few steps of gradient descent started from consistent conformers and different protonation/tautomer states will be highly informative.

Currently, I limited this to 3 stereoisomers (though some may be enantiomers), 3 conformers (ensuring >1 Å RMSD between them), and 16 protonation states capping the predicted minimum population at exp(-6).

Can you do it without limiting the number? The question is whether filtering on population is effective. Some of these molecules have hundreds of tautomers, and adding protonation states will make the number even larger. Does the population filter bring it down to a reasonable number, or are we still mostly just picking a few at random?

Instead of imposing a cutoff of the population, what about sorting all of them and selecting the 3 or 4 most populated ones?

We didn't hit the 16-state limit for any of the compounds.

Here's what we had with the exp(-6) cutoff---not many states:

2 protonation states
2 protonation states
1 protonation states
2 protonation states
2 protonation states
2 protonation states
2 protonation states
2 protonation states
2 protonation states
2 protonation states
2 protonation states
2 protonation states
1 protonation states
2 protonation states

Instead of imposing a cutoff of the population, what about sorting all of them and selecting the 3 or 4 most populated ones?

What's our goal here? I think it's important to be able to represent physically reasonable states with good accuracy, though we should probably have some additional protonation states that are much less likely to be populated as "negative" training examples or else the model may extrapolate poorly in that regime.

In that sense, you may be right that taking the ~4 most populated states will sometimes capture only a few of the populated states, and sometimes include essentially unpopulated ones. I'll regenerate and provide an updated set of examples for this set.

The goal is to provide data on a very diverse set of molecules that covers a range of chemical space users are interested in. But we also need to limit the number of molecules to stay within the computational budget. Ligand Expo provides us with about 38,000 molecules. You suggested including all tautomers and protonation states for them, but that would bring us up into the ~10 million molecule range, way beyond what it's practical to do. We need to select which of those 10 million molecules to actually compute, hopefully in some way that reflects which ones are most relevant to users.

We didn't hit the 16-state limit for any of the compounds.

I'm not clear on exactly what calculation you did. If a molecule has 2 protonation states and 500 tautomers, that's 1000 states total. Did you calculate the free energies for all 1000 of them? How many pass the cutoff?

How fast is this analysis? Is it practical to run it on 10 million molecules to use as a filter?

The goal is to provide data on a very diverse set of molecules that covers a range of chemical space users are interested in. But we also need to limit the number of molecules to stay within the computational budget. Ligand Expo provides us with about 38,000 molecules. You suggested including all tautomers and protonation states for them, but that would bring us up into the ~10 million molecule range, way beyond what it's practical to do. We need to select which of those 10 million molecules to actually compute, hopefully in some way that reflects which ones are most relevant to users.

I agree we'll need to select a useful subset. I'll work out a way to do this.

I'm not clear on exactly what calculation you did. If a molecule has 2 protonation states and 500 tautomers, that's 1000 states total. Did you calculate the free energies for all 1000 of them? How many pass the cutoff?

Apologies for the confusing terminology---I should have said that most of these have 2 microstates within the exp(-6) population threshold. Here, microstates = combination of protonation/tautomeric states.

Epik is slow (it takes several seconds per molecule), but we have lots of CPU threads. If we wanted to run it on lots of molecules, it would just take a few days.

That's not bad then. We're probably looking at under 100,000 molecules. I suggest we use a sequence of two scripts. The first one will generate the states, apply the free energy filter, and write out SMILES strings to a file. Once we see how many molecules we're dealing with, we can decide how many conformations to include for each one. A second script will read in the first file, generate the conformations, and create the HDF5 file.
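Roughly, the first script could wrap the Epik commands above like this (a sketch only; file names and the exp(-6) ≈ 0.0025 population cutoff are placeholders):

import os
import subprocess
from rdkit import Chem

schrodinger = os.environ['SCHRODINGER']
# Convert to Maestro format, run Epik, convert back (as in the commands above).
subprocess.run([f'{schrodinger}/utilities/structconvert', 'input.sdf', 'input.mae'], check=True)
subprocess.run([f'{schrodinger}/epik', '-WAIT', '-ms', '16', '-ph', '7.3',
                '-p', '0.0025', '-imae', 'input.mae', '-omae', 'output.mae'], check=True)
subprocess.run([f'{schrodinger}/utilities/structconvert', 'output.mae', 'output.sdf'], check=True)

# Write one SMILES string per accepted state for the second script to consume.
with open('states.smi', 'w') as out:
    for mol in Chem.SDMolSupplier('output.sdf', removeHs=False):
        if mol is not None:
            out.write(Chem.MolToSmiles(mol) + '\n')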

I can create a draft PR with an outline of the first script, if that sounds reasonable.

This approach sounds reasonable!

The first one will generate the states, apply the free energy filter, and write out SMILES strings to a file.

Since this requires running Epik, do you want me to tackle this script, or did you want to create an outline and then I finish it up and run it?

I'll create an outline.

If we had infinite computational resources, that could be a reasonable approach. But we can only compute a very limited number of conformations. That means we need to choose them carefully so that every conformation adds as much information as possible. Computing two ~100 atom molecules that differ only in a single hydrogen, both in exactly the same conformation, makes very inefficient use of our resources. Most atoms will have nearly identical environments and forces in both of them. Choosing different conformations will contribute much more information content to the dataset.

I created a project board for SPICE 2: https://github.com/orgs/openmm/projects/2/views/1

What you're suggesting might work better as a separate dataset. We could create a collection of very small molecules, maybe around 10 atoms each, and exhaustively try all variations on each one. They'll be fast enough and the number of variations will be small enough that computation time won't be an issue. And for very small molecules, changing a single atom affects the environment of every other atom, so it will provide meaningful information for all of them.

Inspired by the AIMNet2 dataset, we could also include molecules with As and Se from PubChem, although I have no idea how common they will be.

Splinter, a dataset of protein-ligand interactions. Inputs can be found here.

It would be nice to have data to create a protein FF

We had dipeptides and solvated amino acids in SPICE 1.0. In addition, single points on the final geometries from OpenFF's 2D torsion scans of the (chi1, chi2) and (phi, psi) dihedral pairs of capped amino acid chains could be of interest. Here are some datasets created by Chapin, which were all optimized at the B3LYP-D3BJ/DZVP level on a 24x24 2D grid. Just re-evaluating the energies of the final geometries at the SPICE DFT level could be considered.

Some 1D torsion scans that might be of interest

There were also some RNA datasets created by Ken Takaba which were used in training Espaloma.

SPICE 2 is almost finished. It will hopefully just take a few more weeks to complete the calculations. I've opened a new issue (#92) for SPICE 3. Let's move this discussion there.

@peastman : I know you've already included ligand:amino acid pairs in #72, but I just came across this preprint that mentions two interesting datasets:

  • The Splinter dataset is a collection of approximately 1.7 million systematically generated protein-ligand fragment dimers and interaction energies computed using many-body SAPT based on a Hartree-Fock (HF) representation of monomers (i.e. SAPT0). These and all other SAPT computations carried out in this work are performed in an aug-cc-pV(D+d)Z basis set (abbreviated aDZ), which yields good error cancellation.
  • SAPT-PDB-13K: A diverse, realistic dataset of dimers. The 13,216 dimers in SAPT-PDB-13K consist of an entire ligand interacting with one or two capped amino acids. The protein and ligand geometries are taken from crystallographic Protein Data Bank (PDB) entries, making them meaningful and practical test cases


The SAPT-PDB-13K seems to be pending deposition somewhere:
"Electronic Supplementary Information (ESI) available: SAPT0 interaction energies and Cartesian coordinates of the 13,216 validation set dimers and nine protein- ligand matched pairs. See DOI: 00.0000/00000000"

Can you post those on the thread for SPICE 3 (#92)? We're collecting ideas for it now.

And the calculations for SPICE 2 are DONE! I'll take a few days to run some tests, and then I can hopefully release it next week.

And it is now released!