openmm / spice-dataset

A collection of QM data for training potential functions

What level of theory to use

peastman opened this issue

We need to select a level of theory that gives us the best possible accuracy while keeping within our computation budget. https://www.tandfonline.com/doi/full/10.1080/00268976.2017.1333644 seems to be the most comprehensive study of DFT functionals currently available. It was published in 2017, but I'm told the results are still basically up to date.

Based on that article and my own tests, I suggested ωB97M-V/def2-TZVPPD. However, I'm told that Psi4 can't currently compute forces with VV10. We should contact the developers and see if they're planning to add that feature. If not, ωB97M-D3BJ would be the obvious substitute.
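For concreteness, here is a rough sketch of what a single force calculation might look like through Psi4's Python API, assuming the installed build accepts the "wB97M-D3BJ" functional string and the def2-TZVPPD basis; the water geometry and resource settings are only placeholders, not anything decided for this dataset.

```python
# Minimal sketch of one gradient (force) calculation with Psi4's Python API.
# Assumes the installed Psi4 build accepts the 'wB97M-D3BJ' functional string
# and the def2-TZVPPD basis; the water geometry is just a placeholder.
import psi4

psi4.set_memory("8 GB")
psi4.set_num_threads(8)

mol = psi4.geometry("""
0 1
O   0.000000   0.000000   0.117300
H   0.000000   0.757200  -0.469200
H   0.000000  -0.757200  -0.469200
""")

psi4.set_options({"basis": "def2-tzvppd"})

# Energy and analytic gradient; forces are the negative gradient.
grad, wfn = psi4.gradient("wB97M-D3BJ", molecule=mol, return_wfn=True)
energy = wfn.energy()          # Hartree
forces = -grad.to_array()      # Hartree/Bohr, shape (n_atoms, 3)
print(energy, forces)
```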

I opened an issue on the Psi4 repo to ask about VV10. I'll report back what they say.

They say they'd like to have the feature, but no one has committed to implement it.

@peastman : Can you cc that psi4 issue here?

Here is the paper for ωB97M-D3BJ (https://pubs.acs.org/doi/10.1021/acs.jctc.8b00842), which replaces the -V dispersion correction with the cheaper -D3(BJ). If this functional really does have accuracy similar to ωB97M-V, as the authors state in the abstract, then it seems fine to proceed with it as a substitute.

Thanks! That's a very helpful article.

The part in the abstract about not affecting accuracy was comparing whether you apply VV10 as part of the SCF calculation or as an additive correction afterward. They concluded it makes hardly any difference. They found D3(BJ) is less accurate than VV10, though still pretty good. The key numbers are in Table 5, where they compare the accuracy of the two for several functionals and different subsets of the database.

That's correct. In the abstract, the authors did have a comment on the -D3BJ ones as well:
"We also present new DFT-D3(BJ) based counterparts of these two methods and of ωB97X-V [J. Chem. Theory Comput 2013, 9, 263], which are faster variants with similar accuracy."
Here "these two methods" refer to B97M-V and ωB97M-V.

Table 5 is indeed very informative. On one hand, it shows that the cheaper substitute, ωB97M-D3(BJ), is indeed less accurate than the original ωB97M-V, especially for the "intermolecular noncovalent interactions" category, where the functionals with the actual -V correction work markedly better. On the other hand, ωB97M-D3(BJ) outperforms ωB97X-D3(0) (the -D3 functional I previously recommended based on my slightly outdated knowledge, shown in the last column) in every category. So it seems to be the best functional we can choose apart from the -V ones, and ωB97X-D3(0) should probably no longer be considered.

VV10 adds a lot to the cost of the calculations, no? I've been told by a student of Van Voorhis that even they don't run calculations with their dispersion correction because it's too expensive.

If we really want something highly accurate, it would be worth considering building the full data set with a "cheap" level of theory, and then using some active learning method to find a minimal subset that is sufficient to accurately learn the full data set.
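As an illustration of what such a selection step could look like, here is a minimal query-by-committee style sketch: train a small ensemble on the cheap labels, then keep the geometries where the ensemble disagrees most and recompute only those. The array shapes, the 5% cutoff, and the random stand-in predictions are all assumptions for illustration.

```python
# Sketch of ensemble-disagreement (query-by-committee) subset selection.
# Assumes `ensemble_preds` holds per-model energy predictions for every
# geometry in the cheap dataset; the cutoff fraction is a placeholder.
import numpy as np

rng = np.random.default_rng(0)
n_geoms, n_models = 100_000, 5
ensemble_preds = rng.normal(size=(n_models, n_geoms))  # stand-in for real predictions

# Disagreement = standard deviation across the ensemble for each geometry.
disagreement = ensemble_preds.std(axis=0)

# Keep the most uncertain 5% for recomputation at the expensive level of theory.
n_select = int(0.05 * n_geoms)
selected = np.argsort(disagreement)[-n_select:]
print(f"recompute {selected.size} geometries at the higher level of theory")
```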

The original ANI used something like 20 million data points and covered only 4 elements. We're looking to use around 13 elements and include charged systems too. Just as a ballpark estimate, we're looking at something like 100 million data points if we design the data set by hand (i.e. no machine learning to guide us on which molecules and geometries add the most information). ANI used an approach where they found a minimal subset of their 20 million points and recomputed that subset with CCSD(T). It came to something like a few million data points at CC quality to achieve similar accuracy, which is a ton cheaper than running all 20 million with CC.

The benefit here is once we determine a minimal subset we can always recompute the data in however many expensive levels of theory we want to add later.

ANI only includes energies, not forces, and only molecules with up to 8 heavy atoms. Since we'll be including forces too, and most of our molecules will be much larger, our dataset should have about an order of magnitude more information content than ANI.
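To make the information-content point concrete: each conformation contributes one energy label but 3N force components, so a training loss that also matches forces extracts far more signal per QM calculation. Below is a hedged PyTorch-style sketch of such a combined loss; the model interface, weighting, and units are assumptions for illustration, not anything chosen for this project.

```python
# Sketch of a combined energy + force loss, to illustrate why force labels add
# so much information: one energy vs. 3N force components per conformation.
# The `model` interface and the loss weights are assumptions.
import torch

def energy_force_loss(model, positions, target_energy, target_forces,
                      w_energy=1.0, w_forces=10.0):
    # Differentiate the predicted energy with respect to positions to get forces.
    positions = positions.clone().requires_grad_(True)
    pred_energy = model(positions)                 # scalar energy for one conformation
    pred_forces = -torch.autograd.grad(pred_energy, positions, create_graph=True)[0]
    loss_e = (pred_energy - target_energy).pow(2)          # 1 label
    loss_f = (pred_forces - target_forces).pow(2).mean()   # 3N labels
    return w_energy * loss_e + w_forces * loss_f
```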

Agreed about eventually computing a subset with a higher accuracy method. The most data efficient way is to train a model on the full dataset, then fine tune it on the higher accuracy subset.
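A rough sketch of that two-stage recipe: pretrain on the large cheap-theory dataset, then fine-tune the same weights on the small high-accuracy subset with a reduced learning rate. The loaders, loss function, and hyperparameters are placeholders.

```python
# Sketch of pretraining on cheap-theory labels, then fine-tuning the same
# model on a small, higher-accuracy subset. All names and hyperparameters
# here are placeholders.
import torch

def train(model, loader, lr, epochs, loss_fn):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            opt.zero_grad()
            loss = loss_fn(model, batch)
            loss.backward()
            opt.step()

# Stage 1: full dataset labeled with the cheaper functional.
# train(model, cheap_loader, lr=1e-3, epochs=100, loss_fn=loss_fn)
# Stage 2: small subset relabeled at the higher level of theory.
# train(model, accurate_loader, lr=1e-4, epochs=20, loss_fn=loss_fn)
```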

Also, didn't we drop the idea of doing VV10 since it's not available in Psi4?

Correct.

@peastman ANI definitely included forces. Maybe they didn't release them in their data set at one point, but I'm positive they trained their model with forces, and I'm pretty sure you can get the forces for their data set from the publication in Scientific Data. Maybe I'm overshooting it with 100 million, but I don't think it will take as little data as you might think.

@tmarkland what I'm saying is that once you find a minimal subset, you can always rerun that subset with whatever level of theory you want later. If Psi4 eventually adds gradients for VV10, then recalculating the subset to fine-tune the model won't be so expensive to run.

Here's the paper on ANI.

https://www.nature.com/articles/sdata2017193

They never mention forces, only energies. And Figure 2 describes the structure of the data files, which clearly shows no forces.
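One way to settle this kind of question is to list every dataset stored in one of the released HDF5 files and check whether any force or gradient arrays are present. A small h5py sketch; the file name is a placeholder and no particular group layout is assumed.

```python
# Quick check of which arrays an HDF5 data file actually ships
# (e.g. whether any force/gradient datasets are present).
# The file name below is a placeholder.
import h5py

def list_datasets(path):
    with h5py.File(path, "r") as f:
        f.visititems(lambda name, obj: print(name, obj.shape)
                     if isinstance(obj, h5py.Dataset) else None)

list_datasets("ani_gdb_s01.h5")
```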

As another comparison, OrbNet Denali includes 2.3 million conformations, again including only energies, and 17 different elements. Yet they manage to train a model to quite good accuracy. I suspect the problem with ANI is that they were using an old style of model that doesn't work as well. Modern architectures are a lot more data efficient.

I checked the original ANI paper (not the data set one), and maybe they didn't train with forces after all, since they don't seem to mention it. So I could be wrong and they really didn't use forces, but I'm still skeptical of that.

OrbNet has really high transferability because they parameterize on orbitals instead of atoms, but that's going to be more costly. I would think OrbNet would be too expensive for the kinds of simulations we want to be able to run.

Agreed about OrbNet being too expensive. Our target is probably going to be equivariant models along the lines of PaiNN or Gianni's equivariant transformer. Those models tend to be a lot more data efficient than ANI. https://arxiv.org/pdf/2108.02913.pdf is an example of a recent paper that specifically looked at that for an equivariant model. They conclude it only needs about 1-10% as much training data as an invariant model.

Are you sure they randomly sampled the ANI-1 data set in that paper when they report how many geometries they used? They reported using 2 million geometries and achieving an accuracy of 0.65 kcal/mol/atom on the test set, but they have the advantage that the entire data set of 20 million geometries had already been calculated for them. It's much easier to find a subset of the data that maintains high accuracy when you can use some kind of active learning approach. The problem is that you already need the 20 million calculations done before you can easily do that.

For the hydrogen combustion dataset, they specify that the subset was chosen randomly. For ANI, they don't say how it was chosen. That probably means it was also done randomly (if they were using an optimally chosen subset, I'd expect them to say so), but I don't know for sure.

Closing since version 1 is now released.