openmm / spice-dataset

A collection of QM data for training potential functions

Samples with extremely large forces

raimis opened this issue

The dataset contains samples with forces ~7 orders of magnitude larger than the average force. The force distribution of the SPICE data set:
[Figure: SPICE force distribution]

For reference, the force distribution of the ANI-1x data set:
[Figure: ANI-1x force distribution]
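For anyone who wants to reproduce this kind of comparison, here is a rough sketch of how the per-sample force magnitudes could be computed. It assumes the gradients have already been loaded into a NumPy array of shape (n_samples, n_atoms, 3); the function names are hypothetical and are not part of the SPICE tooling.

```python
import numpy as np


def max_force_per_sample(gradients):
    """Largest per-atom force norm in each sample.

    gradients: array of shape (n_samples, n_atoms, 3). Forces are the
    negative gradient, so the norms are the same either way.
    """
    gradients = np.asarray(gradients, dtype=float)
    per_atom_norm = np.linalg.norm(gradients, axis=-1)  # (n_samples, n_atoms)
    return per_atom_norm.max(axis=-1)                   # (n_samples,)


def orders_above_median(max_forces):
    """log10 ratio of each sample's max force to the median max force."""
    max_forces = np.asarray(max_forces, dtype=float)
    return np.log10(max_forces / np.median(max_forces))


# Synthetic data only: 3 samples with 2 atoms each.
demo = np.zeros((3, 2, 3))
demo[0, 0] = [3.0, 4.0, 0.0]   # norm 5
demo[1, 1] = [0.0, 0.0, 1.0]   # norm 1
demo[2, 0] = [0.0, 6.0, 8.0]   # norm 10
demo_max = max_force_per_sample(demo)
demo_orders = orders_above_median(demo_max)
```

A histogram of `demo_orders` (for example with `np.histogram`) on real data would make outliers several orders of magnitude above the median easy to spot.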

I don't see the utility for such extreme samples:

  • They will ruin the training of NNPs (@peastman is already filtering them out in his training runs).
  • They will ruin benchmarks, as the majority of NNPs probably won't be fitted for such extremes.

So people will have to filter. It is guaranteed that this will be done in different ways, so nobody will be able to say that they used the SPICE dataset without an asterisk.

I'm not sure if it's possible to remove samples from the data on QCArchive? We could add an option to the downloader script to filter them at that point.

These strained samples happened by accident, presumably from differences between the DFT functional and the MD force field used to generate conformations. But in the future we may intentionally add more of them. If we want to be able to train reactive potentials, we'll need to add data where bonds are in the process of being formed. Those will have very high forces. We need to come up with a strategy for handling this.

I think the most sensible option is to (1) add an option to the download script to filter (perhaps enabled by default with a sensible threshold), and (2) prepare the default HDF5 file where we've done the sensible thing and removed problematic snapshots with very high forces.
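As a rough illustration of what such a filter could look like, here is a minimal sketch. It is not the downloader's actual code: the function name, the argument names, and the threshold value are all hypothetical, and the threshold is in whatever units the gradients are stored in.

```python
import numpy as np


def filter_by_max_force(conformations, energies, gradients, max_force):
    """Drop samples whose largest per-atom force norm exceeds max_force.

    conformations: (n_samples, n_atoms, 3)
    energies:      (n_samples,)
    gradients:     (n_samples, n_atoms, 3)
    max_force:     hypothetical threshold, same units as gradients
    Returns the filtered arrays plus the boolean mask of kept samples.
    """
    gradients = np.asarray(gradients, dtype=float)
    per_atom_norm = np.linalg.norm(gradients, axis=-1)  # (n_samples, n_atoms)
    keep = per_atom_norm.max(axis=-1) <= max_force      # (n_samples,)
    return (
        np.asarray(conformations)[keep],
        np.asarray(energies)[keep],
        gradients[keep],
        keep,
    )


# Synthetic data only: the second sample carries an extreme force.
gradients = np.zeros((3, 2, 3))
gradients[0, 0] = [0.01, 0.0, 0.0]
gradients[1, 1] = [0.0, 50.0, 0.0]
gradients[2, 0] = [0.0, 0.02, 0.0]
conformations = np.zeros((3, 2, 3))
energies = np.array([-1.0, -2.0, -3.0])

kept_conf, kept_e, kept_g, keep = filter_by_max_force(
    conformations, energies, gradients, max_force=1.0  # arbitrary value
)
```

Returning the mask alongside the filtered arrays lets users report exactly which samples were dropped, which would help with the "asterisk" concern raised above.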

#41 changes the behavior of the downloader to filter out samples with large forces by default.