brucefan1983 / GPUMD

Graphics Processing Units Molecular Dynamics

Home Page: https://gpumd.org/dev

dump_observer average integration test failing

elindgren opened this issue

The integration test for dump_observer:average is failing: the predicted averages and forces almost match the reference values, but not quite.

I dug a bit deeper into this, and I've been able to determine a few things:

  1. If the same potential is specified twice (i.e. potential nep0.txt listed twice in the run.in file), then the average is computed correctly (it is identical to the prediction of either model).
  2. If the same potential is copied to another file (i.e. potential nep0.txt and potential nep0_copy.txt in run.in), then the average is also computed correctly.
  3. If the parameters of the second potential, nep0_copy.txt, are modified (either hyperparameters or weights), then the average is not computed correctly.
  4. Changing the denominator in gpu_average_properties from the number of potentials to, for instance, 1.0 does not produce the expected result, except when the same potential is specified twice (a sketch of this kind of averaging kernel is given after the figure below).
  5. The size of the relative error in the prediction seems to scale with the number of atoms in the system; see the figure below.

[figure: error-scaling, relative error vs. number of atoms]
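
As context for point 4, here is a minimal, hypothetical sketch of the kind of per-atom averaging a kernel like gpu_average_properties performs (summing the per-potential forces and dividing by the number of potentials). The names, memory layout, and launch configuration are illustrative only and are not taken from the actual GPUMD source:

```cuda
// Hypothetical averaging kernel, illustrative only (not the GPUMD implementation).
// Each thread averages one atom's force component over all potentials.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void average_forces(
    int num_atoms,
    int num_potentials,
    const double* forces_per_potential, // assumed layout: [potential][atom]
    double* averaged_forces)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_atoms) {
        double sum = 0.0;
        for (int p = 0; p < num_potentials; ++p) {
            sum += forces_per_potential[p * num_atoms + i];
        }
        // The denominator discussed in point 4 above:
        averaged_forces[i] = sum / num_potentials;
    }
}

int main()
{
    const int num_atoms = 4, num_potentials = 2;
    const double h_forces[num_potentials * num_atoms] = {
        1.0, 2.0, 3.0, 4.0,  // forces from potential 0
        3.0, 2.0, 1.0, 0.0}; // forces from potential 1
    double *d_forces, *d_avg;
    cudaMalloc(&d_forces, sizeof(h_forces));
    cudaMalloc(&d_avg, num_atoms * sizeof(double));
    cudaMemcpy(d_forces, h_forces, sizeof(h_forces), cudaMemcpyHostToDevice);
    average_forces<<<1, 64>>>(num_atoms, num_potentials, d_forces, d_avg);
    double h_avg[num_atoms];
    cudaMemcpy(h_avg, d_avg, sizeof(h_avg), cudaMemcpyDeviceToHost);
    for (int i = 0; i < num_atoms; ++i) printf("%g ", h_avg[i]); // expect: 2 2 2 2
    printf("\n");
    cudaFree(d_forces);
    cudaFree(d_avg);
    return 0;
}
```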

Perhaps you can use binary search to pinpoint the change that causes the failure?

This PR (#495) introduced a change to the initial force, and hence to the trajectory. You can check if this is the change responsible for the failure.

> This PR (#495) introduced a change to the initial force, and hence to the trajectory. You can check if this is the change responsible for the failure.

Yes, I saw that the initial values had changed, and updating them fixed the issue for dump_observer observe but not for dump_observer average. So I think there is something more to it.

> Perhaps you can use binary search to pinpoint the change that causes the failure?

I'm not sure I understand what you mean. Going through each commit until the test breaks?

Yes, figure out when it breaks.

I used git bisect and found the faulty commit to be the following:

6f437f7d827a50702e30ce8d2be71b975ee1d1ba is the first bad commit
commit 6f437f7d827a50702e30ce8d2be71b975ee1d1ba
Author: psn417 <psn417@icloud.com>
Date:   Thu Sep 14 23:19:32 2023 +0800

    Add NPT, can run now

    still have problem

Here is a link to the commit: 6f437f7

Then it should be due to the fact that a force evaluation was added before the integration loop in run.cu (which is correct).

You can try commenting out that force evaluation call and see if the regression test passes. If so, you can update your reference data.
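
To make the mechanism concrete, here is a small standalone toy (plain C++ host code, not taken from run.cu) showing that with velocity Verlet, the force assumed at step 0 changes the very first position update and therefore the entire trajectory; this is why reference data generated before such a change no longer matches:

```cpp
// Toy illustration only: a 1D harmonic oscillator integrated with velocity Verlet.
// Run once with a stale (zero) step-0 force and once with a force evaluated
// before the loop; the two trajectories differ from the first step onward.
#include <cstdio>

// Harmonic force f = -k x with k = 1 (illustrative potential).
double force(double x) { return -x; }

// Integrate n velocity-Verlet steps; f0 is the force assumed at step 0.
double integrate(double x, double v, double f0, int n, double dt)
{
    double f = f0;
    for (int step = 0; step < n; ++step) {
        v += 0.5 * dt * f;  // the first half-kick uses f0 on step 0
        x += dt * v;
        f = force(x);
        v += 0.5 * dt * f;
    }
    return x;
}

int main()
{
    const double x0 = 1.0, v0 = 0.0, dt = 0.01;
    const double x_stale = integrate(x0, v0, 0.0, 100, dt);       // no pre-loop force call
    const double x_fresh = integrate(x0, v0, force(x0), 100, dt); // force evaluated before the loop
    printf("stale initial force: x = %.10f\n", x_stale);
    printf("fresh initial force: x = %.10f\n", x_fresh);
    return 0;
}
```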

> Then it should be due to the fact that a force evaluation was added before the integration loop in run.cu (which is correct).
>
> You can try commenting out that force evaluation call and see if the regression test passes. If so, you can update your reference data.

Yes, that was my first thought as well, but that is not the case. Updating the reference data only fixes the test for the observe case; average still fails. I suspect there might be some extra force calculation somewhere that throws things out of sync.

I figured it out. I just needed to regenerate the reference data, but the way I did it originally was faulty. With the reference data generated from the actual trajectory produced by dump_observer average, everything works as expected again. This means that the functionality was never broken; only the test was, which is good.