ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

checkpoint file not read in foundation branch.

jungsdao opened this issue

I wanted to fine-tune a model starting from the foundation model in the foundation branch.
I provided some training structures to the foundation model and obtained a model and a checkpoint file.
With these in hand, I wanted to add more structures, this time starting from the previous checkpoint file rather than from the foundation model's checkpoint again.
However, it doesn't seem to recognize the previous checkpoint file, which I have already copied over.
I think it used to read the checkpoint file automatically if it was in the checkpoint directory, but in the foundation branch it's not reading it properly.
Is this a bug in the foundation branch?
Below is my training command, from which I have removed --foundation_model="medium" so as not to start from the foundation model checkpoint.
Many thanks in advance!

mace_run_train \
  --name="umbrella" \
  --energy_key="DFT_energy" \
  --forces_key="DFT_forces" \
  --train_file="training_12.xyz" \
  --valid_fraction=0.1 \
  --E0s="{1:-14.9005442054276, 6:-162.973421385767, 8:-438.578998764142, 45:-3089.70420527816}" \
  --r_max=6.0 \
  --energy_weight=1.0 \
  --forces_weight=1.0 \
  --lr=0.01 \
  --scaling="rms_forces_scaling" \
  --batch_size=16 \
  --max_num_epochs=400 \
  --start_swa=300 \
  --swa \
  --ema \
  --ema_decay=0.99 \
  --amsgrad \
  --error_table='PerAtomMAE' \
  --default_dtype="float64" \
  --device="cuda" \
  --save_cpu \
  --seed=3 

What is the name of your first run? You need to keep the same name. Also, you should keep --foundation_model="medium" to get the right hyperparameters for continuing your training. It will load the checkpoint over the foundation model if it finds one.

The name of the first run was "umbrella", the same as in the command line.
I have kept --foundation_model="medium", but it's still not recognizing the previous checkpoint file.
If it had recognized it, training should have started from around epoch 196, since the previous checkpoint filename is umbrella_run-3_epoch-196_swa.pt, but it starts from epoch 0.
Also, one line of the log file says "2024-01-31 13:30:10.354 INFO: Using foundation model medium as initial checkpoint.", so it is probably starting not from umbrella_run-3_epoch-196_swa.pt but from the foundation model's checkpoint.
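
The epoch expected on resume can be read straight off the checkpoint filename. Below is a minimal sketch, assuming checkpoint names contain "_epoch-<digits>" as in the single example quoted above; the pattern is inferred from that one filename, not taken from the MACE code, and checkpoint_epoch is a hypothetical helper name.

```shell
# Hypothetical helper: print the epoch number embedded in a checkpoint filename.
# Assumption: the name contains "_epoch-<digits>", as in the example from this thread.
# Prints nothing if the pattern is not found.
checkpoint_epoch() {
  printf '%s\n' "$1" | sed -n 's/.*_epoch-\([0-9][0-9]*\).*/\1/p'
}

checkpoint_epoch "umbrella_run-3_epoch-196_swa.pt"   # prints 196
```

If the log of the new run starts at epoch 0 while this prints 196, the checkpoint was not picked up.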

Can you share the log files for the three runs?

This is the log file from the run starting from the foundation model.
umbrella_run-3.log

And the following is the log file from the run that I wanted to start from the previous run's checkpoint file. (It was interrupted during training.)
umbrella_run-3.log

Can you make sure you included --restart_latest in your input?

Oh, actually it was my mistake: I had left out --restart_latest. Sorry for the trouble!!
After adding that keyword, it starts from epoch 196,
as shown in this log file.
umbrella_run-3.log
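
For anyone landing here with the same problem, the resume command that follows from the advice in this thread would look roughly like the sketch below. Every flag is copied from the posts above (the original command plus --foundation_model="medium" and --restart_latest); only the flag order and the wrapper function name resume_umbrella are my own, and the function exists purely so the block can be pasted without immediately launching a run.

```shell
# Hypothetical wrapper (resume_umbrella is not part of MACE).
# --name must match the first run, --foundation_model is kept for the
# hyperparameters, and --restart_latest makes it pick up the existing checkpoint.
resume_umbrella() {
  mace_run_train \
    --name="umbrella" \
    --foundation_model="medium" \
    --restart_latest \
    --energy_key="DFT_energy" \
    --forces_key="DFT_forces" \
    --train_file="training_12.xyz" \
    --valid_fraction=0.1 \
    --E0s="{1:-14.9005442054276, 6:-162.973421385767, 8:-438.578998764142, 45:-3089.70420527816}" \
    --r_max=6.0 \
    --energy_weight=1.0 \
    --forces_weight=1.0 \
    --lr=0.01 \
    --scaling="rms_forces_scaling" \
    --batch_size=16 \
    --max_num_epochs=400 \
    --start_swa=300 \
    --swa \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --error_table='PerAtomMAE' \
    --default_dtype="float64" \
    --device="cuda" \
    --save_cpu \
    --seed=3
}
```

Calling resume_umbrella from the directory that holds the training file and the checkpoint directory of the first run should then continue from the saved epoch rather than from epoch 0.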

But I see there's an "epoch None" at the very beginning when I work in the foundation branch, which was not the case in the main branch. Does it have any meaning?
Anyway, I appreciate you pointing out my mistake!

Nice! Epoch None corresponds to the state before the first new epoch. We added that to better track the fine-tuning.