checkpoint file not read in foundation branch.

Question

jungsdao opened this issue 4 months ago

I wanted to fine-tune a model starting from the foundation model in the foundation branch.
I provided some training structures to the foundation model and obtained a model and a checkpoint file.
With these in hand, I wanted to add more structures and, this time, start from the previous checkpoint file rather than from the foundation model's checkpoint again.
However, it does not seem to recognize the previous checkpoint file, which I have already copied.
I think it used to read the checkpoint file automatically if it was in the checkpoint directory, but in the foundation branch it is not reading it.
Is this a bug in the foundation branch?
Below is my training command, from which I have removed --foundation_model="medium" so that it does not start from the foundation model checkpoint.
Many thanks in advance!

mace_run_train \
  --name="umbrella" \
  --energy_key="DFT_energy" \
  --forces_key="DFT_forces" \
  --train_file="training_12.xyz" \
  --valid_fraction=0.1 \
  --E0s="{1:-14.9005442054276, 6:-162.973421385767, 8:-438.578998764142, 45:-3089.70420527816}" \
  --r_max=6.0 \
  --energy_weight=1.0 \
  --forces_weight=1.0 \
  --lr=0.01 \
  --scaling="rms_forces_scaling" \
  --batch_size=16 \
  --max_num_epochs=400 \
  --start_swa=300 \
  --swa \
  --ema \
  --ema_decay=0.99 \
  --amsgrad \
  --error_table='PerAtomMAE' \
  --default_dtype="float64" \
  --device="cuda" \
  --save_cpu \
  --seed=3

Ilyes Batatia · Answer 1 · Wed Jan 31 2024 21:26:57 GMT+0800 (China Standard Time)

What is the name of your first run? You need to keep the same name. Also, you should keep --foundation_model="medium" to get the right hyperparameters for continuing your training. It will replace the foundation model with the checkpoint if it finds one.

Hyunwook Jung · Answer 2 · Wed Jan 31 2024 21:33:45 GMT+0800 (China Standard Time)

The name of the first run was "umbrella", the same as in the command line.
I have kept --foundation_model="medium", but it is still not recognizing the previous checkpoint file.
If it had recognized it, training should have started from around epoch 196, since the previous checkpoint file is named umbrella_run-3_epoch-196_swa.pt, but it starts from epoch 0.
Also, one line in the log file says "2024-01-31 13:30:10.354 INFO: Using foundation model medium as initial checkpoint.", so it is probably starting from the foundation model's checkpoint rather than from umbrella_run-3_epoch-196_swa.pt.
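
A quick way to confirm which epoch a resumed run should pick up is to read it off the checkpoint filenames. The sketch below relies only on the filename pattern quoted above (umbrella_run-3_epoch-196_swa.pt); the helper names checkpoint_epoch and latest_epoch are hypothetical, and this is not how MACE itself selects a checkpoint.

```shell
# Print the epoch number encoded in a checkpoint filename such as
# umbrella_run-3_epoch-196_swa.pt (pattern taken from the filename in this thread).
checkpoint_epoch() {
  basename "$1" | sed -n 's/.*_epoch-\([0-9][0-9]*\).*/\1/p'
}

# Print the highest epoch among the checkpoints of run tag "$2" found in
# directory "$1". Prints nothing if no matching checkpoint exists.
latest_epoch() {
  for f in "$1"/"$2"_epoch-*.pt; do
    [ -e "$f" ] || continue   # skip the unexpanded glob when nothing matches
    checkpoint_epoch "$f"
  done | sort -n | tail -n 1
}
```

For example, assuming the checkpoints live in a directory named checkpoints, `latest_epoch checkpoints umbrella_run-3` prints the epoch the resumed run would be expected to report.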

Ilyes Batatia · Answer 3 · Wed Jan 31 2024 22:18:53 GMT+0800 (China Standard Time)

Can you share the log files for the three runs?

Hyunwook Jung · Answer 4 · Wed Jan 31 2024 22:35:56 GMT+0800 (China Standard Time)

This is the log file for the run starting from the foundation model.
umbrella_run-3.log

And the following is the log file for the run that I wanted to start from the previous run's checkpoint file. (It was interrupted during training.)
umbrella_run-3.log

Ilyes Batatia · Answer 5 · Wed Jan 31 2024 22:41:29 GMT+0800 (China Standard Time)

Can you make sure you included --restart_latest in your input?

Hyunwook Jung · Answer 6 · Wed Jan 31 2024 22:48:40 GMT+0800 (China Standard Time)

Oh, actually it was my mistake: I was missing --restart_latest. Sorry for the trouble!
After adding that flag, it starts from epoch 196,
as shown in this log file.
umbrella_run-3.log
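
For reference, here is a sketch of the resumed command, combining the two points from this thread: keep the same --name and --foundation_model="medium", and add --restart_latest. All other flags are copied unchanged from the original command. The wrapper function name resume_umbrella is hypothetical and only there so that nothing runs until it is called.

```shell
# Resume the "umbrella" run from its latest checkpoint.
# Only --foundation_model and --restart_latest differ from the original command.
resume_umbrella() {
  mace_run_train \
    --name="umbrella" \
    --foundation_model="medium" \
    --energy_key="DFT_energy" \
    --forces_key="DFT_forces" \
    --train_file="training_12.xyz" \
    --valid_fraction=0.1 \
    --E0s="{1:-14.9005442054276, 6:-162.973421385767, 8:-438.578998764142, 45:-3089.70420527816}" \
    --r_max=6.0 \
    --energy_weight=1.0 \
    --forces_weight=1.0 \
    --lr=0.01 \
    --scaling="rms_forces_scaling" \
    --batch_size=16 \
    --max_num_epochs=400 \
    --start_swa=300 \
    --swa \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --error_table='PerAtomMAE' \
    --default_dtype="float64" \
    --device="cuda" \
    --save_cpu \
    --seed=3 \
    --restart_latest
}
```

Calling resume_umbrella from the directory that contains the previous run's checkpoints should then continue from the last saved epoch instead of epoch 0.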

But I see an "Epoch None" entry at the very beginning when I work in the foundation branch, which was not the case in the main branch. Does it have any meaning?
Anyway, thank you for pointing out my mistake!

Ilyes Batatia · Answer 7 · Wed Jan 31 2024 22:53:31 GMT+0800 (China Standard Time)

Nice! Epoch None corresponds to the state before the first new epoch. We added that to better track the fine-tuning.