Any way to have CheckpointManager write earlier checkpoints?
hrbigelow opened this issue · comments
Hi,
I have a directory of checkpoints from a previous run in which I had set
options = ocp.CheckpointManagerOptions(save_interval_steps=3000, max_to_keep=50)
I ran the code longer than expected, and now have 50 checkpoints:
126000
129000
132000
135000
...
273000
I had wanted to have all of the checkpoints starting from 0, so I should have chosen a larger max_to_keep. I restarted from the beginning, now using max_to_keep=200, but it seems that the CheckpointManager is not saving the checkpoints 0, 3000, 6000, etc., even though the directory doesn't have 200 items yet.
Is this expected behavior, and is there any easy way I can get Orbax to fill in older checkpoints up to max_to_keep and then start to delete them based on the checkpoint order, not the order in which they were written?
Thanks,
Henry
If you want all the checkpoints starting from 0, it sounds like you should not be using max_to_keep at all, and should instead set it to None so that no checkpoints are deleted?
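As a minimal sketch of that suggestion, the options from the original post could be changed as follows. This is only the configuration fragment; it assumes the same `ocp` import alias used above and leaves the rest of the training setup unchanged.

```python
import orbax.checkpoint as ocp

# max_to_keep=None turns off count-based deletion, so every saved step is kept.
# save_interval_steps=3000 is carried over from the original post.
options = ocp.CheckpointManagerOptions(save_interval_steps=3000, max_to_keep=None)
```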
I think the issue is probably that the earlier steps (0, 3000, etc.) are not getting saved at all because they are earlier than the existing steps, and thus out of sequence. Can you check ckpt_mngr.should_save(0)? I would expect this to return False. You can force a save on this step using ckpt_mngr.save(0, state, force=True). Note that you would still get an error with force=True if the checkpoint already exists. force=True would also override the save interval, so you would also get checkpoints at off-interval steps such as 1000 and 2000.
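To make the rule described above concrete, here is a toy model in plain Python. It is not Orbax code: the class name ToyCheckpointPolicy is hypothetical, ValueError stands in for whatever error Orbax actually raises, and the logic only mirrors the behavior as described in this thread (steps at or behind the latest existing step are skipped unless forced, forcing ignores the interval, and forcing an existing step is an error).

```python
from typing import Iterable, Optional


class ToyCheckpointPolicy:
    """Toy stand-in for the save rule discussed in this thread; not Orbax code."""

    def __init__(self, save_interval_steps: int, existing_steps: Iterable[int] = ()):
        self.save_interval_steps = save_interval_steps
        self.steps = set(existing_steps)

    def latest_step(self) -> Optional[int]:
        return max(self.steps) if self.steps else None

    def should_save(self, step: int) -> bool:
        latest = self.latest_step()
        # A step at or behind the latest existing checkpoint is out of sequence.
        if latest is not None and step <= latest:
            return False
        # Otherwise only steps on the save interval qualify.
        return step % self.save_interval_steps == 0

    def save(self, step: int, force: bool = False) -> bool:
        # Without force, defer entirely to should_save.
        if not force and not self.should_save(step):
            return False
        # Even with force, an existing step cannot be written again.
        if step in self.steps:
            raise ValueError(f"Checkpoint for step {step} already exists.")
        self.steps.add(step)
        return True


# The 50 checkpoints left over from the first run: 126000, 129000, ..., 273000.
policy = ToyCheckpointPolicy(3000, range(126000, 273001, 3000))

print(policy.should_save(0))           # step 0 is behind the latest step, 273000
print(policy.save(0))                  # skipped without force
print(policy.save(0, force=True))      # written when forced
print(policy.save(1000, force=True))   # force also bypasses the 3000-step interval
```

Under this model, a restarted run that wants only steps 0, 3000, 6000, etc. would need to apply the interval check itself before calling save with force=True.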
Ahh, yeah, I suspected that Orbax might have a rule that it doesn't save a checkpoint if a later one exists. As far as I know it isn't documented, though, so I wanted to make sure things were working as expected. In any case, this isn't much of an issue - I used force=True and it works fine. Thanks for the response.