google / orbax

Orbax provides common utility libraries for JAX users.

Home Page: https://orbax.readthedocs.io/


Any way to have CheckpointManager write earlier checkpoints?

hrbigelow opened this issue · comments

Hi,

I have a directory of checkpoints from a previous run in which I had set

    options = ocp.CheckpointManagerOptions(save_interval_steps=3000, max_to_keep=50)

I ran the code longer than expected, and now have 50 checkpoints:

126000
129000
132000
135000
...
273000

I had wanted to keep every checkpoint starting from 0, so I should have chosen a larger max_to_keep. I restarted from the beginning with max_to_keep=200, but the CheckpointManager is not saving checkpoints 0, 3000, 6000, etc., even though the directory does not yet contain 200 items.

Is this expected behavior, and is there any easy way I can get Orbax to fill in older checkpoints up to max_to_keep and then start to delete them based on the checkpoint order, not the order in which they were written?

Thanks,

Henry

If you want all the checkpoints starting from 0, it sounds like you should not be using max_to_keep at all, but setting it to None so no checkpoints are deleted?

I think the issue is probably that the earlier steps (0, 3000, etc.) are not getting saved at all because they are earlier than the existing steps, and thus out of sequence. Can you check ckpt_mngr.should_save(0)? I would expect this to return False. You can force a save on this step using ckpt_mngr.save(0, state, force=True). Note that you would still get an error with force if the checkpoint already exists. Also, force overrides the save interval, so if you pass force=True on every step you would also get checkpoints at steps such as 1000 and 2000 that are not multiples of save_interval_steps.
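One way to avoid both caveats (the error on an existing step, and the off-interval saves) is to decide per step whether to pass force=True. The sketch below is plain Python and does not call Orbax; the helper name should_force_save and its arguments are hypothetical, and it only encodes the rule described in this thread, not Orbax's actual implementation.

```python
def should_force_save(step, save_interval_steps, existing_steps):
    """Return True when `step` lies on the save interval but precedes an existing checkpoint.

    Hypothetical helper: it mirrors the behavior described in this thread and is
    not part of the Orbax API.
    """
    if step % save_interval_steps != 0:
        return False  # off-interval step: do not force, so the interval is respected
    if step in existing_steps:
        return False  # already on disk: forcing a save here would raise an error
    # Force only when a later checkpoint exists, i.e. the step is "out of sequence".
    return bool(existing_steps) and step < max(existing_steps)


# The 50 checkpoints left over from the earlier run: 126000, 129000, ..., 273000.
existing = set(range(126000, 273001, 3000))

# Which of the first few training steps would need force=True?
backfill = [s for s in range(0, 9001, 1000) if should_force_save(s, 3000, existing)]
print(backfill)  # [0, 3000, 6000, 9000]
```

In a training loop this could be used as ckpt_mngr.save(step, state, force=should_force_save(step, 3000, existing)), with existing built from the steps already in the directory (for example via ckpt_mngr.all_steps(), assuming that method is available in your Orbax version). Once the run passes the latest existing step, the helper returns False and the manager's normal interval logic applies.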

Ahh, yeah, I suspected that Orbax might have a rule that it doesn't save a checkpoint if a later one exists. As far as I know it isn't documented though, so I wanted to make sure things were working as expected. In any case, this isn't much of an issue: I used force=True and it works fine. Thanks for the response.