google / orbax

Orbax provides common utility libraries for JAX users.

Home Page: https://orbax.readthedocs.io/

Async checkpointer failure handling

hr0nix opened this issue · comments

We've observed the following behavior of orbax-checkpoint:

  • If async checkpoint saving fails, the main thread continues working.
    • In case it's relevant: saving failed for us because we were writing to a slow network disk and hit a barrier timeout in a multi-host training setup.
  • When saving the next checkpoint, orbax tries to remove the previous checkpoint, cannot find it because it was never successfully created, and therefore fails.

Is there a way to handle this situation more gracefully? For instance, could training fail immediately when an async save fails?

I've reproduced your error. I think we're just not correctly communicating errors from the background thread in CheckpointManager back to the main thread. Should have a fix in by tomorrow!
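For readers unfamiliar with the failure mode, here is a minimal sketch of the general pattern of surfacing a background-thread error on the main thread. It is not Orbax's implementation: `BackgroundSaver`, `save_async`, `wait_until_finished` and `check_for_errors` as written here are hypothetical names for illustration, built only on Python's standard `threading` module.

```python
import threading


class BackgroundSaver:
    """Hypothetical helper (not an Orbax class): runs a save function in a
    background thread and re-raises any failure on the calling thread."""

    def __init__(self):
        self._thread = None
        self._error = None

    def _run(self, save_fn, step):
        # An exception raised inside a thread does not reach the main thread
        # by itself, so it is stored here for later re-raising.
        try:
            save_fn(step)
        except Exception as e:
            self._error = e

    def save_async(self, save_fn, step):
        # Wait for the previous save first, so a failed save stops training
        # before the next save tries to clean up a checkpoint that never existed.
        self.wait_until_finished()
        self._thread = threading.Thread(target=self._run, args=(save_fn, step))
        self._thread.start()

    def wait_until_finished(self):
        if self._thread is not None:
            self._thread.join()
            self._thread = None
        self.check_for_errors()

    def check_for_errors(self):
        # Re-raise the stored error once, then clear it.
        if self._error is not None:
            error, self._error = self._error, None
            raise error
```

With this shape, a training loop that calls `save_async` on every checkpoint step would see the stored exception (for example a barrier timeout) at the next save or at an explicit `wait_until_finished()`, rather than continuing silently.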

FYI this should be fixed with this change: https://github.com/google/orbax/pull/606/files

Thanks a lot for the quick fix!