Async checkpointer failure handling
hr0nix opened this issue
We've observed the following behavior of orbax-checkpoint:
- If async checkpoint saving fails, the main thread continues working.
- In case it's relevant: checkpoint saving failed for us because we were writing to a slow network disk and hit a barrier timeout in a multi-host training setup.
- While saving the next checkpoint, orbax tries to remove the previous checkpoint, can't find it because it was never successfully created, and fails.
Is there a way to handle this situation more gracefully? For instance, have a way to fail training immediately when async saving fails.
I've reproduced your error. I think we're just not correctly communicating errors from the background thread in CheckpointManager back to the main thread. Should have a fix in by tomorrow!
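For anyone reading along, here is a rough sketch of the general pattern being described: capture the exception in the background thread and re-raise it in the main thread the next time it waits or starts a new save. This is not orbax code and not the actual fix. `BackgroundSaver`, `save_async`, `wait`, and `failing_save` are hypothetical names made up for illustration, and it only assumes plain Python `threading`.

```python
import threading


class BackgroundSaver:
    """Hypothetical helper (not part of orbax): runs a save function in a
    background thread and re-raises any failure in the calling thread."""

    def __init__(self):
        self._thread = None
        self._error = None

    def save_async(self, save_fn, *args):
        # Surface a failure from the previous save before starting a new one,
        # so the caller never builds on a checkpoint that was not written.
        self.wait()

        def _run():
            try:
                save_fn(*args)
            except Exception as e:
                # Captured here, re-raised later in wait().
                self._error = e

        self._thread = threading.Thread(target=_run)
        self._thread.start()

    def wait(self):
        """Blocks until the pending save finishes; raises if it failed."""
        if self._thread is not None:
            self._thread.join()
            self._thread = None
        if self._error is not None:
            error, self._error = self._error, None
            raise error


def failing_save(step):
    # Stand-in for a save that hits a barrier timeout on a slow disk.
    raise TimeoutError(f"barrier timeout while saving step {step}")
```

With something like this, calling `saver.save_async(failing_save, 1)` followed by `saver.wait()` (or by the next `save_async`) raises the `TimeoutError` in the main thread, so training stops instead of carrying on and failing later during cleanup of a checkpoint that doesn't exist.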
FYI this should be fixed with this change: https://github.com/google/orbax/pull/606/files
Thanks a lot for the quick fix!