Async checkpointer failure handling

Question

hr0nix opened this issue 8 months ago · comments

We've observed the following behavior of orbax-checkpoint:

If async checkpoint saving fails, the main thread continues working.
- In case it's relevant: saving failed for us because we were writing to a slow network disk and hit a barrier timeout in a multi-host training setup.
While saving the next checkpoint, orbax tries to remove the previous checkpoint, can't find it because it was never successfully created, and therefore fails.

Is there a way to handle this situation more gracefully? For instance, have a way to fail training immediately when async saving fails.

Colin Gaffney · Answer 1 · Thu Nov 30 2023 08:47:07 GMT+0800 (China Standard Time)

I've reproduced your error. I think we're just not correctly communicating errors from the background thread in CheckpointManager back to the main thread. Should have a fix in by tomorrow!
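
For readers unfamiliar with this failure mode, here is a minimal generic sketch of how an exception raised in a background thread can be captured and re-raised on the main thread. It uses only the Python standard library. `BackgroundSaver`, `save_async`, `wait`, and `failing_save` are hypothetical names made up for illustration; this is not Orbax's API and not necessarily how the actual fix works.

```python
import threading


class BackgroundSaver:
    """Hypothetical stand-in for an async checkpointer (not Orbax's API)."""

    def __init__(self):
        self._thread = None
        self._error = None

    def save_async(self, save_fn):
        # Surface any failure from the previous save before starting a new one,
        # so the caller fails fast instead of continuing with a missing checkpoint.
        self.wait()

        def _run():
            try:
                save_fn()
            except BaseException as e:
                # Store the error so the main thread can re-raise it later.
                self._error = e

        self._thread = threading.Thread(target=_run)
        self._thread.start()

    def wait(self):
        """Block until the in-flight save finishes; re-raise its error, if any."""
        if self._thread is not None:
            self._thread.join()
            self._thread = None
        if self._error is not None:
            error, self._error = self._error, None
            raise error


def failing_save():
    # Simulated failure, loosely modeled on the barrier timeout described above.
    raise TimeoutError("barrier timed out")


saver = BackgroundSaver()
saver.save_async(failing_save)
try:
    saver.wait()
    caught = None
except TimeoutError as e:
    caught = e
```

Without the stored `_error` and the re-raise in `wait`, the exception would stay in the worker thread and the main thread would carry on, which matches the behavior reported above.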

Colin Gaffney · Answer 2 · Fri Dec 01 2023 09:02:28 GMT+0800 (China Standard Time)

FYI this should be fixed with this change: https://github.com/google/orbax/pull/606/files

Boris Yangel · Answer 3 · Fri Dec 01 2023 09:23:48 GMT+0800 (China Standard Time)

Thanks a lot for the quick fix!