Improve error handling

Question

nrutledge opened this issue 4 months ago

We have not put a lot of thought into error handling at this point. We need to ensure the following:

- If any step fails (e.g., the upload to R2), a retry mechanism is in place.
- If a step keeps failing after a set number of retries, we are alerted to the failure.
- Someone running the snapshot service locally can also receive alerts on failures.
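
As a starting point for discussion, the requirements above could be sketched roughly as follows. This is a minimal illustration, not the snapshot service's actual code: `run_with_retries`, its parameters, and the pluggable `alert` callback are all hypothetical names, and Python is used only for the sake of the example. It assumes each step can be wrapped in a zero-argument callable and that retrying with exponential backoff is acceptable.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    step: Callable[[], T],
    *,
    name: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    alert: Callable[[str], None] = print,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `step`, retrying on any exception (hypothetical helper).

    After `max_attempts` failed attempts, call `alert` once with a
    description of the failure and re-raise the last exception.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:
            if attempt == max_attempts:
                # Out of retries: notify whoever is listening, then propagate.
                alert(f"{name} failed after {max_attempts} attempts: {exc!r}")
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))

    raise AssertionError("unreachable")  # the loop always returns or raises
```

Because `alert` is just a callable, a deployed instance could pass a function that posts to a chat or paging system, while someone running the service locally could keep the default `print` or write to a log file. A call might look like `run_with_retries(upload_snapshot, name="R2 upload", max_attempts=5)`, where `upload_snapshot` stands in for whatever function performs the step.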

This issue was brought up during the 2024/03/04 sync meeting.