pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.

Home Page: https://pangeo-forge.readthedocs.io/

Support skipping + retry recipes for failure recovery (aka "skipsies")

abarciauskas-bgse opened this issue

I understand a common problem is having failures on some, but not all, source files. It is nearly impossible to run a massively parallel job and not face some sort of connection issue or other unexpected error from opening a file.

It would be great if there were a way to skip over failures, perhaps by writing NaNs for the expected dimensions and logging the failure, and then to run a retry version of the same recipe that tries to fill in those gaps.
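To make the idea concrete, here is a rough sketch of the skip-then-retry flow in plain Python. It is not pangeo-forge-recipes API: `fill_slots`, `flaky_open`, and the dict used as a stand-in for the target store are all hypothetical names made up for illustration, and the failure is simulated.

```python
import logging
import math

logger = logging.getLogger("skipsies")


def fill_slots(target, keys, open_chunk, chunk_len):
    """Write each chunk into its slot; on failure, write NaNs and record the key.

    `target` is a plain dict standing in for the real target store (assumption).
    Returns the list of keys that failed, which can be fed to a retry pass.
    """
    failed = []
    for key in keys:
        try:
            chunk = open_chunk(key)
        except Exception as exc:  # deliberately broad: any open error is skipped
            logger.warning("skipping %s: %s", key, exc)
            chunk = [math.nan] * chunk_len  # placeholder with the expected length
            failed.append(key)
        target[key] = chunk
    return failed


# Hypothetical opener that fails once for key 2, simulating a flaky connection.
attempts = {}


def flaky_open(key):
    attempts[key] = attempts.get(key, 0) + 1
    if key == 2 and attempts[key] == 1:
        raise ConnectionError("simulated network error")
    return [float(key)] * 3


target = {}

# First pass: every key gets a slot, failed ones are filled with NaNs.
failed_first = fill_slots(target, range(4), flaky_open, chunk_len=3)
gaps_after_first = [
    k for k, v in target.items() if any(math.isnan(x) for x in v)
]

# Retry pass: only the previously failed keys are attempted again.
failed_retry = fill_slots(target, failed_first, flaky_open, chunk_len=3)
```

The retry pass reuses the same function with only the failed keys, so the list of failures doubles as the record of which gaps still need filling.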

cc @ranchodeluxe @norlandrhagen @sharkinsspatial (who came up with the name "skipsies")

Julius Buseke has been running a bunch of the CMIP6 archive through pangeo-forge-recipes (on Dataflow). I can ask him if he has found any good ways to re-run failed jobs and keep track of them.

Ha, I had a similar ticket I closed yesterday 😄

I like the NaN route as a last resort.

Later today I plan to crosswalk what Flink/Beam have for checkpointing (which is another way to solve this), but it depends on the runner. Running with LocalDirectBakery on a decent-sized machine still produces network issues for an auth-fronted S3 bucket. I will also compare against a public bucket.