How reusable are our sequential functions (e.g., in Flyte, Bytewax, etc.)?
cisaacstern opened this issue
Recent discussion with @ljstrnadiii got me wondering how reusable our sequential functions are outside the Beam context. In general, we aim to follow this Beam programming guide best practice:
pangeo-forge-recipes/pangeo_forge_recipes/transforms.py
Lines 37 to 43 in c292777
In theory, this means those parts of our code could be wrapped in some other, non-Beam, parallelization framework, such as Flyte (a task orchestrator, which Len has experience with), or possibly Bytewax (another dataflow model, which has come up in our Coordination meetings). In practice, I'm not sure how difficult this would be.
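To make the idea of wrapping concrete, here is a minimal sketch, not code from pangeo-forge-recipes. `scale_record` and `run_with_executor` are hypothetical names invented for illustration, and a standard-library thread pool stands in for "some other framework". The point is only that a function with no Beam imports can be handed to any mapper.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def scale_record(record: Tuple[str, float], factor: float = 2.0) -> Tuple[str, float]:
    """Hypothetical sequential function: plain Python, no Beam imports."""
    key, value = record
    return key, value * factor


def run_with_executor(records: List[Tuple[str, float]]) -> Dict[str, float]:
    """Apply the same function through a non-Beam wrapper (a stdlib thread pool)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # pool.map yields results in input order, one (key, value) pair per record.
        return dict(pool.map(scale_record, records))


# In Beam, the same function would be wired in with something like
# `pcoll | beam.Map(scale_record)`. Other frameworks would need their own thin
# wrapper around it; how thin that wrapper can be in practice is the open question.

if __name__ == "__main__":
    print(run_with_executor([("a", 1.0), ("b", 2.5)]))
```

How well this carries over presumably depends on how much of each function's input and output is expressed in Beam-agnostic types, which is part of what this issue is asking.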
Opening this issue for further discussion, particularly as a place for ongoing discussion with Len re: Flyte, but also on this subject more generally. The maximalist approach to this question would be to ask what it would take to actually support various data-parallel interfaces in Pangeo Forge. Having just come off the major Beam refactor effort, I think it's fair to say we don't have the appetite for that just yet, but in the big picture it's not entirely off the table. For the near term, I'm thinking more along the lines of supporting others in doing this wrapping themselves.
To add my 2¢: as was done for Dask, I think the best abstraction would be to contribute Flyte or Bytewax runners to the Beam project.