pangeo-forge / pangeo-forge-recipes

A recent discussion with @ljstrnadiii got me wondering how reusable our sequential functions are outside the Beam context. In general, we aim to follow this best practice from the Beam programming guide:

pangeo-forge-recipes/pangeo_forge_recipes/transforms.py

Lines 37 to 43 in c292777

    
    # - Expose large, non-trivial, reusable sequential bits of the transform’s code,
    #   which others might want to reuse in ways you haven’t anticipated, as a regular
    #   function or class library. The transform should simply wire this logic together.
    #   As a side benefit, you can unit-test those functions and classes independently.
    #   Example: when developing a transform that parses files in a custom data format,
    #   expose the format parser as a library; likewise for a transform that implements
    #   a complex machine learning algorithm, etc.

In theory, this means those parts of our code could be wrapped in some other, non-Beam, parallelization framework, such as Flyte (a task orchestrator, which Len has experience with), or possibly Bytewax (another dataflow model, which has come up in our Coordination meetings). In practice, I'm not sure how difficult this would be.
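To make the idea concrete, here is a minimal sketch of what "wrapping" could look like. It is not taken from `pangeo_forge_recipes`: `summarize_chunk` and `run_with_thread_pool` are hypothetical names, and a standard-library thread pool stands in for whatever non-Beam framework (Flyte, Bytewax, etc.) one might actually use. The only point is that the sequential function has no Beam imports, so any executor that can map a callable over items can reuse it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple

Item = Tuple[int, List[float]]
Summary = Tuple[int, Dict[str, float]]


def summarize_chunk(item: Item) -> Summary:
    """Hypothetical sequential function: plain Python, no Beam dependency."""
    index, values = item
    stats = {
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }
    return index, stats


# Inside a Beam transform, the function would only be wired in, e.g.:
#     pcoll | beam.Map(summarize_chunk)


def run_with_thread_pool(items: Iterable[Item], max_workers: int = 4) -> List[Summary]:
    """Stand-in for a non-Beam framework: map the same function over the items."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(summarize_chunk, items))


chunks = [(0, [1.0, 2.0, 3.0]), (1, [10.0, 20.0])]
results = run_with_thread_pool(chunks)
```

How far this carries over to our real functions depends on how many of them rely on Beam-specific behaviour (for example, keyed grouping or side inputs) rather than on plain per-element arguments, which is part of what this issue is asking.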

I'm opening this issue as a place to continue that conversation with Len re: Flyte, and to discuss the subject more generally. The maximalist version of this question asks what it would take to actually support multiple data-parallel interfaces in Pangeo Forge. Having just come off the major Beam refactor effort, I think it's fair to say we don't have the appetite for that just yet, though in the big picture it's not entirely off the table. For the near term, I'm thinking more along the lines of supporting others in doing this wrapping themselves.

To add my 2¢: as with Dask, I think the best abstraction would be to contribute Flyte or Bytewax runners to the Beam project.


How reusable are our sequential functions (e.g., in Flyte, Bytewax, etc.)?