Support executing query stages in execution engines other than DataFusion
andygrove opened this issue · comments
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When the executor receives a task, it deserializes the physical plan, wraps it in a ShuffleWriterExec, and executes it with DataFusion.
I want the ability to override this behavior to execute the plan in execution engines other than DataFusion.
Describe the solution you'd like
In the executor, we call new_shuffle_writer to create the ShuffleWriterExec that wraps the plan to be executed. I am thinking about moving that method into a new ExecutionEngine trait and creating a DataFusionExecutor implementation of the trait that is used by default.
We can then add a field to ExecutorProcessConfig as follows:
execution_engine: Option<Arc<dyn ExecutionEngine>>
This will allow me to register custom execution engines from PyBallista, and execute distributed queries in Polars, Pandas, and cuDF.
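A rough sketch of how the trait and the config field could fit together is below. It uses only the standard library. PhysicalPlan, QueryStageExecutor, the method name create_query_stage_exec, the engine() helper, and CustomEngine are hypothetical stand-ins invented for illustration, not the real Ballista or DataFusion types, and the final design may look quite different.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a deserialized physical plan.
// In Ballista this would be a DataFusion execution plan, not a string.
#[derive(Debug, Clone, PartialEq)]
pub struct PhysicalPlan(pub String);

// Hypothetical handle for one runnable query stage.
pub trait QueryStageExecutor {
    // Runs one partition and returns a description of where output went.
    fn execute(&self, partition: usize) -> Result<String, String>;
}

// The proposed extension point: turn a stage's plan into something runnable.
pub trait ExecutionEngine: Send + Sync {
    fn create_query_stage_exec(
        &self,
        job_id: &str,
        stage_id: usize,
        plan: PhysicalPlan,
        work_dir: &str,
    ) -> Result<Box<dyn QueryStageExecutor>, String>;
}

// Default engine: plays the role that new_shuffle_writer plays today.
pub struct DataFusionExecutor;

struct ShuffleWriterStage {
    job_id: String,
    stage_id: usize,
    plan: PhysicalPlan,
    work_dir: String,
}

impl QueryStageExecutor for ShuffleWriterStage {
    fn execute(&self, partition: usize) -> Result<String, String> {
        Ok(format!(
            "datafusion:{}:{}/{}/{}/{}",
            self.plan.0, self.work_dir, self.job_id, self.stage_id, partition
        ))
    }
}

impl ExecutionEngine for DataFusionExecutor {
    fn create_query_stage_exec(
        &self,
        job_id: &str,
        stage_id: usize,
        plan: PhysicalPlan,
        work_dir: &str,
    ) -> Result<Box<dyn QueryStageExecutor>, String> {
        Ok(Box::new(ShuffleWriterStage {
            job_id: job_id.to_string(),
            stage_id,
            plan,
            work_dir: work_dir.to_string(),
        }))
    }
}

// Hypothetical custom engine, e.g. one registered from PyBallista.
pub struct CustomEngine;

struct CustomStage(PhysicalPlan);

impl QueryStageExecutor for CustomStage {
    fn execute(&self, partition: usize) -> Result<String, String> {
        Ok(format!("custom:{}:{}", (self.0).0, partition))
    }
}

impl ExecutionEngine for CustomEngine {
    fn create_query_stage_exec(
        &self,
        _job_id: &str,
        _stage_id: usize,
        plan: PhysicalPlan,
        _work_dir: &str,
    ) -> Result<Box<dyn QueryStageExecutor>, String> {
        Ok(Box::new(CustomStage(plan)))
    }
}

// Only the fields relevant to this sketch; the real config has more.
#[derive(Default)]
pub struct ExecutorProcessConfig {
    pub work_dir: String,
    pub execution_engine: Option<Arc<dyn ExecutionEngine>>,
}

impl ExecutorProcessConfig {
    // Falls back to the DataFusion engine when no override is registered.
    pub fn engine(&self) -> Arc<dyn ExecutionEngine> {
        match &self.execution_engine {
            Some(engine) => Arc::clone(engine),
            None => Arc::new(DataFusionExecutor),
        }
    }
}
```

With this shape, leaving execution_engine as None keeps today's behavior, while setting it to Some(Arc::new(CustomEngine)) routes every task through the custom engine without touching the rest of the executor.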
Describe alternatives you've considered
None
Additional context
None
@Dandandan @thinkharderdev @yahoNanJing @avantgardnerio @jdye64 fyi - let me know if you have any opinions on this approach. I am going to build a prototype of this over the next week. I am sure the design will evolve as I try and implement this.
I have been thinking about this a lot today. I have had numerous ideas, and all seem to have fallen flat as I tried to fully implement them. I like the general idea, however, and am curious to see how it looks fully materialized. Interesting stuff for sure!
There needs to be a scheduler element to this as well so we can do the plan translation once rather than per task.
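One way to picture the "translate once per stage" idea is a small cache keyed by job and stage, sketched below with plans represented as plain strings. StagePlanCache, plan_for_task, and the translations counter are hypothetical names for illustration only; nothing here reflects existing scheduler code.

```rust
use std::collections::HashMap;

// Hypothetical scheduler-side cache: translates a stage's plan for a
// non-DataFusion engine once, then reuses the result for every task.
pub struct StagePlanCache<F: Fn(&str) -> String> {
    translate: F,
    translated: HashMap<(String, usize), String>,
    // Number of times the translation function actually ran.
    pub translations: usize,
}

impl<F: Fn(&str) -> String> StagePlanCache<F> {
    pub fn new(translate: F) -> Self {
        StagePlanCache {
            translate,
            translated: HashMap::new(),
            translations: 0,
        }
    }

    // Returns the translated plan for a task of the given stage,
    // translating only on the first request for that stage.
    pub fn plan_for_task(&mut self, job_id: &str, stage_id: usize, plan: &str) -> &str {
        let key = (job_id.to_string(), stage_id);
        if !self.translated.contains_key(&key) {
            let out = (self.translate)(plan);
            self.translations += 1;
            self.translated.insert(key.clone(), out);
        }
        &self.translated[&key]
    }
}
```

Every task in a stage would then receive the already-translated plan, so the translation cost is paid per stage rather than per task.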
@andygrove
I like this idea. Do you have any proposals for the other execution engines?
I am closing this for now because I think it is too ambitious given the current level of development in the project.