apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine

Home Page: https://datafusion.apache.org/ballista

Support executing query stages in execution engines other than DataFusion

andygrove opened this issue

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When the executor receives a task, it deserializes the physical plan, wraps it in a ShuffleWriterExec, and executes it with DataFusion.

I want the ability to override this behavior to execute the plan in execution engines other than DataFusion.

Describe the solution you'd like
In the executor, we call new_shuffle_writer to create the ShuffleWriterExec that wraps the plan to be executed. I am thinking about moving that method into a new ExecutionEngine trait and creating a DataFusionExecutor implementation of the trait that is used by default.

We can then add a field to ExecutorProcessConfig as follows:

execution_engine: Option<Arc<dyn ExecutionEngine>>

This will allow me to register custom execution engines from PyBallista, and execute distributed queries in Polars, Pandas, and cuDF.
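To make the proposal concrete, here is a minimal sketch of how the trait, the default implementation, and the config field could fit together. It is only an illustration under stated assumptions: PhysicalPlan, QueryStage, ShuffleWriter, CustomEngine, CustomStage, resolve_engine, and the new_shuffle_writer signature are hypothetical stand-ins so that the example compiles without DataFusion or Ballista; only the names ExecutionEngine, DataFusionExecutor, ExecutorProcessConfig, and execution_engine come from the description above.

```rust
use std::fmt::Debug;
use std::sync::Arc;

/// Hypothetical stand-in for the deserialized physical plan.
#[derive(Debug, Clone)]
pub struct PhysicalPlan {
    pub description: String,
}

/// Hypothetical stand-in for an executable query stage.
pub trait QueryStage: Debug + Send + Sync {
    /// Runs one partition and returns a summary of what was executed.
    fn execute_partition(&self, partition: usize) -> Result<String, String>;
}

/// The proposed trait: the engine decides how a plan is wrapped for execution.
/// The method signature here is an assumption, not Ballista's actual API.
pub trait ExecutionEngine: Debug + Send + Sync {
    fn new_shuffle_writer(
        &self,
        job_id: &str,
        stage_id: usize,
        plan: PhysicalPlan,
    ) -> Result<Arc<dyn QueryStage>, String>;
}

/// Stand-in for ShuffleWriterExec wrapping the plan.
#[derive(Debug)]
pub struct ShuffleWriter {
    job_id: String,
    stage_id: usize,
    plan: PhysicalPlan,
}

impl QueryStage for ShuffleWriter {
    fn execute_partition(&self, partition: usize) -> Result<String, String> {
        Ok(format!(
            "datafusion: {}/{}/{} {}",
            self.job_id, self.stage_id, partition, self.plan.description
        ))
    }
}

/// Default engine, used when no custom engine is configured.
#[derive(Debug, Default)]
pub struct DataFusionExecutor;

impl ExecutionEngine for DataFusionExecutor {
    fn new_shuffle_writer(
        &self,
        job_id: &str,
        stage_id: usize,
        plan: PhysicalPlan,
    ) -> Result<Arc<dyn QueryStage>, String> {
        Ok(Arc::new(ShuffleWriter {
            job_id: job_id.to_string(),
            stage_id,
            plan,
        }))
    }
}

/// Hypothetical custom engine, standing in for e.g. a Polars-backed one.
#[derive(Debug)]
pub struct CustomEngine;

#[derive(Debug)]
pub struct CustomStage {
    plan: PhysicalPlan,
}

impl QueryStage for CustomStage {
    fn execute_partition(&self, partition: usize) -> Result<String, String> {
        Ok(format!("custom: partition {} of {}", partition, self.plan.description))
    }
}

impl ExecutionEngine for CustomEngine {
    fn new_shuffle_writer(
        &self,
        _job_id: &str,
        _stage_id: usize,
        plan: PhysicalPlan,
    ) -> Result<Arc<dyn QueryStage>, String> {
        Ok(Arc::new(CustomStage { plan }))
    }
}

/// Only the field proposed above is shown; other config fields are omitted.
#[derive(Debug, Default)]
pub struct ExecutorProcessConfig {
    pub execution_engine: Option<Arc<dyn ExecutionEngine>>,
}

/// Returns the configured engine, falling back to the DataFusion default.
pub fn resolve_engine(config: &ExecutorProcessConfig) -> Arc<dyn ExecutionEngine> {
    match &config.execution_engine {
        Some(engine) => engine.clone(),
        None => {
            let default: Arc<dyn ExecutionEngine> = Arc::new(DataFusionExecutor);
            default
        }
    }
}
```

With this shape, an executor started with ExecutorProcessConfig::default() keeps today's behavior, while setting execution_engine to Some(Arc::new(CustomEngine)) routes every task through the custom engine without touching the rest of the executor.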

Describe alternatives you've considered
None

Additional context
None

@Dandandan @thinkharderdev @yahoNanJing @avantgardnerio @jdye64 fyi - let me know if you have any opinions on this approach. I am going to build a prototype of this over the next week. I am sure the design will evolve as I try to implement it.

I have been thinking about this a lot today. I have had numerous ideas, and all of them seem to have fallen flat when I tried to fully implement them. I like the general idea, however, and am curious to see how it looks fully materialized. Interesting stuff for sure!

There needs to be a scheduler element to this as well so we can do the plan translation once rather than per task.

@andygrove
I like this idea. Do you have any proposals for the other execution engines?

I am closing this for now because I think it is too ambitious given the current level of development in the project.