Refactor!: Distributed Query Processing

gangliao opened this issue 3 years ago

Gang Liao · Answer 1 · Tue Jan 04 2022 09:16:28 GMT+0800 (China Standard Time)

I should partition the logical plan instead of the physical plan.
- The logical plan is really small, and physical plan generation is extremely fast (#407), so each function can rebuild the physical plan itself.
- We can also put the logical plans in the cloud context.

Gang Liao · Answer 2 · Tue Jan 04 2022 09:17:18 GMT+0800 (China Standard Time)

q0 generates physical plan in 296 us
q1 generates physical plan in 236 us
q2 generates physical plan in 234 us
q3 generates physical plan in 908 us
q4 generates physical plan in 1720 us
q5 generates physical plan in 1808 us
q6 generates physical plan in 2401 us
q7 generates physical plan in 1263 us
q8 generates physical plan in 1350 us
q9 generates physical plan in 1860 us
q10 generates physical plan in 224 us
q11 generates physical plan in 598 us
q13 generates physical plan in 800 us
q12 generates physical plan in 622 us

Gang Liao · Answer 3 · Tue Jan 04 2022 10:52:00 GMT+0800 (China Standard Time)

Ballista also partitions the physical plan directly: https://github.com/apache/arrow-datafusion/tree/master/ballista
Maybe we should keep the current implementation.

Gang Liao · Answer 4 · Tue Jan 04 2022 11:07:27 GMT+0800 (China Standard Time)

Okay, we should keep partitioning the physical plan, since Spark is also implemented this way.

Gang Liao · Answer 5 · Tue Jan 04 2022 11:18:43 GMT+0800 (China Standard Time)

[x] Ballista simply splits each repartition operator in the physical plan into two operators (ShuffleWriterExec and ShuffleReaderExec) for distributed processing.
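
To make the split concrete, here is a minimal sketch of the idea on a toy plan tree. It is not Ballista's or DataFusion's actual code: the `Plan` enum, `split`, and `plan_stages` are hypothetical names invented for illustration, and the real operators implement DataFusion's `ExecutionPlan` trait rather than a plain enum. The sketch only shows how cutting at each repartition boundary yields independent stages that a scheduler could run on separate workers.

```rust
/// Toy physical plan. These variants are illustrative stand-ins,
/// NOT the real DataFusion/Ballista operator types.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Filter(Box<Plan>),
    Aggregate(Box<Plan>),
    Repartition(Box<Plan>),
    /// Root of a stage: writes its input's output as shuffle data.
    ShuffleWriter { stage_id: usize, input: Box<Plan> },
    /// Leaf of a later stage: reads the shuffle data of `stage_id`.
    ShuffleReader { stage_id: usize },
}

/// Walks the plan bottom-up. Every `Repartition` is cut in two:
/// the subtree below it becomes its own stage rooted at a
/// `ShuffleWriter`, and the parent keeps a `ShuffleReader`
/// placeholder that points at that stage.
fn split(plan: Plan, stages: &mut Vec<Plan>) -> Plan {
    match plan {
        Plan::Repartition(input) => {
            // Split the child first so nested repartitions get lower ids.
            let child = split(*input, stages);
            let stage_id = stages.len();
            stages.push(Plan::ShuffleWriter {
                stage_id,
                input: Box::new(child),
            });
            Plan::ShuffleReader { stage_id }
        }
        Plan::Filter(input) => Plan::Filter(Box::new(split(*input, stages))),
        Plan::Aggregate(input) => Plan::Aggregate(Box::new(split(*input, stages))),
        // Scans and already-split shuffle operators are returned unchanged.
        other => other,
    }
}

/// Returns all stages in dependency order; the last entry is the
/// final stage. Leaving the final stage unwrapped is a simplification
/// made for this sketch.
fn plan_stages(plan: Plan) -> Vec<Plan> {
    let mut stages = Vec::new();
    let root = split(plan, &mut stages);
    stages.push(root);
    stages
}
```

For a plan shaped like Aggregate -> Repartition -> Aggregate -> Filter -> Scan, `plan_stages` returns two stages: stage 0 is the partial aggregate wrapped in a `ShuffleWriter`, and the final stage is the top aggregate reading from `ShuffleReader { stage_id: 0 }`. Because stage ids equal their position in the returned vector, a stage only ever reads from stages that appear before it.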