Refactor!: Distributed Query Processing
gangliao opened this issue · comments
Gang Liao commented
- I should partition the logical plan instead of the physical plan.
- This is because the logical plan is small and generating a physical plan from it is extremely fast. #407
- We can also put the logical plans in the cloud context.
Gang Liao commented
q0 generates physical plan in 296 us
q1 generates physical plan in 236 us
q2 generates physical plan in 234 us
q3 generates physical plan in 908 us
q4 generates physical plan in 1720 us
q5 generates physical plan in 1808 us
q6 generates physical plan in 2401 us
q7 generates physical plan in 1263 us
q8 generates physical plan in 1350 us
q9 generates physical plan in 1860 us
q10 generates physical plan in 224 us
q11 generates physical plan in 598 us
q13 generates physical plan in 800 us
q12 generates physical plan in 622 us
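
To put the log above in perspective, here is a minimal sketch that summarizes the reported timings. The numbers are copied verbatim from the log; the variable names are illustrative only and are not part of the project.

```python
from statistics import mean

# Physical plan generation times in microseconds, copied from the log above.
plan_times_us = {
    "q0": 296, "q1": 236, "q2": 234, "q3": 908, "q4": 1720,
    "q5": 1808, "q6": 2401, "q7": 1263, "q8": 1350, "q9": 1860,
    "q10": 224, "q11": 598, "q13": 800, "q12": 622,
}

# Query names with the largest and smallest planning time.
slowest = max(plan_times_us, key=plan_times_us.get)
fastest = min(plan_times_us, key=plan_times_us.get)

total_us = sum(plan_times_us.values())
mean_us = mean(plan_times_us.values())

print(f"slowest: {slowest} ({plan_times_us[slowest]} us)")
print(f"fastest: {fastest} ({plan_times_us[fastest]} us)")
print(f"total: {total_us} us, mean: {mean_us:.0f} us")
```

Even the slowest query (q6) plans in about 2.4 ms, and all 14 queries together take about 14.3 ms, which is the basis for considering re-planning from the logical plan on each function.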
Gang Liao commented
- Ballista also partitions the physical plan directly: https://github.com/apache/arrow-datafusion/tree/master/ballista
Maybe we should keep the current implementation.
Gang Liao commented
- Okay, we should keep partitioning the physical plan, since Spark is also implemented this way.
Gang Liao commented
[x] Ballista simply splits each repartition operator in the physical plan into two operators (ShuffleWriterExec and ShuffleReaderExec) for distributed processing.
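
The splitting idea can be sketched on a toy plan tree. This is only an illustration under stated assumptions, not Ballista's actual code: `Plan`, `split_stages`, and `to_stages` are hypothetical names, operators are plain strings rather than DataFusion `ExecutionPlan` objects, and the stage numbering scheme is made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Plan:
    """Toy physical-plan node (hypothetical): an operator name plus children."""
    op: str
    children: List["Plan"] = field(default_factory=list)


def split_stages(plan: Plan, stages: List[Plan]) -> Plan:
    """Cut the tree at every RepartitionExec.

    The subtree below the repartition is wrapped in a ShuffleWriterExec and
    appended to `stages`; the parent sees a ShuffleReaderExec leaf instead.
    """
    children = [split_stages(child, stages) for child in plan.children]
    if plan.op == "RepartitionExec":
        stage_id = len(stages)  # assumption: stages are numbered in creation order
        stages.append(Plan(f"ShuffleWriterExec(stage={stage_id})", children))
        return Plan(f"ShuffleReaderExec(stage={stage_id})")
    return Plan(plan.op, children)


def to_stages(plan: Plan) -> List[Plan]:
    """Return all stages, upstream stages first and the final stage last."""
    stages: List[Plan] = []
    root = split_stages(plan, stages)
    stages.append(root)
    return stages


# A two-phase aggregation with one repartition between the phases.
plan = Plan("HashAggregateExec(final)", [
    Plan("RepartitionExec", [
        Plan("HashAggregateExec(partial)", [Plan("CsvExec")]),
    ]),
])

stages = to_stages(plan)
for stage in stages:
    print(stage)
```

In this sketch each resulting stage has no repartition inside it, so it could be shipped to a separate worker or cloud function, with the reader of one stage consuming what the writer of the upstream stage produced.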