apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine

Home Page: https://datafusion.apache.org/ballista

[Shuffle] Support cache remote shuffle reader client in executor.

Ted-Jiang opened this issue

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In our environment, once the number of executors grows past 100, running SQL (shuffling 100 GB) with a parallelism of 10 fails with the following error:

Caused by: java.lang.RuntimeException: Query with job id wXHGdKf failed due to Job failed due to stage 3 failed: Task failed due to runtime execution error: DataFusionError(ArrowError(ExternalError("Arrow error: External error: Shuffle fetch partition error from Executor 10.69.147.148 : 7115d88d-1212-4a4d-95b7-58f5a68600be, map_stage 2, map_partition 3, error desc: Error connecting to Ballista scheduler or executor at http://10.69.147.148:50051/: tonic::transport::Error(Transport, hyper::Error(Connect, Custom { kind: TimedOut, error: Elapsed(()) })) @ GrpcConnectionError")))

Describe the solution you'd like
Caching the remote shuffle reader client in the executor seems to be a solution. Cutting-edge systems such as IOx and TiKV already do this.
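
To make the idea concrete, here is a minimal sketch of a per-executor client cache. It is not Ballista's implementation: `ClientCache`, `get_or_connect`, and `invalidate` are hypothetical names, the client type `C` is a stand-in for whatever gRPC/Flight client the shuffle reader uses, and the sketch uses only the standard library. It assumes the underlying client can be shared behind an `Arc`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Key identifying a remote executor: (host, port).
type Endpoint = (String, u16);

/// Hypothetical cache of shuffle reader clients, one per remote executor.
/// `C` stands in for the real client type used to fetch shuffle partitions.
pub struct ClientCache<C> {
    clients: Mutex<HashMap<Endpoint, Arc<C>>>,
}

impl<C> ClientCache<C> {
    pub fn new() -> Self {
        Self { clients: Mutex::new(HashMap::new()) }
    }

    /// Return the cached client for `host:port`, or build one with `connect`
    /// and cache it. `connect` runs only on a cache miss.
    ///
    /// Note: the lock is held while connecting, which serializes connection
    /// attempts. That keeps the sketch short; an async version would likely
    /// avoid holding a lock across the connect call.
    pub fn get_or_connect<E>(
        &self,
        host: &str,
        port: u16,
        connect: impl FnOnce(&str, u16) -> Result<C, E>,
    ) -> Result<Arc<C>, E> {
        let mut clients = self.clients.lock().unwrap();
        let key = (host.to_string(), port);
        if let Some(client) = clients.get(&key) {
            return Ok(Arc::clone(client));
        }
        // Cache miss: connect once, then share the client with later callers.
        let client = Arc::new(connect(host, port)?);
        clients.insert(key, Arc::clone(&client));
        Ok(client)
    }

    /// Drop a cached client, for example after a fetch error, so that the
    /// next call to `get_or_connect` reconnects.
    pub fn invalidate(&self, host: &str, port: u16) {
        self.clients
            .lock()
            .unwrap()
            .remove(&(host.to_string(), port));
    }

    /// Number of clients currently cached.
    pub fn len(&self) -> usize {
        self.clients.lock().unwrap().len()
    }
}
```

With a cache like this, each executor would open at most one connection per remote executor instead of one per fetched partition, which should reduce connection churn under heavy shuffle load. Whether that resolves the timeout above would still need to be verified in the failing environment.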
