apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine

Home Page: https://datafusion.apache.org/ballista

Distributed Execution

bubbajoe opened this issue

Hello,

I am very new to Rust, so please bear with me.

I would like to run a query over a large amount of data (50 GB of Parquet files) across multiple executors, but I am wondering how Ballista handles this. Can it execute a heavy load like this even if the node running it only has 16 GB of memory?

  1. How can I determine the memory required for an execution plan?

  2. Does Ballista execute a single query on multiple executors? If not, how can I optimize this?

  1. I'm not sure how you would determine the required amount of memory without just trying it out. Ballista does not load all 50 GB into memory at the same time; it breaks the data up into smaller RecordBatches for processing.
  2. Ballista will run your query across as many executors as it can parallelize the work over (likely as many as you give it, depending on the query).
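
To make the two points above a bit more concrete, here is a rough conceptual sketch in plain Rust (standard library only). It is not Ballista or DataFusion code and uses none of their APIs: `batched_sum`, `partitioned_sum`, the worker threads standing in for executors, and the batch size of 8192 are all hypothetical names and values picked for illustration. It only shows the general idea that processing a stream batch by batch keeps the amount held in memory tied to the batch size rather than the input size, and that independent partitions can be handled in parallel and then combined.

```rust
use std::thread;

/// Sums a stream of values one fixed-size batch at a time.
/// Returns (total, largest number of rows held in a batch at once).
fn batched_sum(rows: impl Iterator<Item = u64>, batch_size: usize) -> (u64, usize) {
    let mut rows = rows.peekable();
    let mut total = 0u64;
    let mut peak_rows = 0usize;
    while rows.peek().is_some() {
        // Only the current batch is materialized; earlier batches were dropped.
        let batch: Vec<u64> = rows.by_ref().take(batch_size).collect();
        peak_rows = peak_rows.max(batch.len());
        total += batch.iter().sum::<u64>();
    }
    (total, peak_rows)
}

/// Splits the range 0..total_rows into one partition per worker thread
/// (a hypothetical stand-in for an executor) and adds up the partial sums.
fn partitioned_sum(total_rows: u64, workers: u64, batch_size: usize) -> u64 {
    assert!(workers > 0, "need at least one worker");
    let chunk = (total_rows + workers - 1) / workers; // ceiling division
    thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|w| {
                let start = (w * chunk).min(total_rows);
                let end = ((w + 1) * chunk).min(total_rows);
                // Each worker streams its own partition in batches.
                s.spawn(move || batched_sum(start..end, batch_size).0)
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    })
}

fn main() {
    let (total, peak) = batched_sum(0..1_000_000u64, 8192);
    println!("single worker: total={total}, peak rows per batch={peak}");
    println!("four workers: total={}", partitioned_sum(1_000_000, 4, 8192));
}
```

In this toy version the peak batch never exceeds 8192 rows no matter how many rows are streamed. Real queries can behave differently: operators such as sorts, joins, or large group-bys may need to hold much more state than a single batch, which is one reason trying the query out is still the practical way to find the memory it needs.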