splitgraph / seafowl

Analytical database for data-driven Web applications 🪶

Home Page: https://seafowl.io

Remote tables: filter without sort

backkem opened this issue

I was wondering: does the datafusion_remote_tables filter push-down not support sorting? It seems that using filters and limits in the absence of a sort order could lead to unexpected results.

I'd be happy to help address this if this is indeed the case.

Hey @backkem, that's a good question.

We abide by the TableProvider API set out by DataFusion, which doesn't take the ORDER BY clause into account:

    async fn scan(
        &self,
        _ctx: &SessionState,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {

Sorting itself is handled by DataFusion later in the data processing pipeline (i.e. once the data has been fetched), by a plan node that sits above the scan node in the plan tree.

While in principle filtering and sorting are commutative, the limit doesn't commute with sorting. DataFusion handles this by carefully deciding when to push the limit down into the scan (which is why it's an Option<usize>), though I forgot where exactly that occurs.
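
To make the commutativity point concrete, here is a minimal sketch in plain Rust. It does not use DataFusion at all: `scan`, `sort_then_limit` and `limit_then_sort` are hypothetical helpers invented for illustration, with a `Vec<i64>` standing in for the remote table and a `min` threshold standing in for the pushed-down filter.

```rust
/// Hypothetical stand-in for a remote scan: applies the filter and, if given,
/// the limit, returning rows in whatever order the source stores them.
fn scan(rows: &[i64], min: i64, limit: Option<usize>) -> Vec<i64> {
    let filtered = rows.iter().copied().filter(|v| *v >= min);
    match limit {
        Some(n) => filtered.take(n).collect(),
        None => filtered.collect(),
    }
}

/// `WHERE v >= min ORDER BY v LIMIT n` with the limit kept above the sort:
/// the scan receives `limit = None` and the rows are truncated after sorting.
fn sort_then_limit(rows: &[i64], min: i64, n: usize) -> Vec<i64> {
    let mut out = scan(rows, min, None);
    out.sort();
    out.truncate(n);
    out
}

/// The same query with the limit (incorrectly) pushed below the sort:
/// the scan keeps the first `n` matching rows in storage order, and only
/// those rows get sorted.
fn limit_then_sort(rows: &[i64], min: i64, n: usize) -> Vec<i64> {
    let mut out = scan(rows, min, Some(n));
    out.sort();
    out
}
```

With rows stored as `[5, 1, 4, 2, 3]`, `min = 2` and `n = 2`, `sort_then_limit` yields `[2, 3]` while `limit_then_sort` yields `[4, 5]`. So a limit should only reach the scan when no sort sits between the two in the plan. The filter, by contrast, can be applied before the sort without changing the result.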

Thank you for the feedback. I'll try to find some time to look into the directions mentioned in apache/datafusion#7871.

Closing as this was answered.
FYI: We created datafusion-contrib/datafusion-federation to explore the full query federation use-case.