dask-contrib / dask-sql

Distributed SQL Engine in Python using Dask

Home Page:https://dask-sql.readthedocs.io/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

[ENH] Expand inner/left/right join logic to properly handle joins on non-referenced columns

charlesbluca opened this issue · comments

Is your feature request related to a problem? Please describe.
Currently, when faced with an inner/left/right join condition that does not compare directly referenced columns (for example, if one of the columns is the result of a cast operation), we opt to instead to do an cross join and then filter the result by the computed join condiiton(s).

Given that cross joins raise a warning around resource intensiveness, I'm wondering if this could create a performance bottleneck for otherwise simple queries:

import pandas as pd
from dask_sql import Context

c = Context()
c.create_table("df1", pd.DataFrame({"a": [1, 2, 3, 4, 5]}))
c.create_table("df2", pd.DataFrame({"b": [1.0, 2.0, 3.0, 4.0, 5.0]}))

df = c.sql("SELECT * FROM df1 JOIN df2 ON a=b")

Describe the solution you'd like
I'd be interested in refactoring our join code so that _extract_lhs_rhs has handling for operands beyond just RexType.References, and reflecting that change across the remainder of the join code so that we pre-compute RexType.Call columns prior to the join operation and use them in an inner/left/right join rather than as filters for a cross join.

Additional context
Was thinking about this while writing out a test for #1153 (comment); this was initially a bit confusing because the EXPLAIN for the above query implies that an inner join would be executed, when in actuality we run through the cross join codepath on the Python side.