dbt diff() macro
dbeatty10 opened this issue · comments
I'm interested in prototyping the following proposal.
Goal
Enable the DIFF
operator within a dbt project.
Implementation proposal
- Create a suite of dbt macros that mimic the SQL generated within datools/explanations.py
diff()
macro as the main interface- Result is a relation like this:
Potential syntax
with
this_week as (
select *
from {{ ref("logs") }}
where
crash = true
and timestamp between '2018-08-28' and '2018-09-04'
),
last_week as (
select *
from {{ ref("logs") }}
where
crash = true
and timestamp between '2018-08-21' and '2018-08-28'
)
{{ datools.diff(this_week, last_week, on=["app_version", "device_type", "os"], compare_by="risk_ratio", threshold="2.0", support="0.05", max_order="1") }}
Examples of dbt macros
These two dbt packages contain macros that might be useful for inspiration:
- https://github.com/dbt-labs/dbt-utils/tree/main/macros
- https://github.com/calogica/dbt-expectations/tree/main/macros
For example:
Thank you so much for your interest @dbeatty10! This is a really exciting proposal, as I'd love to see the library hooked into the existing data ecosystem. I can think of a few considerations. Hopefully this long-ish response comes off as excited and welcoming rather than off-putting or overwhelming. After you read it, let me know where/how I can help!
Separating query generation from query execution
Right now the explanations.diff
function does two things: 1) Goes through the process of generating a query, 2) Executes the query to return potential explanations.
In the code, the crossover from step 1 to step 2 is here:
datools/datools/explanations.py
Line 271 in 595d19c
In your proposal, you'd like to use part 1 without relying on part 2. Since part 1 isn't exposed as a public API just yet, one way to accomplish that would be to extract diff_query
(all the part 1 stuff) from diff
, and then both the dbt
wrapper and diff
can call it for execution purposes.
Query generation executes queries as well
The explanations.diff
query doesn't just "statically compile" SQL---it has to execute queries in order to know how to parametrize the SQL it's generating. This is a bit tougher to solve than the previous issue. Here are a few examples:
- To filter the explanations to only ones that pass the support threshold, the code queries for the number of rows in the test relation:
datools/datools/explanations.py
Line 250 in 595d19c
- In order to consider range-valued columns (e.g., floating point numbers) for explanations, the code breaks those columns into 15 equi-sized buckets (it queries the underlying relations to determine those buckets), and then rewrites the test/control relations to include bucketed versions of the columns:
datools/datools/explanations.py
Line 253 in 595d19c
The first issue seems easier to solve --- we can turn the support query into a subquery, and even benefit from one less round-trip to the database. The second issue will require a little bit of thought. Can dbt
handle "I need to run a query inside this macro," or should we brainstorm ways to push the bucketing/rewrite logic into the query as well?
If neither queries-inside-macros nor pushing down the queries works, we can still explore an alternative where bucketing is its own macro distinct from diffing, but I'd love to avoid changing the diff
API if at all possible.
Generating SQL beyond datools.diff
While diff
is certainly the most exciting part of datools
, there are other query generation goodies. For example, as part of making diff work on postgres, I've been implementing GROUPING SETS
support for databases that don't implement grouping sets natively (e.g., SQLite, Redshift) and ones that do (e.g., DuckDB, Postgres). You could imagine dbt
macros for that as well!