marcua / datools

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

dbt diff() macro

dbeatty10 opened this issue · comments

I'm interested in prototyping the following proposal.

Goal

Enable the DIFF operator within a dbt project.

Implementation proposal

  • Create a suite of dbt macros that mimic the SQL generated within datools/explanations.py
  • diff() macro as the main interface
  • Result is a relation like this:
    image

Potential syntax

with
this_week as (

    select *
    from {{ ref("logs") }}
    where
        crash = true
        and timestamp between '2018-08-28' and '2018-09-04'

),

last_week as (

    select *
    from {{ ref("logs") }}
    where
        crash = true
        and timestamp between '2018-08-21' and '2018-08-28'

)

{{ datools.diff(this_week, last_week, on=["app_version", "device_type", "os"], compare_by="risk_ratio", threshold="2.0", support="0.05", max_order="1") }}

Examples of dbt macros

These two dbt packages contain macros that might be useful for inspiration:

For example:

Thank you so much for your interest @dbeatty10! This is a really exciting proposal, as I'd love to see the library hooked into the existing data ecosystem. I can think of a few considerations. Hopefully this long-ish response comes off as excited and welcoming rather than off-putting or overwhelming. After you read it, let me know where/how I can help!

Separating query generation from query execution

Right now the explanations.diff function does two things: 1) Goes through the process of generating a query, 2) Executes the query to return potential explanations.

In the code, the crossover from step 1 to step 2 is here:

diff_query = _diff_query(

In your proposal, you'd like to use part 1 without relying on part 2. Since part 1 isn't exposed as a public API just yet, one way to accomplish that would be to extract diff_query (all the part 1 stuff) from diff, and then both the dbt wrapper and diff can call it for execution purposes.

Query generation executes queries as well

The explanations.diff query doesn't just "statically compile" SQL---it has to execute queries in order to know how to parametrize the SQL it's generating. This is a bit tougher to solve than the previous issue. Here are a few examples:

  • To filter the explanations to only ones that pass the support threshold, the code queries for the number of rows in the test relation:
    min_support_rows = floor(num_test_rows * min_support)
  • In order to consider range-valued columns (e.g., floating point numbers) for explanations, the code breaks those columns into 15 equi-sized buckets (it queries the underlying relations to determine those buckets), and then rewrites the test/control relations to include bucketed versions of the columns:
    range_statistics = range_valued_statistics(

The first issue seems easier to solve --- we can turn the support query into a subquery, and even benefit from one less round-trip to the database. The second issue will require a little bit of thought. Can dbt handle "I need to run a query inside this macro," or should we brainstorm ways to push the bucketing/rewrite logic into the query as well?

If neither queries-inside-macros nor pushing down the queries works, we can still explore an alternative where bucketing is its own macro distinct from diffing, but I'd love to avoid changing the diff API if at all possible.

Generating SQL beyond datools.diff

While diff is certainly the most exciting part of datools, there are other query generation goodies. For example, as part of making diff work on postgres, I've been implementing GROUPING SETS support for databases that don't implement grouping sets natively (e.g., SQLite, Redshift) and ones that do (e.g., DuckDB, Postgres). You could imagine dbt macros for that as well!