TobikoData / sqlmesh

Efficient data transformation and modeling framework that is backwards compatible with dbt.

Home Page: https://sqlmesh.com

Leakage guards not applied during tests for incremental-by-time-range model

giovannipcarvalho opened this issue

As I understand it, in an INCREMENTAL_BY_TIME_RANGE model, whatever my query returns gets filtered to avoid inserting data from outside the date interval being processed [1]:

SQLMesh also uses the time column to automatically append a time range filter to the model's query at runtime, which prevents records that are not part of the target interval from being stored
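
To make the quoted behavior concrete, here is a minimal sketch of what such a guard amounts to. It assumes the guard is equivalent to an outer filter on the time column wrapped around the model query; the exact SQL that SQLMesh generates may differ. SQLite (from the Python standard library) stands in for DuckDB, and the table name `seed` and the variables `model_query` and `guarded_query` are illustrative, not part of SQLMesh.

```python
import sqlite3

# SQLite stands in for DuckDB; dates are stored as ISO-8601 text,
# so string comparison orders them correctly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seed (event_date TEXT)")
conn.executemany(
    "INSERT INTO seed VALUES (?)",
    [("2024-04-01",), ("2024-04-02",), ("2024-04-03",)],
)

# The model query as written: one filtered branch, one unfiltered branch.
model_query = """
    SELECT event_date, 'ok' AS y FROM seed
    WHERE event_date BETWEEN :start AND :end
    UNION ALL
    SELECT '2030-01-01' AS event_date, 'bad' AS y
"""

# Assumed shape of the leakage guard: an outer filter on the time column.
guarded_query = f"""
    SELECT * FROM ({model_query}) AS q
    WHERE event_date BETWEEN :start AND :end
"""

params = {"start": "2024-04-01", "end": "2024-04-02"}
raw_rows = sorted(conn.execute(model_query, params).fetchall())
guarded_rows = sorted(conn.execute(guarded_query, params).fetchall())

print(raw_rows)      # includes the out-of-range 'bad' row
print(guarded_rows)  # only rows inside the interval
```

Under this assumption, the unwrapped query still returns the 2030 row, while the wrapped one drops it.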

I am observing behavior which seems to indicate that these leakage guards are applied during normal execution (sqlmesh plan) but not during tests (sqlmesh test).

versions:

λ sqlmesh --version
0.94.0
λ duckdb --version
v0.10.1 4a89d97db8
λ python --version --version
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0]

setup:

  • an incremental_by_time_range model that intentionally unions in a row dated outside its start and end range
code:
λ tail -n99 models/* seeds/* tests/* config.yaml 
==> models/foo.sql <==
MODEL (
    name hello.foo,
    start '2024-04-01',
    end '2024-12-12',
    kind incremental_by_time_range(
        time_column event_date
    ),
);


select event_date, 'ok' as y from hello.seed
where event_date between @start_ds and @end_ds

union all

select
'2030-01-01'::date as event_date,
'bad' as y

==> models/seed.sql <==
MODEL (
    name hello.seed,
    kind SEED (
        path '../seeds/seed.csv'
    ),
    columns (
        event_date date
    )
);

==> seeds/seed.csv <==
event_date
2024-04-01
2024-04-02
2024-04-03

==> tests/test_foo.yaml <==
test_foo:
  model: hello.foo
  inputs:
    hello.seed:
      - event_date: 2024-04-01
      - event_date: 2024-04-02
      - event_date: 2024-04-03
  outputs:
    query:
      - event_date: 2024-04-01
        y: ok
      - event_date: 2024-04-02
        y: ok
  vars:
    start: 2024-04-01
    end: 2024-04-02

==> config.yaml <==
gateways:
    local:
        connection:
            type: duckdb
            database: db.db

default_gateway: local

model_defaults:
    dialect: duckdb

repro:

λ rm -f db.db && sqlmesh plan --skip-tests --auto-apply
λ duckdb db.db 'select * from hello.foo' # no bad data here
┌────────────┬─────────┐
│ event_date │    y    │
│    date    │ varchar │
├────────────┼─────────┤
│ 2024-04-01 │ ok      │
│ 2024-04-02 │ ok      │
│ 2024-04-03 │ ok      │
└────────────┴─────────┘

λ sqlmesh test # but 'bad' row is present here
F
======================================================================
FAIL: test_foo (/tmp/sqlmesh-leakage-guard/tests/test_foo.yaml)
----------------------------------------------------------------------
AssertionError: Data mismatch (rows are different)

Unexpected rows:

  event_date    y
0 2030-01-01  bad

----------------------------------------------------------------------
Ran 1 test in 0.022s

FAILED (failures=1)
  • The 'bad' row is not present in hello.foo after plan, as I expected (its date falls outside the model's start and end attributes; I also verified this by setting --start and --end manually in a separate environment).
  • The 'bad' row is present in the model output during tests, which I did not expect (the start and end in vars: do not include the 2030 date).
  • Note that the third row from the seed data (which is replicated in the test case) is not included, as it also falls outside the start/end in vars:.

Is my expectation wrong, or maybe I am doing something wrong/unsupported?

Thanks in advance.

refs

[1] https://sqlmesh.readthedocs.io/en/stable/concepts/models/model_kinds/#time-column

the test runs the values of render_query, which doesn't include the extras associated with the model kind.

this way it's easier to understand what your model is producing as you wrote it. one of the main intentions of unit tests is to ensure that the values haven't changed as you perform refactors, or that they change in expected ways.

in this case, the data leakage guard is there to prevent you from shooting yourself in the foot, but i would say it's not best practice to rely on it. so i think the test exposing that your model is leaking data is a good thing, and it should be addressed in your model.

i'm closing this for now since it seems to be working as intended, and in this case it found an issue in your query (even though sqlmesh's guard rails do catch it)
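
One way to address the leak in the model itself, sketched under assumptions: apply the time filter to the whole union rather than to a single branch, so the query as written never emits out-of-range rows and a unit test sees the same rows the guard would leave. SQLite again stands in for DuckDB, the `:start`/`:end` parameters play the role of `@start_ds`/`@end_ds`, and `fixed_query` is an illustrative name only.

```python
import sqlite3

# SQLite stands in for DuckDB; dates are ISO-8601 text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seed (event_date TEXT)")
conn.executemany(
    "INSERT INTO seed VALUES (?)",
    [("2024-04-01",), ("2024-04-02",), ("2024-04-03",)],
)

# The time filter now wraps the whole union, so every branch is covered.
fixed_query = """
    SELECT event_date, y FROM (
        SELECT event_date, 'ok' AS y FROM seed
        UNION ALL
        SELECT '2030-01-01' AS event_date, 'bad' AS y
    ) AS unioned
    WHERE event_date BETWEEN :start AND :end
"""

params = {"start": "2024-04-01", "end": "2024-04-02"}
fixed_rows = sorted(conn.execute(fixed_query, params).fetchall())
print(fixed_rows)  # no 'bad' row, even without any outer guard
```

With the filter applied this way, the expected output in tests/test_foo.yaml (the two 'ok' rows) would match what the query itself returns for the same start and end.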

"the test runs the values of render_query"

ah, I overlooked this! I also agree it’s a sensible decision. Thanks for the quick reply.