Leakage guards not applied during tests for incremental-by-time-range model
giovannipcarvalho opened this issue
As I understand it, in an INCREMENTAL_BY_TIME_RANGE model, whatever is returned by my query gets filtered to avoid inserting data from outside the date interval being processed [1]
SQLMesh also uses the time column to automatically append a time range filter to the model's query at runtime, which prevents records that are not part of the target interval from being stored
I am observing behavior which seems to indicate that these leakage guards are applied during normal execution (sqlmesh plan) but are not applied during tests (sqlmesh test).
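To make the reported difference concrete, here is a minimal sketch using Python's built-in sqlite3 instead of SQLMesh and DuckDB. It runs a query shaped like the model in this issue twice: once as written, and once wrapped in an outer filter on the time column. The helper `apply_time_range_guard` is a hypothetical name for illustration only; it is not a SQLMesh API and only approximates the idea of an appended time range filter.

```python
import sqlite3


def apply_time_range_guard(model_query: str, time_column: str) -> str:
    """Hypothetical helper (not a SQLMesh API): wrap a query in an outer
    filter so only rows inside the target interval survive."""
    return (
        f"SELECT * FROM ({model_query}) AS _q "
        f"WHERE {time_column} BETWEEN :start AND :end"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seed (event_date TEXT)")
conn.executemany(
    "INSERT INTO seed VALUES (?)",
    [("2024-04-01",), ("2024-04-02",), ("2024-04-03",)],
)

# Same shape as the model query: the second branch ignores the interval.
model_query = """
    SELECT event_date, 'ok' AS y FROM seed
    WHERE event_date BETWEEN :start AND :end
    UNION ALL
    SELECT '2030-01-01' AS event_date, 'bad' AS y
"""
params = {"start": "2024-04-01", "end": "2024-04-02"}

# The query as written: the out-of-range 'bad' row is returned.
raw_rows = conn.execute(model_query, params).fetchall()

# The same query behind the outer filter: the 'bad' row is dropped.
guarded_rows = conn.execute(
    apply_time_range_guard(model_query, "event_date"), params
).fetchall()

print(sorted(raw_rows))
print(sorted(guarded_rows))
```

In this sketch the unwrapped query corresponds to what the test sees, and the wrapped query to what ends up stored after a plan.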
versions:
λ sqlmesh --version
0.94.0
λ duckdb --version
v0.10.1 4a89d97db8
λ python --version --version
Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0]
setup:
an incremental_by_time_range model, which intentionally unions data outside its start and end range
code
λ tail -n99 models/* seeds/* tests/* config.yaml
==> models/foo.sql <==
MODEL (
  name hello.foo,
  start '2024-04-01',
  end '2024-12-12',
  kind incremental_by_time_range(
    time_column event_date
  ),
);
select event_date, 'ok' as y from hello.seed
where event_date between @start_ds and @end_ds
union all
select
  '2030-01-01'::date as event_date,
  'bad' as y
==> models/seed.sql <==
MODEL (
  name hello.seed,
  kind SEED (
    path '../seeds/seed.csv'
  ),
  columns (
    event_date date
  )
);
==> seeds/seed.csv <==
event_date
2024-04-01
2024-04-02
2024-04-03
==> tests/test_foo.yaml <==
test_foo:
  model: hello.foo
  inputs:
    hello.seed:
      - event_date: 2024-04-01
      - event_date: 2024-04-02
      - event_date: 2024-04-03
  outputs:
    query:
      - event_date: 2024-04-01
        y: ok
      - event_date: 2024-04-02
        y: ok
  vars:
    start: 2024-04-01
    end: 2024-04-02
==> config.yaml <==
gateways:
  local:
    connection:
      type: duckdb
      database: db.db
default_gateway: local
model_defaults:
  dialect: duckdb
repro:
λ rm -f db.db && sqlmesh plan --skip-tests --auto-apply
λ duckdb db.db 'select * from hello.foo' # no bad data here
┌────────────┬─────────┐
│ event_date │    y    │
│    date    │ varchar │
├────────────┼─────────┤
│ 2024-04-01 │ ok      │
│ 2024-04-02 │ ok      │
│ 2024-04-03 │ ok      │
└────────────┴─────────┘
λ sqlmesh test # but 'bad' row is present here
F
======================================================================
FAIL: test_foo (/tmp/sqlmesh-leakage-guard/tests/test_foo.yaml)
----------------------------------------------------------------------
AssertionError: Data mismatch (rows are different)
Unexpected rows:
event_date y
0 2030-01-01 bad
----------------------------------------------------------------------
Ran 1 test in 0.022s
FAILED (failures=1)
- The 'bad' row is not present in hello.foo after plan, as I expected (it has a date which falls outside the model's start and end attributes, but I also verified by setting --start and --end manually in a separate environment)
- The 'bad' row is present in the model output during tests, which I did not expect (start and end in vars: do not include the date from 2030)
- Note that the third row from the seed data (which is replicated in the test case) is not included, as it also does not fall within start/end in vars:.
Is my expectation wrong, or maybe I am doing something wrong/unsupported?
Thanks in advance.
refs
[1] https://sqlmesh.readthedocs.io/en/stable/concepts/models/model_kinds/#time-column
the test runs the values of render_query, which doesn't have extras associated with the model kind.
this way it's easier to understand what your model is producing as you wrote it. one of the main intentions of unit tests is to ensure that the values haven't changed as you perform refactors, or that they change in expected ways.
in this case, data leakage is something there to prevent you from shooting yourself in the foot, but i would say it's not best practice to rely on this. so i think the test exposing that you have leaking data is a good thing and should be addressed in your model.
i'm closing this for now since it seems to be working as intended, and in this case found an issue in your query (even though sqlmesh's guard rails do catch it)
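One way to address the leak in the query itself, sketched here as an assumption rather than as the maintainers' recommended fix, is to move the interval filter outside the union so that it covers every branch. The example uses Python's built-in sqlite3 as a stand-in for DuckDB, with `:start` and `:end` playing the role of `@start_ds` and `@end_ds`; the expected rows are the ones listed under outputs in tests/test_foo.yaml.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seed (event_date TEXT)")
conn.executemany(
    "INSERT INTO seed VALUES (?)",
    [("2024-04-01",), ("2024-04-02",), ("2024-04-03",)],
)

# The interval filter now sits outside the union, so the hard-coded
# 2030 row is filtered by the query itself, with no guard involved.
fixed_query = """
    SELECT event_date, y FROM (
        SELECT event_date, 'ok' AS y FROM seed
        UNION ALL
        SELECT '2030-01-01' AS event_date, 'bad' AS y
    ) AS unioned
    WHERE event_date BETWEEN :start AND :end
"""

rows = conn.execute(
    fixed_query, {"start": "2024-04-01", "end": "2024-04-02"}
).fetchall()

# Rows the unit test in this issue expects for start/end in vars.
expected = [("2024-04-01", "ok"), ("2024-04-02", "ok")]
print(sorted(rows) == expected)
```

With the filter applied to the whole union, the query output no longer depends on a guard being appended at runtime, which is the property the unit test checks.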
> the test runs the values of render_query
ah, I overlooked this! I also agree it’s a sensible decision. Thanks for the quick reply.