TobikoData / sqlmesh

Efficient data transformation and modeling framework that is backwards compatible with dbt.

Invalid SQL generated for the ROW type in Trino unit test fixtures

erindru opened this issue · comments

Here is another Trino edge case; I'm not sure whether it's a SQLMesh or a SQLGlot issue.

Given a model like so:

MODEL (
    name test.test_model,
    kind FULL
);

select
    max( as day,
    meta.type as type
from test.metadata
group by 1, 2

And a test like so:

test_sqlmesh:
  model: test.test_model
  inputs:
    test.metadata:
      columns:
        meta: ROW(day date, type varchar)
      rows:
        - meta:
            day: 2024-03-30
            type: foo
        - meta:
            day: 2024-03-31
            type: bar
        - meta:
            day: 2024-04-01
            type: baz
  outputs:
    query:
      rows:
        - day: 2024-04-01
          type: baz

When running (against a default_test_connection pointed at Trino), it fails:

$ sqlmesh --debug test tests/test_model.yaml 
ERROR: test_sqlmesh (tests/test_model.yaml)
trino.exceptions.TrinoUserError: TrinoUserError(type=USER_ERROR, name=MISMATCHED_COLUMN_ALIASES, message="line 1:147: Column alias list has 1 entries but 't' has 2 columns available", query_id=20240430_192625_00091_yysrm)

The reason for this failure is the SQL that SQLMesh generates to create the test fixture. From the logs, it tries to execute:

CREATE OR REPLACE VIEW datalake.sqlmesh_test_g087nhkb."datalake__test__metadata" AS
SELECT CAST(meta AS ROW(day DATE, type VARCHAR)) AS meta
FROM (VALUES
    (CAST(ROW(CAST('2024-03-30' AS DATE), 'foo') AS ROW(day DATE, type VARCHAR))),
    (CAST(ROW(CAST('2024-03-31' AS DATE), 'bar') AS ROW(day DATE, type VARCHAR))),
    (CAST(ROW(CAST('2024-04-01' AS DATE), 'baz') AS ROW(day DATE, type VARCHAR)))
) AS t(meta)

The problem is that when CAST(ROW(CAST('2024-03-30' AS DATE), 'foo') AS ROW(day DATE, type VARCHAR)) is put inside VALUES(), Trino unpacks the ROW into two columns. This can be shown like so:

Running SELECT CAST(ROW(CAST('2024-03-30' AS DATE), 'foo') AS ROW(day DATE, type VARCHAR)) correctly produces a ROW:

{day=2024-03-30, type=foo}

However, wrapping it in VALUES by running SELECT * FROM ( VALUES ( CAST(ROW(CAST('2024-03-30' AS DATE), 'foo') AS ROW(day DATE, type VARCHAR)) ) ) "helpfully" unpacks the ROW into multiple columns:

_col0 _col1
2024-03-30 foo

This is what causes the error: Column alias list has 1 entries but 't' has 2 columns available

The correct syntax reassembles the ROW from the unpacked top-level columns, something like:

SELECT CAST((t.col1, t.col2) AS ROW(day DATE, type VARCHAR)) AS meta
FROM (VALUES
    (CAST(ROW(CAST('2024-03-30' AS DATE), 'foo') AS ROW(day DATE, type VARCHAR))),
    (CAST(ROW(CAST('2024-03-31' AS DATE), 'bar') AS ROW(day DATE, type VARCHAR))),
    (CAST(ROW(CAST('2024-04-01' AS DATE), 'baz') AS ROW(day DATE, type VARCHAR)))
) AS t(col1, col2)
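
To make the shape of that query concrete, here is a minimal Python sketch of how a fixture query of this form could be generated. It is not SQLMesh's or SQLGlot's actual implementation: `row_fixture_sql` and `quote` are hypothetical helpers, identifiers are not quoted, and every value is rendered as a string literal cast to its field type. It is also a slight variation on the query above. Rather than relying on Trino unpacking a ROW inside VALUES, it emits one scalar per field and rebuilds the ROW in the outer SELECT.

```python
def quote(value):
    """Render a value as a single-quoted SQL string literal (hypothetical helper)."""
    return "'" + str(value).replace("'", "''") + "'"


def row_fixture_sql(column, fields, rows):
    """Build a SELECT that yields one ROW-typed column from flat VALUES.

    column: name of the ROW column, e.g. "meta"
    fields: list of (field_name, sql_type) pairs describing the ROW
    rows:   list of dicts mapping field_name -> value
    """
    # One top-level column alias per ROW field: col1, col2, ...
    aliases = [f"col{i}" for i in range(1, len(fields) + 1)]
    row_type = "ROW(" + ", ".join(f"{n} {t}" for n, t in fields) + ")"
    # Each VALUES entry carries the fields as separate scalar columns.
    values = ",\n".join(
        "    (" + ", ".join(f"CAST({quote(r[n])} AS {t})" for n, t in fields) + ")"
        for r in rows
    )
    refs = ", ".join(f"t.{a}" for a in aliases)
    # The outer SELECT reassembles the scalars into a single ROW column.
    return (
        f"SELECT CAST(ROW({refs}) AS {row_type}) AS {column}\n"
        f"FROM (VALUES\n{values}\n) AS t({', '.join(aliases)})"
    )


sql = row_fixture_sql(
    "meta",
    [("day", "DATE"), ("type", "VARCHAR")],
    [
        {"day": "2024-03-30", "type": "foo"},
        {"day": "2024-03-31", "type": "bar"},
        {"day": "2024-04-01", "type": "baz"},
    ],
)
print(sql)
```

Because the alias list `t(col1, col2)` is derived from the same field list as the VALUES columns, the two always have the same arity, which is the mismatch behind the MISMATCHED_COLUMN_ALIASES error.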

Alternatively, instead of SQLMesh generating a query from the YAML to produce the test fixture, maybe we could add a feature that lets the user supply their own SELECT query to produce the data?


Interesting! Yeah this looks like another SQLMesh edge case. Thanks for reporting Erin, I'll take a look soon.