Sqlglot 23.0.0 breaks EM Training

ADBond opened this issue 4 months ago · comments

What happens?

Via _get_comparison_levels_corresponding_to_training_blocking_rule and ComparisonLevel._is_exact_match we end up with the following error:

if identifiers[0] == identifiers[1]:
       ~~~~~~~~~~~^^^
IndexError: list index out of range

This works on sqlglot v22.2.1 (the version in the lockfile). I'm not sure of the exact sqlglot version where it breaks.

To Reproduce

With sqlglot==23.0.0:

import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker
from splink.datasets import splink_datasets

settings = {
    "link_type": "dedupe_only",
    "unique_id_column_name": "unique_id",
    "comparisons": [
        cl.levenshtein_at_thresholds(
            "first_name", [2], term_frequency_adjustments=True
        ),
        cl.levenshtein_at_thresholds("surname", [1], term_frequency_adjustments=True),
    ],
}

df = splink_datasets.fake_1000

linker = DuckDBLinker(df, settings)
blocking_rule = "l.first_name = r.first_name"
training_session_fn = linker.estimate_parameters_using_expectation_maximisation(
    blocking_rule
)

OS:

Debian

Splink version:

3.9.13

Have you tried this on the latest `master` branch?

I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

I agree

Robin Linacre · Answer 1 · Wed Mar 20 2024 00:46:15 GMT+0800 (China Standard Time)

Given

import sqlglot

sql = "mycol_l = mycol_r"
sql_syntax_tree = sqlglot.parse_one(sql)
for i in sql_syntax_tree.walk():
    ...  # do something with i

`i` was a tuple of length 3 before sqlglot 23.0.0.

From 23.0.0 onwards it is no longer a tuple, but a sqlglot expression.

Robin Linacre · Answer 2 · Wed Mar 20 2024 00:56:34 GMT+0800 (China Standard Time)

tobymao/sqlglot@8f09dce
