moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

Home Page: https://moj-analytical-services.github.io/splink/

Sqlglot 23.0.0 breaks EM Training

ADBond opened this issue · comments

What happens?

Via _get_comparison_levels_corresponding_to_training_blocking_rule and ComparisonLevel._is_exact_match we end up with the following error:

if identifiers[0] == identifiers[1]:
       ~~~~~~~~~~~^^^
IndexError: list index out of range

This works with sqlglot v22.2.1 (the version in the lockfile). I'm not sure of the exact sqlglot version where it breaks.

To Reproduce

With sqlglot==23.0.0:

import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker
from splink.datasets import splink_datasets

settings = {
    "link_type": "dedupe_only",
    "unique_id_column_name": "unique_id",
    "comparisons": [
        cl.levenshtein_at_thresholds(
            "first_name", [2], term_frequency_adjustments=True
        ),
        cl.levenshtein_at_thresholds("surname", [1], term_frequency_adjustments=True),
    ],
}

df = splink_datasets.fake_1000

linker = DuckDBLinker(df, settings)
blocking_rule = "l.first_name = r.first_name"
training_session_fn = linker.estimate_parameters_using_expectation_maximisation(
    blocking_rule
)

OS:

Debian

Splink version:

3.9.13

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Given

import sqlglot

sql = "mycol_l = mycol_r"
sql_syntax_tree = sqlglot.parse_one(sql)
# walk() visits every node of the parsed expression tree
for i in sql_syntax_tree.walk():
    pass  # do something with i

Before sqlglot 23.0.0, each i yielded by walk() was a tuple of length 3 (node, parent, key).

From 23.0.0 onwards it is no longer a tuple: walk() yields the sqlglot expression itself.
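
One way to cope with both behaviours is to normalise whatever walk() yields before using it. The following is only a sketch, not necessarily the fix Splink adopted: node_from_walk_item and identifier_names are hypothetical helper names, and the tuple layout (node first) is assumed from the pre-23.0.0 behaviour described above.

```python
def node_from_walk_item(item):
    """Return the expression node from one item yielded by Expression.walk().

    Assumption: before sqlglot 23.0.0 each item is a (node, parent, key)
    tuple; from 23.0.0 onwards the item is the node itself.
    """
    return item[0] if isinstance(item, tuple) else item


def identifier_names(sql):
    """Collect identifier names from a SQL snippet on either sqlglot version."""
    # Imported lazily so node_from_walk_item has no third-party dependency.
    import sqlglot
    from sqlglot import exp

    tree = sqlglot.parse_one(sql)
    nodes = (node_from_walk_item(item) for item in tree.walk())
    return [node.name for node in nodes if isinstance(node, exp.Identifier)]
```

With this, identifier_names("mycol_l = mycol_r") is intended to give the two column names whichever sqlglot version is installed, so a check such as identifiers[0] == identifiers[1] would no longer index into an empty list.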