Sqlglot 23.0.0 breaks EM Training
ADBond opened this issue
What happens?
Via _get_comparison_levels_corresponding_to_training_blocking_rule and ComparisonLevel._is_exact_match we end up with the error:
    if identifiers[0] == identifiers[1]:
       ~~~~~~~~~~~^^^
IndexError: list index out of range
This works on sqlglot v22.2.1 (as in the lockfile); I'm not sure of the exact sqlglot version where it breaks.
To Reproduce
With sqlglot==23.0.0:
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker
from splink.datasets import splink_datasets

settings = {
    "link_type": "dedupe_only",
    "unique_id_column_name": "unique_id",
    "comparisons": [
        cl.levenshtein_at_thresholds(
            "first_name", [2], term_frequency_adjustments=True
        ),
        cl.levenshtein_at_thresholds("surname", [1], term_frequency_adjustments=True),
    ],
}

df = splink_datasets.fake_1000
linker = DuckDBLinker(df, settings)

blocking_rule = "l.first_name = r.first_name"
training_session_fn = linker.estimate_parameters_using_expectation_maximisation(
    blocking_rule
)
OS:
Debian
Splink version:
3.9.13
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree
Given

import sqlglot

sql = "mycol_l = mycol_r"
sql_syntax_tree = sqlglot.parse_one(sql)
for i in sql_syntax_tree.walk():
    # do something
    pass

each i yielded by walk() was a tuple of length 3 before sqlglot 23.0.0. Now it is not a tuple but a sqlglot expression.
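
One possible way to cope with both behaviours is a small shim that unwraps the tuple form when it sees it. This is only a sketch and not the fix adopted in Splink: walk_nodes is a hypothetical helper name, and it assumes the node is the first element of the old 3-tuple.

```python
def walk_nodes(tree):
    """Yield each node from tree.walk(), whatever the sqlglot version.

    Assumption: older sqlglot releases yield tuples whose first element
    is the node, while 23.0.0 and later yield the node itself.
    """
    for item in tree.walk():
        # Unwrap the tuple form; pass the expression form through unchanged.
        yield item[0] if isinstance(item, tuple) else item


# Hypothetical usage, with sqlglot installed:
#   import sqlglot
#   tree = sqlglot.parse_one("mycol_l = mycol_r")
#   identifiers = [
#       n for n in walk_nodes(tree) if isinstance(n, sqlglot.exp.Identifier)
#   ]
```

Code that indexes into the collected identifiers, as in the traceback above, would still want a length check before comparing the first two entries.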