moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

Home Page: https://moj-analytical-services.github.io/splink/


ParseError: Required keyword: 'this' missing for <class 'sqlglot.expressions.EQ'>

theimanph opened this issue · comments

What happens?

Hi,

I am trying to run the spark example: https://moj-analytical-services.github.io/splink/demos/examples/spark/deduplicate_1k_synthetic.html and the error I am getting is: ParseError: Required keyword: 'this' missing for <class 'sqlglot.expressions.EQ'>. Line 1, Col: 65.
l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1

Any ideas on what is going wrong and what I can do about it? Thank you!!

Sincerely,

tom

To Reproduce

%pip install pyspark==3.4.1
#%pip install pyspark
%pip install --upgrade --force-reinstall pyarrow
%pip install pyodbc
%pip install duckdb
%pip install splink
%pip install usaddress
%pip install nbformat

import pyodbc
import os
import pandas as pd
import re
import usaddress
import time

from pyspark.sql import SparkSession
import pyspark.sql.functions as pyfuncs
from pyspark.sql.types import *
from pyspark.sql import Window

from splink.spark.jar_location import similarity_jar_location
path = similarity_jar_location()

print('create spark sesh')
spark = (
    SparkSession.builder
    .appName("tomssplinktest")
    .config("spark.master", "spark://ddlas01.hosted.lac.com:7077")
    .config("spark.executor.memory", "45g")
    .config("spark.driver.memory", "10g")
    .config('spark.executor.cores', '1')
    .config('spark.cores.max', '8')
    .config('spark.executor.instances', '1')
    .config('spark.jars', path)
    .config('spark.sql.parquet.int96RebaseModeInWrite', "CORRECTED")
    .getOrCreate()
)
print("created spark sesh!")

Disable warnings for pyspark - you don't need to include this

import warnings
spark.sparkContext.setLogLevel("ERROR")
warnings.simplefilter("ignore", UserWarning)

from splink.datasets import splink_datasets
pandas_df = splink_datasets.fake_1000

df = spark.createDataFrame(pandas_df)

import splink.spark.comparison_library as cl
import splink.spark.comparison_template_library as ctl
from splink.spark.blocking_rule_library import block_on

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        "l.surname = r.surname",  # alternatively, you can write BRs in their SQL form
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "em_convergence": 0.01,
}

from splink.spark.linker import SparkLinker
linker = SparkLinker(df, settings, spark=spark)
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email",
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)

OS:

linux

Splink version:

most recent pypi version

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

This seems to be an error originating from sqlglot somehow.
Would you be able to post the version of sqlglot you have installed (and other package versions)? And the full error stacktrace?

Hi, the full list of installed packages is here:

altair==5.2.0
arrow==1.3.0
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work
attrs==23.2.0
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
clarabel==0.7.0
comm @ file:///home/conda/feedstock_root/build_artifacts/comm_1704278392174/work
cramjam==2.8.1
cvxopt==1.3.2
cvxpy==1.4.2
debugpy @ file:///home/conda/feedstock_root/build_artifacts/debugpy_1695534290440/work
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
DedupliPy==0.8
duckdb==0.9.2
ecos==2.0.13
et-xmlfile==1.1.0
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1704921103267/work
executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1698579936712/work
fancyimpute==0.7.0
fastjsonschema==2.19.1
fastparquet==2024.2.0
fsspec==2024.2.0
future==0.18.3
idna==3.6
importlib-metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1703269254275/work
iniconfig==2.0.0
ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1705417941265/work
ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1706795662110/work
ipywidgets==8.1.2
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work
Jinja2==3.1.3
joblib==1.3.2
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter_client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1699283905679/work
jupyter_core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1704727045866/work
jupyterlab_widgets==3.0.10
knnimpute==0.1.0
Levenshtein==0.25.0
MarkupSafe==2.1.5
matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1660814786464/work
modAL-python==0.4.2.1
nbformat==5.9.2
nest_asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1705850609492/work
networkx==3.2.1
nose==1.3.7
numpy==1.26.4
openpyxl==3.1.2
osqp==0.6.5
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1696202382185/work
pandas==2.2.0
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1706113125309/work
phonetics==1.0.5
pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work
platformdirs @ file:///home/conda/feedstock_root/build_artifacts/platformdirs_1706713388748/work
pluggy==1.4.0
probableparsing==0.0.1
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1702399386289/work
psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1705722403006/work
ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1642875951954/work
py4j==0.10.9.7
pyarrow==15.0.0
pybind11==2.11.1
pycparser==2.21
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1700607939962/work
PyMinHash==0.1.5
pyodbc==5.0.1
pyspark==3.4.1
pytest==8.0.2
python-crfsuite==0.9.10
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work
python-Levenshtein==0.25.0
pytz==2024.1
pyzmq @ file:///home/conda/feedstock_root/build_artifacts/pyzmq_1701783162530/work
qdldl==0.1.7.post0
rapidfuzz==3.6.1
referencing==0.33.0
requests==2.31.0
rpds-py==0.17.1
rpy2==3.5.15
scikit-learn==1.4.1.post1
scipy==1.11.4
scs==3.2.4.post1
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
skorch==0.9.0
splink==3.9.12
sqlglot==18.17.0
stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1669632077133/work
tabulate==0.9.0
thefuzz==0.22.1
threadpoolctl==3.3.0
toolz==0.12.1
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1695373450800/work
tqdm==4.66.2
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1704212992681/work
types-python-dateutil==2.8.19.20240106
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1702176139754/work
tzdata==2023.4
tzlocal==5.2
urllib3==2.2.0
usaddress==0.5.10
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1704731205417/work
widgetsnbextension==4.0.10
zingg==0.4.0
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1695255097490/work

The full error trace is here:

--WARN--
You are using datediff comparison
with str-casting and ANSI is not enabled. Bad dates
e.g. 1999-13-54 will not trigger an exception but will
classed as comparison level = "ELSE". Ensure date strings
are cleaned to remove bad dates


ParseError Traceback (most recent call last)
Cell In[12], line 10
2 linker = SparkLinker(df, settings, spark=spark)
3 deterministic_rules = [
4 "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
5 "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
6 "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
7 "l.email = r.email"
8 ]
---> 10 linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/splink/linker.py:3695, in Linker.estimate_probability_two_random_records_match(self, deterministic_matching_rules, recall)
3692 if isinstance(deterministic_matching_rules, str):
3693 deterministic_matching_rules = [deterministic_matching_rules]
-> 3695 records = cumulative_comparisons_generated_by_blocking_rules(
3696 self,
3697 deterministic_matching_rules,
3698 )
3700 summary_record = records[-1]
3701 num_observed_matches = summary_record["cumulative_rows"]

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/splink/analyse_blocking.py:86, in cumulative_comparisons_generated_by_blocking_rules(linker, blocking_rules, output_chart, return_dataframe)
83 cartesian = calculate_cartesian(row_count_df, settings_obj._link_type)
85 # Calculate the total number of rows generated by each blocking rule
---> 86 sql_infos = block_using_rules_sqls(linker)
87 for sql_info in sql_infos:
88 linker._enqueue_sql(sql_info["sql"], sql_info["output_table_name"])

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/splink/blocking.py:542, in block_using_rules_sqls(linker)
539 br_sqls = []
541 for br in blocking_rules:
--> 542 sql = br.create_blocked_pairs_sql(linker, where_condition, probability)
543 br_sqls.append(sql)
545 sql = " UNION ALL ".join(br_sqls)

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/splink/blocking.py:112, in BlockingRule.create_blocked_pairs_sql(self, linker, where_condition, probability)
111 def create_blocked_pairs_sql(self, linker: Linker, where_condition, probability):
--> 112 columns_to_select = linker._settings_obj._columns_to_select_for_blocking
113 sql_select_expr = ", ".join(columns_to_select)
115 sql = f"""
116 select
117 {sql_select_expr}
(...)
125 {self.exclude_pairs_generated_by_all_preceding_rules_sql(linker)}
126 """

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/splink/settings.py:222, in Settings._columns_to_select_for_blocking(self)
219 cols = []
221 for uid_col in self._unique_id_input_columns:
--> 222 cols.append(uid_col.l_name_as_l)
223 cols.append(uid_col.r_name_as_r)
225 for cc in self.comparisons:

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/splink/input_column.py:256, in InputColumn.l_name_as_l(self)
254 @property
255 def l_name_as_l(self) -> str:
--> 256 alias = self.unquote().name_l
257 return replace(self.col_builder, table="l", alias=alias).sql

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/splink/input_column.py:219, in InputColumn.unquote(self)
218 def unquote(self) -> InputColumn:
--> 219 self_copy = deepcopy(self)
220 b = replace(self_copy.col_builder, quoted=False)
221 self_copy.col_builder = b

File ~/miniforge3/envs/tomenv2/lib/python3.9/copy.py:172, in deepcopy(x, memo, _nil)
170 y = x
171 else:
--> 172 y = _reconstruct(x, memo, *rv)
174 # If is its own copy, don't memoize.
175 if y is not x:

File ~/miniforge3/envs/tomenv2/lib/python3.9/copy.py:270, in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
268 if state is not None:
269 if deep:
--> 270 state = deepcopy(state, memo)
271 if hasattr(y, '__setstate__'):
272 y.__setstate__(state)

File ~/miniforge3/envs/tomenv2/lib/python3.9/copy.py:146, in deepcopy(x, memo, _nil)
144 copier = _deepcopy_dispatch.get(cls)
145 if copier is not None:
--> 146 y = copier(x, memo)
147 else:
148 if issubclass(cls, type):

File ~/miniforge3/envs/tomenv2/lib/python3.9/copy.py:230, in _deepcopy_dict(x, memo, deepcopy)
228 memo[id(x)] = y
229 for key, value in x.items():
--> 230 y[deepcopy(key, memo)] = deepcopy(value, memo)
231 return y

File ~/miniforge3/envs/tomenv2/lib/python3.9/copy.py:153, in deepcopy(x, memo, _nil)
151 copier = getattr(x, "__deepcopy__", None)
152 if copier is not None:
--> 153 y = copier(memo)
154 else:
155 reductor = dispatch_table.get(cls)

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/splink/settings.py:87, in Settings.__deepcopy__(self, memo)
83 def __deepcopy__(self, memo) -> Settings:
84 """When we do EM training, we need a copy of the Settings which is independent
85 of the original e.g. modifying the copy will not affect the original.
86 This method implements ensures the Settings can be deepcopied."""
---> 87 cc = Settings(self.as_dict())
88 return cc

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/splink/settings.py:80, in Settings.__init__(self, settings_dict)
76 self._warn_if_no_null_level_in_comparisons()
78 self._additional_cols_to_retain = self._get_raw_additional_cols_to_retain
79 self._additional_columns_to_retain_list = (
---> 80 self._get_additional_columns_to_retain()
81 )

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/splink/settings.py:129, in Settings._get_additional_columns_to_retain(self)
126 used_by_brs = []
127 for br in self._blocking_rules_to_generate_predictions:
128 used_by_brs.extend(
--> 129 get_columns_used_from_sql(br.blocking_rule_sql, br.sql_dialect)
130 )
132 used_by_brs = [InputColumn(c) for c in used_by_brs]
134 used_by_brs = [c.unquote().name for c in used_by_brs]

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/splink/parse_sql.py:12, in get_columns_used_from_sql(sql, dialect, retain_table_prefix)
10 def get_columns_used_from_sql(sql, dialect=None, retain_table_prefix=False):
11 column_names = set()
---> 12 syntax_tree = sqlglot.parse_one(sql, read=dialect)
14 for subtree in syntax_tree.find_all(exp.Column):
15 # check if any parents are lambdas
16 parent = subtree.parent

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/sqlglot/__init__.py:125, in parse_one(sql, read, dialect, into, **opts)
123 result = dialect.parse_into(into, sql, **opts)
124 else:
--> 125 result = dialect.parse(sql, **opts)
127 for expression in result:
128 if not expression:

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/sqlglot/dialects/dialect.py:311, in Dialect.parse(self, sql, **opts)
310 def parse(self, sql: str, **opts) -> t.List[t.Optional[exp.Expression]]:
--> 311 return self.parser(**opts).parse(self.tokenize(sql), sql)

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/sqlglot/parser.py:986, in Parser.parse(self, raw_tokens, sql)
972 def parse(
973 self, raw_tokens: t.List[Token], sql: t.Optional[str] = None
974 ) -> t.List[t.Optional[exp.Expression]]:
975 """
976 Parses a list of tokens and returns a list of syntax trees, one tree
977 per parsed SQL statement.
(...)
984 The list of the produced syntax trees.
985 """
--> 986 return self._parse(
987 parse_method=self.__class__._parse_statement, raw_tokens=raw_tokens, sql=sql
988 )

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/sqlglot/parser.py:1052, in Parser._parse(self, parse_method, raw_tokens, sql)
1049 self._tokens = tokens
1050 self._advance()
-> 1052 expressions.append(parse_method(self))
1054 if self._index < len(self._tokens):
1055 self.raise_error("Invalid expression / Unexpected token")

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/sqlglot/parser.py:1241, in Parser._parse_statement(self)
1238 if self._match_set(Tokenizer.COMMANDS):
1239 return self._parse_command()
-> 1241 expression = self._parse_expression()
1242 expression = self._parse_set_operations(expression) if expression else self._parse_select()
1243 return self._parse_query_modifiers(expression)

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/sqlglot/parser.py:3175, in Parser._parse_expression(self)
3174 def _parse_expression(self) -> t.Optional[exp.Expression]:
-> 3175 return self._parse_alias(self._parse_conjunction())

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/sqlglot/parser.py:3178, in Parser._parse_conjunction(self)
3177 def _parse_conjunction(self) -> t.Optional[exp.Expression]:
-> 3178 return self._parse_tokens(self._parse_equality, self.CONJUNCTION)

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/sqlglot/parser.py:4840, in Parser._parse_tokens(self, parse_method, expressions)
4837 def _parse_tokens(
4838 self, parse_method: t.Callable, expressions: t.Dict
4839 ) -> t.Optional[exp.Expression]:
-> 4840 this = parse_method()
4842 while self._match_set(expressions):
4843 this = self.expression(
4844 expressions[self._prev.token_type],
4845 this=this,
4846 comments=self._prev_comments,
4847 expression=parse_method(),
4848 )

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/sqlglot/parser.py:3181, in Parser._parse_equality(self)
3180 def _parse_equality(self) -> t.Optional[exp.Expression]:
-> 3181 return self._parse_tokens(self._parse_comparison, self.EQUALITY)

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/sqlglot/parser.py:4843, in Parser._parse_tokens(self, parse_method, expressions)
4840 this = parse_method()
4842 while self._match_set(expressions):
-> 4843 this = self.expression(
4844 expressions[self._prev.token_type],
4845 this=this,
4846 comments=self._prev_comments,
4847 expression=parse_method(),
4848 )
4850 return this

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/sqlglot/parser.py:1116, in Parser.expression(self, exp_class, comments, **kwargs)
1114 instance = exp_class(**kwargs)
1115 instance.add_comments(comments) if comments else self._add_comments(instance)
-> 1116 return self.validate_expression(instance)

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/sqlglot/parser.py:1136, in Parser.validate_expression(self, expression, args)
1134 if self.error_level != ErrorLevel.IGNORE:
1135 for error_message in expression.error_messages(args):
-> 1136 self.raise_error(error_message)
1138 return expression

File ~/miniforge3/envs/tomenv2/lib/python3.9/site-packages/sqlglot/parser.py:1096, in Parser.raise_error(self, message, token)
1084 error = ParseError.new(
1085 f"{message}. Line {token.line}, Col: {token.col}.\n"
1086 f" {start_context}\033[4m{highlight}\033[0m{end_context}",
(...)
1092 end_context=end_context,
1093 )
1095 if self.error_level == ErrorLevel.IMMEDIATE:
-> 1096 raise error
1098 self.errors.append(error)

ParseError: Required keyword: 'this' missing for <class 'sqlglot.expressions.EQ'>. Line 1, Col: 65.
l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1
Thank you!!

tom

I've run this with the deterministic rules list and the same version of sqlglot, and I don't get the error.

What happens if you run:

import sqlglot
sqlglot.parse_one("l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1")

What do you get if you run:

import sqlglot
sqlglot.__version__

?

Hi, from the first query:

import sqlglot
sqlglot.parse_one("l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1")

I get: (AND this:
(EQ this:
(COLUMN this:
(IDENTIFIER this: first_name, quoted: False), table:
(IDENTIFIER this: l, quoted: False)), expression:
(COLUMN this:
(IDENTIFIER this: first_name, quoted: False), table:
(IDENTIFIER this: r, quoted: False))), expression:
(LTE this:
(LEVENSHTEIN this:
(COLUMN this:
(IDENTIFIER this: dob, quoted: False), table:
(IDENTIFIER this: r, quoted: False)), expression:
(COLUMN this:
(IDENTIFIER this: dob, quoted: False), table:
(IDENTIFIER this: l, quoted: False))), expression:
(LITERAL this: 1, is_string: False)))

From the second query:

import sqlglot
sqlglot.__version__

I get: '18.17.0'

Thank you!!

hmm - something strange seems to be going on, because splink seems to be generating an error when it attempts to run sqlglot.parse_one("l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1"), but when you run the same statement in isolation, it doesn't error.

I have run your suggested script on my side under the same splink/sqlglot versions without error.

You're absolutely certain that in the environment you're running splink, it's definitely sqlglot==18.17.0 and it isn't somehow using an alternative environment for your Splink work?

For instance, immediately after you get the error, you're able to run sqlglot.parse_one("l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1") successfully?
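
One way to rule that out (just a minimal check, run in the very same cell/session that raises the error) is to print which interpreter and which sqlglot module the kernel is actually resolving:

import sys
import sqlglot

print(sys.executable)       # interpreter the notebook kernel is using
print(sqlglot.__file__)     # where sqlglot is being imported from
print(sqlglot.__version__)  # should match the 18.17.0 you reported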

I am... I ran the import sqlglot and sqlglot.__version__ code in the same JupyterLab notebook.

Do you get the same error if you run the same code in duckdb?


from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000


import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.blocking_rule_library import block_on

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        "l.surname = r.surname",  # alternatively, you can write BRs in their SQL form
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "em_convergence": 0.01,
}


linker = DuckDBLinker(df, settings)
deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email",
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)

Hi,

I get the results that I expected: Probability two random records match is estimated to be 0.00389.
This means that amongst all possible pairwise record comparisons, one in 257.25 are expected to match. With 499,500 total possible comparisons, we expect a total of around 1,941.67 matching pairs

The DuckDB version works really well!! Unfortunately, we have 94 million records to process, so that won't do it... That's why I have been trying to get the Spark version going. Thank you!!

Sincerely,

tom

Struggling a bit with what to suggest, sorry! You could try upgrading to the latest sqlglot, I guess, or maybe downgrading splink to an earlier version, say 3.9.5?

Hi, I tried both upgrading and downgrading both Splink and sqlglot. Same error.

Hi - I don't know if this helps, but I got exactly this error (same version of sqlglot) when I copied and pasted this example from the github.io page. That page has the deterministic rules written with "&lt;= 1" at the end instead of "<= 1", which causes a parse error. The demo file in the repo (docs/demos/examples/spark/deduplicate_1k_synthetic.ipynb) has the correct text for the rules.

Thanks so much for this @mattjbishop! I think this looks like it is probably the issue - @theimanph if I look at the source text of your comment I see that the code block has exactly these character entity refs:

deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) &lt;= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) &lt;= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) &lt;= 2",
    "l.email = r.email"
]

I think that because the code was not contained in a code block, GitHub renders '&lt;' as '<' automatically, so we never spotted the problem - the code appears fine when rendered on GitHub.
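
A quick way to confirm this locally (just a sketch, assuming the pasted strings really do contain the literal '&lt;' entity) is to parse both variants with sqlglot, and unescape the entities before using the rules:

import html
import sqlglot
from sqlglot.errors import ParseError

bad_rule = "l.first_name = r.first_name and levenshtein(r.dob, l.dob) &lt;= 1"

try:
    sqlglot.parse_one(bad_rule)
except ParseError as e:
    print("fails to parse:", e)  # the stray '&lt;' entity breaks the SQL

fixed_rule = html.unescape(bad_rule)  # turns '&lt;=' back into '<='
sqlglot.parse_one(fixed_rule)         # parses cleanly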

Hi,

Thank you!! I am trying to run that version of the code and am now running into this error: ImportError: cannot import name 'block_on' from 'splink.spark.blocking_rule_library' (/home/c265616/miniforge3/envs/tomenv2/lib/python3.9/site-packages/splink/spark/blocking_rule_library.py). Any ideas? Thank you!!

Sincerely,

tom

Not sure - possibly you don't have the latest version of splink. Feel free to ask a question in the discussion forum. Closing this issue.
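
If it helps, a quick sanity check (a sketch, assuming you run it in the same notebook kernel) is to confirm which splink version that environment actually has, and upgrade it if it is behind the docs you are following:

from importlib.metadata import version
print(version("splink"))  # the splink version the kernel is importing

# if it is older than the docs assume, upgrading may be enough to get block_on:
# %pip install --upgrade splink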

Hi Robin,

Thank you!!

Sincerely,

tom