Deduplication doesn't work: `TypeError: Parser.__init__() got an unexpected keyword argument 'dialect'`
James-Osmond opened this issue · comments
What happens?
My colleague @Joe-Pawley and I have been trying without success to run a basic deterministic deduplication on a dataset. This exact code ran when we were using Splink v3.9.0, but since upgrading to 3.9.12 the same code fails. I provide the code below. The dataset `df` has columns `bothinitials`, `sex`, `dob`, and `sector`, as well as a `unique_id` column.
I checked the demos in this directory and cannot see what we are doing so differently that would break the code. Do you have any idea what might be causing this error, or have you seen it before?
To Reproduce
```python
from splink.duckdb.linker import DuckDBLinker

deterministic_brs = [
    # Non-null match on both initials
    "l.bothinitials = r.bothinitials AND "
    # Non-null match on date of birth
    "l.dob = r.dob AND "
    # Non-null match on sex
    "l.sex = r.sex AND "
    # Non-null match on postcode sector
    "l.sector = r.sector"
]

deterministic_settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": deterministic_brs,
}

deterministic_linker = DuckDBLinker(
    df, deterministic_settings
)
```
produces the following error:
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 5
      1 deterministic_settings = {
      2     "link_type": "dedupe_only",
      3     "blocking_rules_to_generate_predictions": deterministic_brs
      4 }
----> 5 deterministic_linker = DuckDBLinker(
      6     df, deterministic_settings
      7 )

File ~\.conda\envs\cstr-env\lib\site-packages\splink\duckdb\linker.py:180, in DuckDBLinker.__init__(self, input_table_or_tables, settings_dict, connection, set_up_basic_logging, output_schema, input_table_aliases, validate_settings)
    177 except ImportError:
    178     pass
--> 180 super().__init__(
    181     input_tables,
    182     settings_dict,
    183     accepted_df_dtypes,
    184     set_up_basic_logging,
    185     input_table_aliases=input_aliases,
    186     validate_settings=validate_settings,
    187 )
    189 # Quickly check for casting error in duckdb/pandas
    190 for i, (table, alias) in enumerate(zip(input_tables, input_aliases)):

File ~\.conda\envs\cstr-env\lib\site-packages\splink\linker.py:237, in Linker.__init__(self, input_table_or_tables, settings_dict, accepted_df_dtypes, set_up_basic_logging, input_table_aliases, validate_settings)
    235 self._validate_settings_components(settings_dict)
    236 settings_dict = deepcopy(settings_dict)
--> 237 self._setup_settings_objs(settings_dict)
    239 homogenised_tables, homogenised_aliases = self._register_input_tables(
    240     input_table_or_tables,
    241     input_table_aliases,
    242     accepted_df_dtypes,
    243 )
    245 self._input_tables_dict = self._get_input_tables_dict(
    246     homogenised_tables, homogenised_aliases
    247 )

File ~\.conda\envs\cstr-env\lib\site-packages\splink\linker.py:498, in Linker._setup_settings_objs(self, settings_dict)
    496     self._settings_obj_ = None
    497 else:
--> 498     self._settings_obj_ = Settings(settings_dict)

File ~\.conda\envs\cstr-env\lib\site-packages\splink\settings.py:80, in Settings.__init__(self, settings_dict)
     76 self._warn_if_no_null_level_in_comparisons()
     78 self._additional_cols_to_retain = self._get_raw_additional_cols_to_retain
     79 self._additional_columns_to_retain_list = (
---> 80     self._get_additional_columns_to_retain()
     81 )

File ~\.conda\envs\cstr-env\lib\site-packages\splink\settings.py:132, in Settings._get_additional_columns_to_retain(self)
    127 for br in self._blocking_rules_to_generate_predictions:
    128     used_by_brs.extend(
    129         get_columns_used_from_sql(br.blocking_rule_sql, br.sql_dialect)
    130     )
--> 132 used_by_brs = [InputColumn(c) for c in used_by_brs]
    134 used_by_brs = [c.unquote().name for c in used_by_brs]
    135 already_used = self._columns_used_by_comparisons

File ~\.conda\envs\cstr-env\lib\site-packages\splink\settings.py:132, in <listcomp>(.0)
    127 for br in self._blocking_rules_to_generate_predictions:
    128     used_by_brs.extend(
    129         get_columns_used_from_sql(br.blocking_rule_sql, br.sql_dialect)
    130     )
--> 132 used_by_brs = [InputColumn(c) for c in used_by_brs]
    134 used_by_brs = [c.unquote().name for c in used_by_brs]
    135 already_used = self._columns_used_by_comparisons

File ~\.conda\envs\cstr-env\lib\site-packages\splink\input_column.py:185, in InputColumn.__init__(self, raw_column_name_or_column_reference, settings_obj, sql_dialect)
    179 # Handle the case that the column name is a sql keyword like 'group'
    180 self.input_name: str = self._quote_if_sql_keyword(
    181     raw_column_name_or_column_reference
    182 )
    184 self.col_builder: SqlglotColumnTreeBuilder = (
--> 185     SqlglotColumnTreeBuilder.from_raw_column_name_or_column_reference(
    186         raw_column_name_or_column_reference,
    187         sqlglot_dialect=self.sql_dialect,
    188     )
    189 )

File ~\.conda\envs\cstr-env\lib\site-packages\splink\input_column.py:109, in SqlglotColumnTreeBuilder.from_raw_column_name_or_column_reference(cls, input_str, sqlglot_dialect)
    107 # If the raw string parses to a valid signature, use it
    108 try:
--> 109     tree = sqlglot.parse_one(input_str, dialect=sqlglot_dialect)
    110 except (sqlglot.ParseError, sqlglot.TokenError):
    111     pass

File ~\.conda\envs\cstr-env\lib\site-packages\sqlglot\__init__.py:150, in parse_one(sql, read, into, **opts)
    148     result = dialect.parse_into(into, sql, **opts)
    149 else:
--> 150     result = dialect.parse(sql, **opts)
    152 for expression in result:
    153     if not expression:

File ~\.conda\envs\cstr-env\lib\site-packages\sqlglot\dialects\dialect.py:151, in Dialect.parse(self, sql, **opts)
    150 def parse(self, sql: str, **opts) -> t.List[t.Optional[exp.Expression]]:
--> 151     return self.parser(**opts).parse(self.tokenize(sql), sql)

File ~\.conda\envs\cstr-env\lib\site-packages\sqlglot\dialects\dialect.py:174, in Dialect.parser(self, **opts)
    173 def parser(self, **opts) -> Parser:
--> 174     return self.parser_class(  # type: ignore
    175         **{
    176             "index_offset": self.index_offset,
    177             "unnest_column_only": self.unnest_column_only,
    178             "alias_post_tablesample": self.alias_post_tablesample,
    179             "null_ordering": self.null_ordering,
    180             **opts,
    181         },
    182     )

TypeError: Parser.__init__() got an unexpected keyword argument 'dialect'
```
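The bottom of the traceback shows the essential shape of the failure: a keyword option (`dialect`) passed to `parse_one` is forwarded blindly into the parser constructor, which does not accept it. A minimal stand-in illustrating that pattern (hypothetical classes, not sqlglot's real code):

```python
# Stand-in for the failure mode (hypothetical classes, not sqlglot's real code).
class Parser:
    # A constructor that accepts only a fixed set of options - no `dialect`.
    def __init__(self, index_offset=0, null_ordering="nulls_are_small"):
        self.index_offset = index_offset
        self.null_ordering = null_ordering

def parse(sql, **opts):
    # Unrecognised options (like dialect="duckdb") are forwarded blindly,
    # so Parser.__init__ raises TypeError.
    return Parser(**opts)

try:
    parse("l.dob = r.dob", dialect="duckdb")
except TypeError as exc:
    print(exc)  # e.g. "Parser.__init__() got an unexpected keyword argument 'dialect'"
```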
OS:
Windows 10
Splink version:
3.9.12
Have you tried this on the latest `master` branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree
Hmmm, looks like the error might be originating from `sqlglot` - can I check what version of that package you have installed?
Right, had a quick check and it looks like this breaks for all `sqlglot` < 17.1.0. If you are able to upgrade to any version >= 17.1.0 or v18 that will fix the problem, but we will look into how we can fix things for older versions.
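One way to check whether the installed `sqlglot` meets that >= 17.1.0 threshold is via the standard library's `importlib.metadata`. A sketch (the package name and threshold come from this thread; the helper names are made up, and the version parse is naive, ignoring pre-release tags):

```python
from importlib import metadata

def parse_version(version_string):
    # Naive numeric parse: "17.1.0" -> (17, 1, 0); ignores pre-release suffixes.
    return tuple(int(part) for part in version_string.split(".")[:3])

def sqlglot_is_new_enough(minimum=(17, 1, 0)):
    # True only if sqlglot is installed and at or above the minimum version.
    try:
        installed = metadata.version("sqlglot")
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) >= minimum
```

If this returns `False`, upgrading (e.g. `pip install "sqlglot>=17.1.0"`) should resolve the error.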
Okay, there is a fix in for older versions of `sqlglot` in #1996, which I think should solve this issue. Once that is merged you can install the latest version from `master` to avoid this, and it will be included in the next release.
Thank you for this @ADBond - updating `sqlglot` to v18.17.0 fixed the issue. It also seems that we needed to put the DataFrame inside a list.
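The list-wrapping point reflects a common API pattern where a function accepts either a single table or a list of tables and normalises internally. A sketch of that normalisation (hypothetical helper, not Splink's actual code):

```python
def ensure_table_list(input_table_or_tables):
    # Accept a single table or a list of tables; always return a list,
    # so downstream code can iterate uniformly.
    if isinstance(input_table_or_tables, list):
        return input_table_or_tables
    return [input_table_or_tables]
```

Under that pattern, the working call from this thread would be along the lines of `DuckDBLinker([df], deterministic_settings)`.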