moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

Home Page:https://moj-analytical-services.github.io/splink/


Deduplication doesn't work `TypeError: Parser.__init__() got an unexpected keyword argument 'dialect'`

James-Osmond opened this issue · comments

What happens?

My colleague @Joe-Pawley and I have been trying without success to run a basic deterministic deduplication on a dataset. This exact code ran under Splink v3.9.0, but now that we have upgraded to 3.9.12, the same code does not run. I provide the code below. The dataset `df` has columns `bothinitials`, `sex`, `dob`, and `sector`, as well as a `unique_id` column.

I checked the demos in this directory and cannot see what we are doing so differently that would be breaking the code. Do you have any idea what might be causing this error, or have you seen it before?

To Reproduce

from splink.duckdb.linker import DuckDBLinker

deterministic_brs = [
    # Non null match on both initials
    "l.bothinitials = r.bothinitials AND "
    # Non null match on date of birth
    "l.dob = r.dob AND "
    # Non null match on sex
    "l.sex = r.sex AND "
    # Non null match on postcode sector
    "l.sector = r.sector"
]

deterministic_settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": deterministic_brs
}
deterministic_linker = DuckDBLinker(
    df, deterministic_settings
)

produces the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 5
      1 deterministic_settings = {
      2     "link_type": "dedupe_only",
      3     "blocking_rules_to_generate_predictions": deterministic_brs
      4 }
----> 5 deterministic_linker = DuckDBLinker(
      6     df, deterministic_settings
      7 )

File ~\.conda\envs\cstr-env\lib\site-packages\splink\duckdb\linker.py:180, in DuckDBLinker.__init__(self, input_table_or_tables, settings_dict, connection, set_up_basic_logging, output_schema, input_table_aliases, validate_settings)
    177 except ImportError:
    178     pass
--> 180 super().__init__(
    181     input_tables,
    182     settings_dict,
    183     accepted_df_dtypes,
    184     set_up_basic_logging,
    185     input_table_aliases=input_aliases,
    186     validate_settings=validate_settings,
    187 )
    189 # Quickly check for casting error in duckdb/pandas
    190 for i, (table, alias) in enumerate(zip(input_tables, input_aliases)):

File ~\.conda\envs\cstr-env\lib\site-packages\splink\linker.py:237, in Linker.__init__(self, input_table_or_tables, settings_dict, accepted_df_dtypes, set_up_basic_logging, input_table_aliases, validate_settings)
    235     self._validate_settings_components(settings_dict)
    236     settings_dict = deepcopy(settings_dict)
--> 237     self._setup_settings_objs(settings_dict)
    239 homogenised_tables, homogenised_aliases = self._register_input_tables(
    240     input_table_or_tables,
    241     input_table_aliases,
    242     accepted_df_dtypes,
    243 )
    245 self._input_tables_dict = self._get_input_tables_dict(
    246     homogenised_tables, homogenised_aliases
    247 )

File ~\.conda\envs\cstr-env\lib\site-packages\splink\linker.py:498, in Linker._setup_settings_objs(self, settings_dict)
    496     self._settings_obj_ = None
    497 else:
--> 498     self._settings_obj_ = Settings(settings_dict)

File ~\.conda\envs\cstr-env\lib\site-packages\splink\settings.py:80, in Settings.__init__(self, settings_dict)
     76 self._warn_if_no_null_level_in_comparisons()
     78 self._additional_cols_to_retain = self._get_raw_additional_cols_to_retain
     79 self._additional_columns_to_retain_list = (
---> 80     self._get_additional_columns_to_retain()
     81 )

File ~\.conda\envs\cstr-env\lib\site-packages\splink\settings.py:132, in Settings._get_additional_columns_to_retain(self)
    127 for br in self._blocking_rules_to_generate_predictions:
    128     used_by_brs.extend(
    129         get_columns_used_from_sql(br.blocking_rule_sql, br.sql_dialect)
    130     )
--> 132 used_by_brs = [InputColumn(c) for c in used_by_brs]
    134 used_by_brs = [c.unquote().name for c in used_by_brs]
    135 already_used = self._columns_used_by_comparisons

File ~\.conda\envs\cstr-env\lib\site-packages\splink\settings.py:132, in <listcomp>(.0)
    127 for br in self._blocking_rules_to_generate_predictions:
    128     used_by_brs.extend(
    129         get_columns_used_from_sql(br.blocking_rule_sql, br.sql_dialect)
    130     )
--> 132 used_by_brs = [InputColumn(c) for c in used_by_brs]
    134 used_by_brs = [c.unquote().name for c in used_by_brs]
    135 already_used = self._columns_used_by_comparisons

File ~\.conda\envs\cstr-env\lib\site-packages\splink\input_column.py:185, in InputColumn.__init__(self, raw_column_name_or_column_reference, settings_obj, sql_dialect)
    179 # Handle the case that the column name is a sql keyword like 'group'
    180 self.input_name: str = self._quote_if_sql_keyword(
    181     raw_column_name_or_column_reference
    182 )
    184 self.col_builder: SqlglotColumnTreeBuilder = (
--> 185     SqlglotColumnTreeBuilder.from_raw_column_name_or_column_reference(
    186         raw_column_name_or_column_reference,
    187         sqlglot_dialect=self.sql_dialect,
    188     )
    189 )

File ~\.conda\envs\cstr-env\lib\site-packages\splink\input_column.py:109, in SqlglotColumnTreeBuilder.from_raw_column_name_or_column_reference(cls, input_str, sqlglot_dialect)
    107 # If the raw string parses to a valid signature, use it
    108 try:
--> 109     tree = sqlglot.parse_one(input_str, dialect=sqlglot_dialect)
    110 except (sqlglot.ParseError, sqlglot.TokenError):
    111     pass

File ~\.conda\envs\cstr-env\lib\site-packages\sqlglot\__init__.py:150, in parse_one(sql, read, into, **opts)
    148     result = dialect.parse_into(into, sql, **opts)
    149 else:
--> 150     result = dialect.parse(sql, **opts)
    152 for expression in result:
    153     if not expression:

File ~\.conda\envs\cstr-env\lib\site-packages\sqlglot\dialects\dialect.py:151, in Dialect.parse(self, sql, **opts)
    150 def parse(self, sql: str, **opts) -> t.List[t.Optional[exp.Expression]]:
--> 151     return self.parser(**opts).parse(self.tokenize(sql), sql)

File ~\.conda\envs\cstr-env\lib\site-packages\sqlglot\dialects\dialect.py:174, in Dialect.parser(self, **opts)
    173 def parser(self, **opts) -> Parser:
--> 174     return self.parser_class(  # type: ignore
    175         **{
    176             "index_offset": self.index_offset,
    177             "unnest_column_only": self.unnest_column_only,
    178             "alias_post_tablesample": self.alias_post_tablesample,
    179             "null_ordering": self.null_ordering,
    180             **opts,
    181         },
    182     )

TypeError: Parser.__init__() got an unexpected keyword argument 'dialect'

OS:

Windows 10

Splink version:

3.9.12

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Hmmm, it looks like the error might be originating from sqlglot. Can I check what version of that package you have installed?

Right, I had a quick check and it looks like this breaks for all sqlglot < 17.1.0. If you are able to upgrade to any version >= 17.1.0 (including v18), that will fix the problem, but I will look into how we can fix things for older versions.
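For anyone checking whether their installed sqlglot falls on the broken side of that threshold, a plain tuple comparison on the dotted version string is enough. This is a minimal sketch: the helper name is ours, and it ignores pre-release suffixes.

```python
def supports_dialect_kwarg(version: str) -> bool:
    """Return True if this sqlglot version is >= 17.1.0 (the threshold
    mentioned above). Compares dotted version strings numerically, so
    e.g. "16.8.1" sorts below "17.1.0" as (16, 8, 1) < (17, 1, 0)."""
    parts = tuple(int(p) for p in version.split("."))
    return parts >= (17, 1, 0)

print(supports_dialect_kwarg("16.8.1"))   # → False
print(supports_dialect_kwarg("18.17.0"))  # → True
```

You can get the installed version from `sqlglot.__version__` and feed it straight into a check like this.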

Okay, there is a fix for older versions of sqlglot in #1996, which I think should solve this issue. Once that is merged you can install the latest version from master to avoid this, and it will be included in the next release.
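The traceback shows the `TypeError` surfacing when `sqlglot.parse_one` is called with a `dialect` keyword that older sqlglot `Parser` versions do not accept, while the older `read=` spelling works on both sides of the threshold. A compatibility shim along these lines could paper over that. This is a sketch only, not the actual code in #1996; `parse_one_compat` and the stand-in function are hypothetical.

```python
def parse_one_compat(parse_one, sql, dialect):
    """Call parse_one with `dialect=`, falling back to the older `read=`."""
    try:
        return parse_one(sql, dialect=dialect)
    except TypeError:
        # sqlglot < 17.1.0 rejects the `dialect` keyword with a TypeError,
        # so retry with the older `read=` argument name
        return parse_one(sql, read=dialect)

# Stand-in mimicking an old sqlglot parse_one that only understands `read=`:
def old_parse_one(sql, read=None):
    return ("parsed", sql, read)

print(parse_one_compat(old_parse_one, "l.dob = r.dob", "duckdb"))
# → ('parsed', 'l.dob = r.dob', 'duckdb')
```

With a real sqlglot installed, you would pass `sqlglot.parse_one` as the first argument instead of the stand-in.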

Thank you for this @ADBond - updating sqlglot to v18.17.0 fixed the issue. It also seems that we needed to put the DataFrame inside a list.