moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

Home Page: https://moj-analytical-services.github.io/splink/

SQL syntax error

lamaeldo opened this issue

What happens?

When running the expectation-maximisation training sessions (used to estimate the m probabilities), I get the following error:

splink.exceptions.SplinkException: Error executing the following sql for table __splink__m_u_counts(__splink__m_u_counts_f8b550e08):

    CREATE TABLE __splink__m_u_counts_f8b550e08
    AS
    (WITH __splink__df_comparison_vectors as (select * from __splink__df_comparison_vectors_2b87b76bf),

__splink__df_match_weight_parts as (
select "source_dataset_l","source_dataset_r","unique_id_l","unique_id_r",match_key
from __splink__df_comparison_vectors
),
__splink__df_predict as (
select
log2(cast(0.0006382379750372257 as float8) * ) as match_weight,
CASE WHEN THEN 1.0 ELSE (cast(0.0006382379750372257 as float8) * )/(1+(cast(0.0006382379750372257 as float8) * )) END as match_probability,
"source_dataset_l","source_dataset_r","unique_id_l","unique_id_r",match_key
from __splink__df_match_weight_parts

order by 1
)
select 0 as comparison_vector_value,
       sum(match_probability * 1) /
           sum(1) as m_count,
       sum((1-match_probability) * 1) /
           sum(1) as u_count,
       '_probability_two_random_records_match' as output_column_name
from __splink__df_predict
)

Error was: Parser Error: syntax error at or near ")"

It seems like there's a term missing in the calculation of the match_weight: the multiplication after the cast has no right-hand operand, and the CASE WHEN has no condition.
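
A minimal sketch of how an expression like this could arise, assuming the SQL is built by joining a list of per-comparison terms with " * " and that list ends up empty. This is not Splink's actual code: the helper names, the idea of a list of "Bayes factor columns", and the column names bf_first_name and bf_surname are all hypothetical and used only for illustration.

```python
from __future__ import annotations


def build_match_weight_sql(prior_odds: str, bf_columns: list[str]) -> str:
    # Hypothetical helper for illustration only; NOT Splink's implementation.
    # Joining an empty list yields "", which leaves a dangling "*" in the SQL.
    product = " * ".join(bf_columns)
    return f"log2(cast({prior_odds} as float8) * {product}) as match_weight"


def build_match_weight_sql_checked(prior_odds: str, bf_columns: list[str]) -> str:
    # Same hypothetical helper, but failing early with a readable message
    # instead of handing malformed SQL to the database.
    if not bf_columns:
        raise ValueError("no Bayes factor columns to multiply")
    return build_match_weight_sql(prior_odds, bf_columns)


PRIOR = "0.0006382379750372257"  # value copied from the error message above

broken = build_match_weight_sql(PRIOR, [])
ok = build_match_weight_sql(PRIOR, ["bf_first_name", "bf_surname"])

print(broken)  # same shape as the failing line in the generated SQL
print(ok)
```

If something like this is what happened, it would suggest that the settings had no comparison columns at the point the SQL was generated, which might fit with a modified or partially reinstalled package.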

To Reproduce

I am using Splink 3.9.14 and DuckDB 10.2. I was actually debugging a separate performance issue (inference over ~40m comparisons was taking 10 minutes, when I expected to manage around a billion in that time, going by the benchmarks). I had made some edits to Splink's code, so I uninstalled and reinstalled the package.
I have a list of rules, and I loop over them, calling linker.estimate_parameters_using_expectation_maximisation(rule) for each one. The full notebook is attached:
error_reproduce.ipynb.txt
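
The loop described above can be sketched roughly as follows. This is only an outline under assumptions, not the notebook's code: train_over_rules is a hypothetical helper, linker is assumed to be an already configured Splink 3 linker, and the rules are whatever is passed in. The only call taken from the report is linker.estimate_parameters_using_expectation_maximisation(rule). Recording which rule fails, rather than stopping at the first error, may make an intermittent failure like this one easier to pin down.

```python
def train_over_rules(linker, rules):
    # Hypothetical helper: runs one EM training session per rule and
    # collects (rule, exception) pairs instead of aborting on the first error.
    failures = []
    for rule in rules:
        try:
            linker.estimate_parameters_using_expectation_maximisation(rule)
        except Exception as exc:
            # Broad catch for the sketch only; in practice this could be
            # narrowed to splink.exceptions.SplinkException, the exception
            # type shown in the error message.
            failures.append((rule, exc))
    return failures
```

Usage would be along the lines of failures = train_over_rules(linker, rules), followed by printing each failing rule together with its error.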

OS:

Windows 11

Splink version:

3.9.14

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

This fixed itself on my end with seemingly no change, so I assume you won't be able to reproduce it.