moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

Home Page: https://moj-analytical-services.github.io/splink/

IndexError: List index out of range when calling linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)

sthamodh opened this issue

What happens?

Hi,

The code I am running is taken from the Spark example on the examples page.

I have had this code work many times before without any issues.

The error comes from the expectation maximisation step, as shown below:

[Screenshot: list_index_error_part1]

I did some digging and traced the error back to the call below. When I ran this line on its own after the SparkLinker step, I got the same error, as shown in the screenshots that follow it.

linker._settings_obj._get_comparison_levels_corresponding_to_training_blocking_rule("l.first_name = r.first_name and l.surname = r.surname")

[Screenshot: list_index_error_part2]

[Screenshot: list_index_error_part3]
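
Because the traceback survives here only as screenshots, a text copy is easier to search and quote. A minimal sketch of capturing one is below. `capture_traceback` is a hypothetical helper, not part of Splink, and the commented usage assumes the `linker` object built in the Spark example.

```python
import traceback


def capture_traceback(fn, *args, **kwargs):
    """Call fn and return (result, None), or (None, traceback text) if it raises."""
    try:
        return fn(*args, **kwargs), None
    except Exception:
        # format_exc() renders the active exception the way Python would print it
        return None, traceback.format_exc()


# Hypothetical usage with the linker from the Spark example:
# rule = "l.first_name = r.first_name and l.surname = r.surname"
# _, tb = capture_traceback(
#     linker.estimate_parameters_using_expectation_maximisation, rule
# )
# print(tb)
```

The returned string can be pasted into the issue as text alongside, or instead of, the screenshots.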

I have no idea why this happens. All of the code in my own solution (deduplicating corporate addresses) also fails at this step.

To Reproduce

  1. I used the Spark example.
  2. I ran it on a single-user cluster on Databricks with the 12.2 LTS ML runtime. Here is a screenshot of the cluster configuration:

[Screenshot: cluster_configuration]

OS:

Databricks runtime version: 12.2 LTS ML (includes Apache Spark 3.3.2, Scala 2.12)

Splink version:

3.9.13
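
The Splink version alone may not be enough to reproduce the problem, since the packages installed alongside it can differ between clusters. A minimal sketch for listing them is below. `installed_versions` is a hypothetical helper, and the choice of `splink`, `pyspark` and `sqlglot` as the packages to check is an assumption, not something this report confirms.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


# The package names below are assumptions about what is relevant to this report.
for name, ver in installed_versions(["splink", "pyspark", "sqlglot"]).items():
    print(f"{name}: {ver}")
```

Running this in a notebook cell on the affected cluster gives a short list that can be pasted under the version heading.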

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

Closed by #2079