IndexError: List index out of range when calling linker.estimate_parameters_using_expectation_maximisation(training_blocking_rule)

Question

sthamodh opened this issue 4 months ago

sthamodh commented 4 months ago

What happens?

Hi,

The code I am running is taken from the Spark example page.

I have had this code work many times before without any issues.

The error I get comes from the expectation maximisation step, as shown below.

I did some digging and traced the error back to the call below. When I run this call on its own after the SparkLinker step, I get the same error as in the screenshots below.

`linker._settings_obj._get_comparison_levels_corresponding_to_training_blocking_rule("l.first_name = r.first_name and l.surname = r.surname")`

I have no idea why this happens. All of the code I developed for my own solution (deduplicating corporate addresses) also fails at this step.
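Since the error is only shown in screenshots, a text copy of the traceback may be easier to search and quote. The snippet below is a minimal sketch of one way to capture it. The helper name `capture_traceback` is hypothetical, and the failing Splink call is replaced by a stand-in that raises the same exception type, so the snippet runs without a linker.

```python
import traceback


def capture_traceback(fn):
    """Call fn() and return the formatted traceback as text, or None if fn succeeds.

    Hypothetical helper for illustration; it is not part of Splink.
    """
    try:
        fn()
    except Exception:
        return traceback.format_exc()
    return None


# Stand-in for the failing call: indexing an empty list raises IndexError.
# In the real session the lambda would wrap the linker call quoted above.
report = capture_traceback(lambda: [][0])
print(report)
```

The returned string can be pasted into the issue as plain text alongside the screenshots.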

To Reproduce

I used the Spark example.
I ran it on a single-user cluster on Databricks with 12.2 LTS ML as the runtime. Here is a screenshot of the cluster configuration:

OS:

Databricks runtime version: 12.2 LTS ML (includes Apache Spark 3.3.2, Scala 2.12)

Splink version:

3.9.13
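It can help to report the versions of related packages along with the Splink version. The snippet below is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the choice of `sqlglot` and `pyspark` as packages worth listing is an assumption, not something confirmed in this thread.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None. Hypothetical helper for
    illustration; it is not part of Splink.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of packages to report; adjust to the environment in question.
report = installed_versions(["splink", "sqlglot", "pyspark"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Running this in a notebook cell on the cluster gives a short text summary of the environment that can be pasted into the report.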

Have you tried this on the latest `master` branch?

I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

I agree

ADBond · Answer 1 · Wed Mar 20 2024 09:48:09 GMT+0800 (China Standard Time)

See this comment

Robin Linacre · Answer 2 · Tue Mar 26 2024 14:31:46 GMT+0800 (China Standard Time)

Closed by #2079
