moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

Home Page: https://moj-analytical-services.github.io/splink/


[FEAT] Allow fuzzy matches on array-valued columns

samkodes opened this issue

Is your proposal related to a problem?

One typical use of array-valued columns is when one known entity has multiple values for a single attribute, a match on any of which is acceptable. For example, one person may have multiple telephone numbers, an address history, or even multiple names (married/unmarried names, aliases, etc.). While exploding the records outside of splink and post-processing the matches is one alternative to array-valued columns, this affects match probabilities and may create memory challenges.

Currently the only way to compare array-valued columns is to use the array_intersect functions, which require exact matches among the elements of the array.
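
For reference, this exact-match behaviour can be expressed directly in SQL. The snippet below is only an illustration in the DuckDB dialect (with made-up column names), not necessarily the exact SQL splink generates:

-- true when the two arrays share at least one identical element
len(array_intersect(name_l, name_r)) >= 1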

However, in many circumstances, these values are prone to error and fuzzy matching may be required to identify matches.

Describe the solution you'd like

I propose a family of matching functions of the form array_min_sim(x, y, simtype) and array_max_sim(x, y, simtype). Here x and y are arrays, and "simtype" is a string identifying one of the existing fuzzy string comparison metrics (alternatively, a separate array function could be implemented for each metric).

The semantics would be to return the maximum or minimum similarity (alternatively, distance) between an element of array x and an element of array y. These max/min similarities could be thresholded to get comparison levels.

I'm not sure what the best implementation of this would be. DuckDB may allow an implementation using an iterated call to list_reduce. For example, for array_min_sim, the underlying code could be a list_reduce applied to [Infinity, x] (Infinity prepended to x), with reduce function (a, b) -> min(a, list_reduce([Infinity, y], (c, d) -> min(c, dist(b, d)))).
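
For what it's worth, a similar min-over-all-pairs expression can be written in DuckDB without seeding the reduce with Infinity, by nesting list_transform and list_min. This is only a sketch: the column names are illustrative, Levenshtein stands in for whichever metric is wanted, and it assumes the inner lambda can reference the outer lambda's parameter (which recent DuckDB versions appear to allow):

-- smallest Levenshtein distance between any element of name_l and any element of name_r
list_min(
    list_transform(name_l, a ->
        list_min(list_transform(name_r, b -> levenshtein(a, b)))
    )
)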

Alternatively, the arrays could be exploded locally and a full Cartesian product constructed temporarily (with no other variables, to reduce memory overhead), the string comparison run on the expanded data, and the result then aggregated with min/max. I believe this is similar to what is/was being considered for array-valued blocking keys (#1448) (though I can't say I understand the internals well enough to know whether this would work!).
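
To make the explode idea concrete, the core of it might look something like the following (written in Spark SQL for the convenience of LATERAL VIEW EXPLODE; the table and column names are assumptions and the real internals would differ):

-- explode both arrays into their Cartesian product per candidate pair,
-- score every element combination, then aggregate back to one row per pair
SELECT
    unique_id_l,
    unique_id_r,
    MIN(LEVENSHTEIN(a, b)) AS min_lev
FROM pairs
LATERAL VIEW EXPLODE(name_l) t1 AS a
LATERAL VIEW EXPLODE(name_r) t2 AS b
GROUP BY unique_id_l, unique_id_r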

Describe alternatives you've considered

Un-nesting the data before running splink and post-processing is one option, but this affects m-values and creates memory challenges via duplicated records.

Using array intersections is another alternative, but it demands exact matching.

Additional context

Linking similar requests: #1582 and #1337.

Yeah - so we've been experimenting with these new DuckDB functions too. We plan to include a function in the comparison library/comparison level library eventually, but for the moment you can do it using custom SQL. Here are some pointers if you didn't work it out already:

https://gist.github.com/RobinL/d8a84f7a31fa7cb17dafb05c94518225
https://moj-analytical-services.github.io/splink/topic_guides/comparisons/customising_comparisons.html?h=custom#method-4-providing-the-spec-as-a-dictionary
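
To give a flavour, the sort of custom SQL condition you could drop into a comparison level via the dictionary spec might look like this (just a sketch with made-up column names and threshold, not the exact SQL from the gist; DuckDB dialect):

-- true when the best Jaro-Winkler similarity across all element pairs is at least 0.9
list_max(
    list_transform(name_l, a ->
        list_max(list_transform(name_r, b -> jaro_winkler_similarity(a, b)))
    )
) >= 0.9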

Thanks, that may be a neater approach. Do you have any idea how to make this work on Spark? I see there are similar array functions, and I will experiment with them.

For Spark we define a custom UDF, DualArrayExplode, in the included JAR, which performs the Cartesian product of two arrays. So the sort of SQL you would use would be, for example:

ARRAY_MIN(
    TRANSFORM(DUALARRAYEXPLODE(name_l, name_r), x -> LEVENSHTEIN(x['_1'], x['_2']))
) <= 2

Thanks. Following Robin's sample code, I think UDFs can be avoided with an iterated transform to do the Cartesian product (a little easier to understand than my iterated reduce proposal, though it involves one extra array operation and possibly intermediate storage). The approach can be put together as something like this:

ARRAY_MIN(
    TRANSFORM(
        FLATTEN(TRANSFORM(name_l, a -> TRANSFORM(name_r, b -> ARRAY(a, b)))), -- make list of pairs
        p -> LEVENSHTEIN(p[0], p[1]) -- calc distance between the elements of each pair (array subscripts are 0-based in Spark SQL)
    )
) -- take minimum

I'm planning to work on a PR for this!