wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm

Home Page: https://seekstorm.com/blog/1000x-spelling-correction/


(using MySQL) How do I efficiently limit resulting dictionary candidates to only those with DamLev distance <=2?

rockroland opened this issue

My question is a continuation of an earlier question you answered for me on Jan 10, 2021.

(For reference, in my implementation I am storing the pre-calculated dictionary in a MySQL memory table with a BTree index. The largest dictionary I use has 450,000,000 rows and uses 15GB of memory yet is very very fast.)

My specific question is that I need a clarification of what you documented here:


Remark 2: There are four different comparison pair types:

  1. dictionary entry==input entry,
  2. delete(dictionary entry,p1)==input entry
  3. dictionary entry==delete(input entry,p2)
  4. delete(dictionary entry,p1)==delete(input entry,p2)

The last comparison type is required for replaces and transposes only. But we need to check whether the suggested dictionary term is really a replace or an adjacent transpose of the input term to prevent false positives of higher edit distance (bank==bnak and bank==bink, but bank!=kanb and bank!=xban and bank!=baxn).
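For reference, here is a small Python sketch (illustrative only, not SymSpell's actual C# implementation) of comparison type 4 for a maximum edit distance of 1: deleting one character on both sides finds the replace and adjacent-transpose cases from the remark, but also produces the false positives that the verification step has to reject.

```python
def single_deletes(term):
    """All strings obtainable by deleting exactly one character from term."""
    return {term[:i] + term[i + 1:] for i in range(len(term))}

dictionary_term = "bank"
for input_term in ("bnak", "bink", "kanb", "xban", "baxn"):
    shared = single_deletes(dictionary_term) & single_deletes(input_term)
    print(f"{dictionary_term} / {input_term}: shared deletes = {sorted(shared)}")

# bank / bnak: ['bak', 'bnk']  -> true positive (adjacent transpose, distance 1)
# bank / bink: ['bnk']         -> true positive (replace, distance 1)
# bank / kanb: []              -> correctly not found (distance > 1)
# bank / xban: ['ban']         -> false positive (actual distance 2)
# bank / baxn: ['ban']         -> false positive (actual distance 2)
```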


Comparison types 1, 2, and 3 are clear and obviously produce dictionary matches with a Levenshtein distance of 2 or less.

Comparison type 4 is where my question arises.

MY DATA:
If we assume that my "pre-calculated dictionary" has (a) dictionary words with frequency, (b) all dictionary words with 1 letter recursively deleted, and (c) all recursively deleted dictionary terms which each then have 1 more letter recursively deleted

AND

I have my "input" composed of (a) my input entry term (the misspelled term for which I seek dictionary suggestions), (b) my input entry term with 1 letter recursively deleted, and (c) all recursively deleted input entry terms which each then have 1 more letter recursively deleted

I can then look for matches between these two datasets ("pre-calculated dictionary", "input").
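To make that setup concrete, here is a rough Python sketch of the two delete datasets and the match lookup (illustrative only; the real pre-calculated dictionary lives in the MySQL table described above, and the toy word list below is made up):

```python
from collections import defaultdict

def deletes_up_to(term, max_distance=2):
    """The term itself (a), its 1-letter deletes (b), and their 1-letter deletes (c)."""
    results, frontier = {term}, {term}
    for _ in range(max_distance):
        frontier = {t[:i] + t[i + 1:] for t in frontier for i in range(len(t))}
        results |= frontier
    return results

# "Pre-calculated dictionary" side: map every delete stem to the words producing it.
word_frequencies = {"bank": 100, "bink": 5}          # toy data, not a real frequency list
stem_to_words = defaultdict(set)
for word in word_frequencies:
    for stem in deletes_up_to(word):
        stem_to_words[stem].add(word)

# "Input" side: generate the same deletes for the misspelled term and look them up.
candidates = set()
for stem in deletes_up_to("bnak"):
    candidates |= stem_to_words.get(stem, set())
print(candidates)   # {'bank', 'bink'} -- every candidate still has to be verified
```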

But it is clear that some matches between "pre-calculated dictionary" item (c) and "input" item (c) will produce results with a Levenshtein distance greater than 2, and I am unsure how to efficiently eliminate these high-edit-distance matches.

For example,
if my "input" term is "RESHAV" then one of elements of "input" item (c) would be "ESAV" (R and H are deleted) ...
and if one of my dictionary words is "EKASAV" then one element of my "pre-calculated dictionary" (c) would be "ESAV" (K and first A are deleted)

This would appear to be naive match "ESAV"="ESAV" but the Lev distance between "RESHAV" and "EKASAV" is 4.
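Catching such a pair seems to require an explicit distance calculation, e.g. (a plain Python sketch of the standard dynamic-programming Levenshtein distance, shown only to verify the numbers above, not part of my MySQL setup):

```python
def levenshtein(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("RESHAV", "EKASAV"))   # 4 -> must be discarded for max edit distance 2
```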

So my question is: do I need to calculate the Lev distance for each dictionary suggestion arising from Comparison type 4 in order to eliminate results with edit distance > 2?

Or, is there a clever way to limit comparisons between "pre-calculated dictionary" (c) elements and "input" (c) elements by bucketing such that the resulting suggestions can't have a Lev distance above 2? (...thus avoiding a Lev calculation on each result)

I considered limiting my comparison to "pre-calculated dictionary" (c) elements where the position of one of the deleted letters was the same as the position of one of the deleted letters from the "input" (c) elements, but I found that this idea produced some invalid results with a Lev distance of 3 and also excluded a few valid results with a Lev distance of 2.

I'm not a trained programmer per se, so I have not attempted to read your code, but I am now going to try that to see if it sheds light on the best method. In the meantime, I thought you might have a ready solution that you could describe in words, explaining what you more specifically mean in your comment here:


"The last comparison type is required for replaces and transposes only. But we need to check whether the suggested dictionary term is really a replace or an adjacent transpose of the input term to prevent false positives of higher edit distance (bank==bnak and bank==bink, but bank!=kanb and bank!=xban and bank!=baxn)."


Thank you in advance for your time and advice and for this remarkably effective idea. -rockroland

Generally, we have to calculate the Levenshtein distance ONLY for candidates that were found when deleting chars both on the input term AND on the dictionary term (deleting on both sides is required for replaces [kar->car] and adjacent transposes [acr->car] only).

Additionally, we can discard a candidate WITHOUT calculating the Levenshtein distance if the delete positions in the input term and the dictionary term were on different sides of the resulting delete stem, and the total number of deleted chars on both sides is greater than the maximum edit distance. But this requires that we store the delete positions together with the deletes.

Example:


[a]re
   re[do]

The sum of deleted chars is 3, the maximum edit distance is 2, the delete positions are on different sides of the stem, so we can discard the suggestion without calculating the Levenshtein distance.
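A rough Python sketch of that shortcut (under the simplifying assumption that the side of the stem on which the deletions happened can be recovered from the term and its delete stem; the function names are illustrative only):

```python
def delete_side(term, stem):
    """'left' if the stem is a suffix of term (all deletions before it),
    'right' if it is a prefix (all deletions after it), else 'mixed'."""
    if term.endswith(stem):
        return "left"
    if term.startswith(stem):
        return "right"
    return "mixed"

def can_discard(dict_term, input_term, stem, max_distance=2):
    """Reject a candidate without a Levenshtein calculation when the deleted
    chars lie on opposite sides of the shared stem and their total count
    exceeds the maximum edit distance."""
    deleted = (len(dict_term) - len(stem)) + (len(input_term) - len(stem))
    sides = {delete_side(dict_term, stem), delete_side(input_term, stem)}
    return deleted > max_distance and sides == {"left", "right"}

# "[a]re" vs "re[do]": 1 + 2 = 3 deleted chars > 2, deletions on opposite
# sides of the stem "re", so the pair is discarded without a distance calculation.
print(can_discard("are", "redo", "re"))   # True
```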

SymSpell contains additional optimizations (discarding candidates without a Levenshtein calculation) for the case when we don't need all suggestions below a maximum edit distance (Verbosity.All), but only those candidates with the lowest edit distance (Verbosity.Top, Verbosity.Closest).

That makes sense. Thanks for the clear answer; now I can proceed with confidence. Best Regards