wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm

Home Page: https://seekstorm.com/blog/1000x-spelling-correction/

Capable of similarity checks?

nhaberl opened this issue:

I am using the Jaro-Winkler distance algorithm to check whether a given string is similar to another.
Based on the score, I show the results.

Now the question is whether SymSpell can do this better or with better performance.

var score = symspell.CalculateSimilarity(searchText, textFromDatabase)

SymSpell uses the Damerau-Levenshtein edit distance to calculate string similarity. For a basic comparison of one string against another, Damerau-Levenshtein is not faster than Jaro-Winkler (https://stackoverflow.com/a/25540945/1824135), and SymSpell doesn't change that.

SymSpell shines when you want to find, very quickly, the terms in a large dictionary that are most similar to an input string. The single term-to-term comparison is not what gets faster. Rather, a SymSpell lookup in a large dictionary is much faster than the naive approach of comparing all n dictionary terms sequentially with the input term, and also much faster than Norvig's approach (generating many similar candidates from the input string, which are then looked up in the dictionary).
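To make the difference concrete, here is a minimal sketch of the symmetric-delete idea in Python. It is not SymSpell's actual code (SymSpell itself is a C# library with further optimizations); the function names `deletes`, `osa_distance`, `build_index` and `lookup` are made up for this illustration, and the distance used is the restricted Damerau-Levenshtein variant (optimal string alignment). Deletes of every dictionary term are precomputed once, so a lookup only has to generate deletes of the query and verify the few terms that share one.

```python
from collections import defaultdict


def deletes(term, max_distance):
    """Return term plus every string reachable by deleting up to max_distance characters."""
    results = {term}
    frontier = {term}
    for _ in range(max_distance):
        frontier = {t[:i] + t[i + 1:] for t in frontier for i in range(len(t))}
        results |= frontier
    return results


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # Adjacent transposition counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def build_index(dictionary, max_distance):
    """Precompute once: map each delete variant to the dictionary terms that produce it."""
    index = defaultdict(set)
    for term in dictionary:
        for variant in deletes(term, max_distance):
            index[variant].add(term)
    return index


def lookup(index, query, max_distance):
    """Collect terms sharing a delete variant with the query, then verify the real distance."""
    candidates = set()
    for variant in deletes(query, max_distance):
        candidates |= index.get(variant, set())
    scored = ((osa_distance(query, term), term) for term in candidates)
    return sorted((dist, term) for dist, term in scored if dist <= max_distance)
```

With a toy dictionary of `["hello", "help", "world", "word"]` and a maximum distance of 2, `lookup(build_index(words, 2), "helo", 2)` only computes the edit distance for `hello` and `help`; the naive approach would compute it for all four terms.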

If a similarity check against a large dictionary is your use case, you should give SymSpell a try. You can even take the returned top-k candidates, calculate the Jaro-Winkler score for them (which is fast because the number of candidates pre-filtered by SymSpell is much smaller than the number of terms in the original dictionary), and reorder the candidates according to that score.
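The re-ranking step could look roughly like the following Python sketch. It assumes you already have a short list of candidate strings from a dictionary lookup; `jaro`, `jaro_winkler` and `rerank` are hypothetical helper names written for this example, not part of SymSpell. The Jaro-Winkler score here is a similarity (higher means more similar), using the common prefix scale of 0.1 and a prefix cap of 4 characters.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        # Only characters within the sliding window can match.
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def rerank(query, candidates):
    """Order the pre-filtered candidates by descending Jaro-Winkler similarity to the query."""
    return sorted(candidates, key=lambda term: jaro_winkler(query, term), reverse=True)
```

Because only the handful of pre-filtered candidates is scored, the cost of this step does not depend on the size of the full dictionary.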