wolfgarbe / SymSpell

SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm

Home Page:https://seekstorm.com/blog/1000x-spelling-correction/

LookupCompound return count=0

mammothb opened this issue · comments

commented

I believe LookupCompound returned count as the sum of the individual suggestion parts in the previous version, but the current version returns a count of 0. I have tested the following inputs from the README:

whereis th elove hehad dated forImuch of thepast who couqdn'tread in sixthgrade and ins pired him
in te dhird qarter oflast jear he hadlearned ofca sekretplan
the bigjest playrs in te strogsommer film slatew ith plety of funn
Can yu readthis messa ge despite thehorible sppelingmsitakes

I get count=0 for all of them. I have tried with both monogram+bigram and monogram dictionary only. May I know if this is a bug or an intended change?

First of all, thank you very much for your great work with your Python port of SymSpell!

In the previous version the returned count was not the sum of individual suggestion parts, but the Math.Min() of the individual suggestion parts:
string s = "";
foreach (SuggestItem si in suggestionParts)
{
    s += si.term + " ";
    suggestion.count = Math.Min(suggestion.count, si.count);
}

Now the returned count is calculated using the Naive Bayes probability of the individual suggestion parts:
double count = SymSpell.N;
foreach (SuggestItem si in suggestionParts)
{
    s.Append(si.term + " ");
    count *= (double)si.count / (double)SymSpell.N;
}
suggestion.count = (long)count;

The idea behind this is that the count should reflect the term frequency of the suggestion in the dictionary. For a single term this is easy: we just look up the term and read its count from the frequency dictionary. For multiple terms or phrases this is more difficult, as the single-term frequency dictionary has no counts for them. So we calculate a (hypothetical) frequency using the Naive Bayes probability. Each factor si.count / SymSpell.N is smaller than 1, so for combinations of many terms the product converges to zero, and the cast to long truncates any value below 1 to 0.
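To make the effect concrete, here is a minimal Python sketch of the two calculations described above. It is only an illustration of the arithmetic, not code from SymSpell or from the Python port: the function names, the value chosen for N, and the per-term counts are all hypothetical.

```python
def min_count(counts):
    """Older behaviour as described above: the smallest count among the parts."""
    return min(counts)


def naive_bayes_count(counts, n):
    """Newer behaviour as described above: N * product(count_i / N), truncated.

    `n` plays the role of SymSpell.N (the corpus size); int() truncates
    toward zero, like the (long) cast in the C# snippet.
    """
    estimate = float(n)
    for c in counts:
        estimate *= c / n
    return int(estimate)


# Hypothetical numbers, chosen only to show the trend.
N = 10**12                       # assumed corpus size, not the library's value
short_phrase = [5 * 10**9, 2 * 10**9]   # two fairly common terms
long_phrase = [5 * 10**9] * 6           # six fairly common terms

print(min_count(long_phrase))             # stays at the smallest term count
print(naive_bayes_count(short_phrase, N)) # still well above zero
print(naive_bayes_count(long_phrase, N))  # product falls below 1, truncates to 0
```

With these made-up values, every extra term multiplies the estimate by 0.005, so a six-term phrase ends up far below 1 before truncation. This matches the observation that long README sentences yield count=0 while shorter inputs do not.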

commented

I see. Thank you for the clarification. I am able to get non-zero count values when I use shorter input phrases.