LookupCompound return count=0

Question

mammothb opened this issue 5 years ago

I believe LookupCompound returned count as the sum of the individual suggestion parts in the previous version, but the current version returns count as 0. I have tested the following inputs from the README:

whereis th elove hehad dated forImuch of thepast who couqdn'tread in sixthgrade and ins pired him
in te dhird qarter oflast jear he hadlearned ofca sekretplan
the bigjest playrs in te strogsommer film slatew ith plety of funn
Can yu readthis messa ge despite thehorible sppelingmsitakes

I get count=0 for all of them. I have tried both the unigram+bigram dictionaries and the unigram dictionary alone. May I know if this is a bug or an intended change?

Wolf Garbe · Answer 1 · Fri Sep 20 2019 16:27:45 GMT+0800 (China Standard Time)

First of all, thank you very much for your great work with your Python port of SymSpell!

In the previous version the returned count was not the sum of individual suggestion parts, but the Math.Min() of the individual suggestion parts:
    string s = "";
    foreach (SuggestItem si in suggestionParts)
    {
        s += si.term + " ";
        suggestion.count = Math.Min(suggestion.count, si.count);
    }
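
As a rough illustration of that earlier behavior, here is a minimal Python sketch. It is not taken from the SymSpell or symspellpy source: the `SuggestItem` tuple, the `min_count` helper, and the counts are all made up for the example.

```python
from typing import NamedTuple


class SuggestItem(NamedTuple):
    """Hypothetical stand-in for one corrected part of a compound suggestion."""
    term: str
    count: int


def min_count(parts):
    """Old-style compound count: the smallest count among the parts."""
    return min(p.count for p in parts)


# Illustrative counts only, not real dictionary values.
parts = [
    SuggestItem("where", 123_000),
    SuggestItem("is", 4_700_000),
    SuggestItem("the", 23_000_000),
]
print(min_count(parts))
```

With this rule the compound count can never drop below the rarest part's count, no matter how long the phrase is.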

Now the returned count is calculated using the Naive Bayes probability of the individual suggestion parts:
    double count = SymSpell.N;
    foreach (SuggestItem si in suggestionParts)
    {
        s.Append(si.term + " ");
        count *= (double)si.count / (double)SymSpell.N;
    }
    suggestion.count = (long)count;

The idea behind this is that the count should reflect the term frequency of the suggestion in the dictionary. For a single term this is easy: we just look up its count in the frequency dictionary. For multiple terms or phrases this is more difficult, as the single-term frequency dictionary has no counts for them. So we calculate a (hypothetical) frequency using the Naive Bayes probability. Each factor count/N is less than 1, so for long term combinations the product converges to zero, and the cast to long truncates it to 0.
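
To see why long inputs end up with a count of 0, here is a minimal Python sketch of the same arithmetic. It is an illustration under stated assumptions, not the SymSpell or symspellpy implementation: `N_ASSUMED` is a round placeholder for the corpus size (the real `SymSpell.N` differs), and the per-word counts are invented.

```python
# Assumed corpus size for illustration; NOT the real SymSpell.N value.
N_ASSUMED = 10**12


def naive_bayes_count(counts, n=N_ASSUMED):
    """Estimate a phrase count as n * product(count_i / n), truncated to int.

    Mirrors the C# loop above: start at n, multiply by each part's
    relative frequency, then truncate like the (long) cast does.
    """
    estimate = float(n)
    for c in counts:
        estimate *= c / n
    return int(estimate)


# Invented counts: each word occurs about once per 1000 words of the corpus.
short_phrase = [10**9] * 2   # two parts
long_phrase = [10**9] * 5    # five parts

print(naive_bayes_count(short_phrase))  # still comfortably above zero
print(naive_bayes_count(long_phrase))   # estimate falls below 1 and truncates to 0
```

Every extra part multiplies the estimate by a factor below 1, so even fairly common words push a multi-word phrase under 1 after a handful of parts, and the integer truncation then reports 0.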

mmb L · Answer 2 · Fri Sep 20 2019 16:41:12 GMT+0800 (China Standard Time)

I see, thank you for the clarification. I am able to get non-zero count values when I use shorter input phrases.