mammothb / symspellpy

Python port of SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm

How to prevent correction based only on frequency?

mzeidhassan opened this issue · comments

Hi @mammothb ,

I hope all is well with you.

I am having a hard time figuring out how to make symspell choose the closest term that exists in the dictionary.

For example:

Correct الأصبع to الأربع, 1, 10011

So, it fixes الأصبع to a very different word which is الأربع, although the correct word exists in the dictionary which is:

الإصبع

الإصبع 498

As you can see, the correct word has a lower frequency, but Symspell chooses the other word with the higher frequency.

Is there any way I can fix this?

Thanks in advance for your support!

commented

I think we've had a similar discussion about this issue before.

Is this similar to wanting sream to be corrected to steam instead of stream because it fits the context of the text? If so, I don't think it's possible since the package cannot choose a correction based on context.

https://github.com/bakwc/JamSpell - This guy promises that his spelling correction algorithm uses context (I didn't try it yet)

Hi @mammothb
Sorry for my belated reply. For some reason, I was not notified when you answered.

No, I don't want to fix 'sream' to be 'steam' instead of 'stream'.

I am not sure how to explain this, but I will try:

The misspelled word here is:
الأصبع

The desired output should be "الإصبع".
It's basically the exact same word; the only difference is in the alef hamza characters.

I mean أ vs إ

Symspell decided to replace the misspelled word with a totally different word "الأربع", although the closest word exists in the dictionary. I hope the attached image can explain it better.

Arabic is a right-to-left language, in case this makes a difference.

Thanks again!

image

Thanks @frutik for your reply. Making Symspellpy context-aware would be awesome, but it will require NN integration indeed.

I was testing another word:
الاغذية which should be corrected to الأغذية
but the dictionary also contains التغذية, which has a higher frequency.

5996 (التغذية) vs 1730 (الأغذية)

So, Symspellpy chose التغذية because it has higher frequency.

Changing its frequency from 5996 to 1730, the frequency of the desired word "الأغذية", seems to fix the issue. With both words at 1730, symspellpy picks the desired one. So frequency plays a role here, but I am not sure how to get around it. Any ideas?

commented

In your example with الأصبع, both the desired word and symspellpy's suggestion have an edit distance of 1, right? symspellpy does not take into account how "close" the wrong character is, e.g., "i" is closer to "l" than "n". Am I right to compare your example to symspellpy correcting "siip" to "snip" instead of "slip"?

Maybe you can build a mapping of similar characters and various weights, such as "l" to "i" is 0.8 and "l" to "k" is 0.5, and use that to select the results returned by symspellpy
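
The weighting idea above could be sketched roughly as follows. This is only an illustration, not part of symspellpy: the `SIMILAR_COST` table, its values, and the `weighted_cost` and `rerank` helpers are hypothetical, and the candidates are assumed to be plain `(term, distance, count)` tuples copied from whatever the lookup returned.

```python
# Hypothetical re-ranking sketch; none of these names come from symspellpy.
ALEF = "\u0627"              # bare alef
ALEF_HAMZA_ABOVE = "\u0623"  # alef with hamza above
ALEF_HAMZA_BELOW = "\u0625"  # alef with hamza below

# Assumed substitution costs: lower means "more easily confused".
# The values are illustrative, not tuned on real data.
SIMILAR_COST = {
    frozenset((ALEF, ALEF_HAMZA_ABOVE)): 0.2,
    frozenset((ALEF, ALEF_HAMZA_BELOW)): 0.2,
    frozenset((ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW)): 0.2,
}


def weighted_cost(source, candidate, distance, default=1.0):
    """Return a similarity-aware cost for one candidate.

    Only equal-length pairs (pure substitutions) are re-weighted;
    anything else keeps the edit distance that was already reported.
    """
    if len(source) != len(candidate):
        return float(distance)
    positional = sum(
        SIMILAR_COST.get(frozenset((a, b)), default)
        for a, b in zip(source, candidate)
        if a != b
    )
    # Never make a candidate worse than its reported edit distance.
    return min(positional, float(distance))


def rerank(source, candidates):
    """Sort (term, distance, count) tuples by weighted cost, then by count descending."""
    return sorted(
        candidates,
        key=lambda c: (weighted_cost(source, c[0], c[1]), -c[2]),
    )
```

With the two candidates from this thread (edit distance 1 each, counts 10011 and 498), the hamza-only substitution gets cost 0.2 and is ranked first despite its lower count. For this to help, the lookup has to return more than one candidate; as far as I know `Verbosity.CLOSEST` or `Verbosity.ALL` does that, and each result exposes `term`, `distance` and `count`.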

Thanks @mammothb for your reply. Yes, both have an edit distance of 1.

As for your example:

"Am I right to compare your example to symspellpy correcting "siip" to "snip" instead of "slip""

Not really, if we use the second example "الاغذية، التغذية". In Arabic, the letter أ comes before ت, yet Symspellpy still picks the more distant one. Once both have the same frequency, though, it gets it right.

If you have any sample code for such mapping to share, it would be awesome.

One last question:

Any idea why Symspellpy picks the right suggestion when the frequency counts are the same?
Which comes first when picking suggestions: edit distance or frequency?

Thanks

commented

May I know what you mean by close and distant? Were you not referring to how closely they resemble each other visually? Or did you mean in terms of alphabetical order, e.g., 'a' closer to 'b' than 'd'?

I meant the second one. The letter (أ = sounds A in Arabic) comes before the letter (ت = sounds T in Arabic).

commented

When sorting the suggestions, we only look at edit distance and frequency (code). So the alphabetical order of the suggestions is not considered.
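
The ordering described above can be illustrated with a small stand-alone sketch. The `Suggestion` class and `sort_key` function are hypothetical stand-ins written for this example, not symspellpy's own code; they only mimic "smaller edit distance first, then higher frequency".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """Hypothetical stand-in for a suggestion: term, edit distance, frequency."""
    term: str
    distance: int
    count: int


def sort_key(s):
    # Smaller distance wins; among equal distances, larger count wins.
    return (s.distance, -s.count)


suggestions = [
    Suggestion("الأغذية", 1, 1730),
    Suggestion("التغذية", 1, 5996),
]
ranked = sorted(suggestions, key=sort_key)
# Both terms are at distance 1, so the one with count 5996 comes first.
```

Nothing in this key looks at which characters differ, so two candidates at the same edit distance are separated by frequency alone.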

Thanks @mammothb for your reply. I will need to rethink how I can address this. I may come back to this issue later.

Hi @mammothb
I came across this issue now. Is there something wrong with the edit distance calculation somewhere?

I have this word:

الكترون
Symspell suggests to change it to
الكرتون (edit distance of 2)

While the nearest word with the closest edit distance is there already in the dictionary, which is:
إلكترون (edit distance of 1)

I used this online Levenshtein Distance tool to verify quickly.

Any idea?

Thanks

commented

Right now, symspellpy uses DamerauOSA by default (code), which gives an edit distance of 1 for both because it counts a swap of two adjacent characters as a single edit. I have not implemented a way to change the edit distance algorithm when you create the SymSpell object, but you can override it like so:

from symspellpy import SymSpell
from symspellpy.editdistance import DistanceAlgorithm

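# Create the SymSpell object as usual.
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# Override the underscore-prefixed (private) attribute that selects the distance algorithm.
sym_spell._distance_algorithm = DistanceAlgorithm.LEVENSHTEIN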
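
To see why the two algorithms disagree on this word, here is a minimal pure-Python sketch of both distances. The `levenshtein` and `damerau_osa` functions are written for illustration only and are not symspellpy's implementations; the words are given as escape sequences so the code points are unambiguous.

```python
def levenshtein(a, b):
    """Classic Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def damerau_osa(a, b):
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            # A swap of two adjacent characters counts as one edit.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


typed = "\u0627\u0644\u0643\u062a\u0631\u0648\u0646"      # the input word
suggested = "\u0627\u0644\u0643\u0631\u062a\u0648\u0646"  # two adjacent letters swapped
expected = "\u0625\u0644\u0643\u062a\u0631\u0648\u0646"   # first letter substituted

print(levenshtein(typed, suggested), damerau_osa(typed, suggested))  # 2 1
print(levenshtein(typed, expected), damerau_osa(typed, expected))    # 1 1
```

Under the transposition-aware distance both candidates tie at 1, so frequency decides between them. Under plain Levenshtein the swapped word costs 2 and the word with the substituted first letter is strictly closer.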

Thank you so much @mammothb for your support. I will test this code and let you know. I hope it will generate better results than DamerauOSA.

Hi @mammothb
Just wanted to confirm that the piece of code above works fine. I am closing the issue.
Thanks