How to prevent correction based only on frequency?

Question

mzeidhassan opened this issue 4 years ago · comments

Mohamed Zeid commented 4 years ago

Hi @mammothb ,

I hope all is well with you.

I am having a hard time figuring out how to make symspell choose the closest term that exists in the dictionary.

For example:

Correct الأصبع to الأربع, 1, 10011

So it corrects الأصبع to a very different word, الأربع, although the correct word exists in the dictionary:

الإصبع

الإصبع 498

As you can see, the second correct word has lower frequency, but Symspell chooses the other word with higher frequency.

Is there any way I can fix this?

Thanks in advance for your support!

mmb L · Answer 1 · Thu Jan 09 2020 16:10:04 GMT+0800 (China Standard Time)

I think we've had a similar discussion about this issue before.

Is this similar to wanting sream to be corrected to steam instead of stream because it fits the context of the text? If so, I don't think it's possible since the package cannot choose a correction based on context.

Andrew Kornilov · Answer 2 · Thu Jan 09 2020 16:16:01 GMT+0800 (China Standard Time)

https://github.com/bakwc/JamSpell - This guy promises that his spelling correction algorithm uses context (I didn't try it yet)

Mohamed Zeid · Answer 3 · Thu Jan 16 2020 07:03:40 GMT+0800 (China Standard Time)

Hi @mammothb
sorry for my belated reply. For some reason, I was not notified when you answered.

No, I don't want to fix 'sream' to be 'steam' instead of 'stream'.

I am not sure how to explain this, but I will try:

The misspelled word here is:
الأصبع

The desired output should be "الإصبع "
It's basically the same exact word, and the only difference is in the alef hamza characters

I mean أ vs إ

Symspell decided to replace the misspelled word with a totally different word "الأربع", although the closest word exists in the dictionary. I hope the attached image can explain it better.

Arabic is a right-to-left language, in case this makes a difference.

Thanks again!

Mohamed Zeid · Answer 4 · Thu Jan 16 2020 07:10:37 GMT+0800 (China Standard Time)

Thanks @frutik for your reply. Making Symspellpy context-aware would be awesome, but it will require NN integration indeed.

Mohamed Zeid · Answer 5 · Thu Jan 16 2020 07:54:55 GMT+0800 (China Standard Time)

I was testing another word:
الاغذية which should be corrected to الأغذية
but the dictionary has also التغذية with higher frequency.

5996 vs 1730

So, Symspellpy chose التغذية because it has higher frequency.

Changing its frequency from 5996 to match the frequency of the desired one, "الأغذية", seems to fix the issue. Once both have the same frequency (1730), symspellpy ends up choosing the desired one. So frequency plays a role here. I'm not sure how to get around it, though. Any ideas?

mmb L · Answer 6 · Fri Jan 17 2020 08:35:46 GMT+0800 (China Standard Time)

In your example with الأصبع, both the desired word and symspellpy's suggestion have an edit distance of 1, right? symspellpy does not take into account how "close" the wrong character is, e.g., "i" is closer to "l" than to "n". Am I right to compare your example to symspellpy correcting "siip" to "snip" instead of "slip"?

Maybe you can build a mapping of similar characters and various weights, such as "l" to "i" is 0.8 and "l" to "k" is 0.5, and use that to select the results returned by symspellpy
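A minimal sketch of that idea, assuming the candidates have already been collected as (term, distance, count) tuples. The names SIMILAR, substitution_similarity and rerank are hypothetical, the weights are made up for illustration, and the similarity function only handles same-length words that differ by substitutions. It is not part of symspellpy.

```python
# Unicode escapes keep the Arabic alef variants unambiguous in source code.
ALEF = "\u0627"              # ا
ALEF_HAMZA_ABOVE = "\u0623"  # أ
ALEF_HAMZA_BELOW = "\u0625"  # إ

# Hypothetical similarity weights in [0, 1] for unordered character pairs.
SIMILAR = {
    frozenset((ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW)): 0.9,
    frozenset((ALEF, ALEF_HAMZA_ABOVE)): 0.9,
    frozenset((ALEF, ALEF_HAMZA_BELOW)): 0.9,
    frozenset("li"): 0.8,
    frozenset("lk"): 0.5,
}


def substitution_similarity(typed, candidate):
    """Average similarity of the substituted characters.

    Only same-length words are compared position by position;
    anything else scores 0.0 in this sketch.
    """
    if len(typed) != len(candidate):
        return 0.0
    diffs = [(a, b) for a, b in zip(typed, candidate) if a != b]
    if not diffs:
        return 1.0
    total = sum(SIMILAR.get(frozenset(pair), 0.0) for pair in diffs)
    return total / len(diffs)


def rerank(typed, suggestions):
    """Order (term, distance, count) tuples by distance, then by
    character similarity (higher first), then by count (higher first)."""
    return sorted(
        suggestions,
        key=lambda s: (s[1], -substitution_similarity(typed, s[0]), -s[2]),
    )


# "siip" with two distance-1 candidates; the counts here are invented.
print(rerank("siip", [("snip", 1, 1000), ("slip", 1, 100)]))
```

With symspellpy, the input tuples could presumably be built from the term, distance and count attributes of the items returned by lookup with Verbosity.ALL, so that lower-frequency candidates are still available for reranking.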

Mohamed Zeid · Answer 7 · Sat Jan 18 2020 00:46:33 GMT+0800 (China Standard Time)

Thanks @mammothb for your reply. Yes, both have edit distance of 1.

As for your example:

"Am I right to compare your example to symspellpy correcting "siip" to "snip" instead of "slip""

Not really, if we use the second example "الاغذية، التغذية". In Arabic, the letter أ comes before ت, yet Symspellpy still picks the more distant one, but once both have the same frequency, it gets it right.

If you have any sample code for such mapping to share, it would be awesome.

One last question:

Any idea why when the frequency count is the same, Symspellpy picks the right suggestion?
Which comes first when it comes to picking suggestions; Edit Distance or Frequency?

Thanks

mmb L · Answer 8 · Sat Jan 18 2020 15:03:26 GMT+0800 (China Standard Time)

May I know what you mean by close and distant? Were you not referring to how closely they resemble each other visually? Or did you mean in terms of alphabetical order, e.g., 'a' is closer to 'b' than to 'd'?

Mohamed Zeid · Answer 9 · Sun Jan 19 2020 02:25:28 GMT+0800 (China Standard Time)

I meant the second one. The letter (أ = sounds A in Arabic) comes before the letter (ت = sounds T in Arabic).

mmb L · Answer 10 · Tue Jan 21 2020 18:40:15 GMT+0800 (China Standard Time)

When sorting the suggestions, we only look at edit distance and frequency, so the alphabetical order of the suggestions is not considered.
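The ordering described above can be sketched as follows. This is a standalone illustration rather than symspellpy's own code; Suggestion and sort_suggestions are hypothetical names, and the two entries reuse the frequencies quoted earlier in the thread.

```python
from typing import NamedTuple


class Suggestion(NamedTuple):
    term: str
    distance: int
    count: int


def sort_suggestions(items):
    """Smaller edit distance first; ties are broken by higher frequency."""
    return sorted(items, key=lambda s: (s.distance, -s.count))


# Both candidates are one edit away from the typed word, so the
# higher-frequency entry is ranked first.
frequent = Suggestion("التغذية", 1, 5996)
desired = Suggestion("الأغذية", 1, 1730)
print(sort_suggestions([desired, frequent])[0].count)
```

When distance and count are both equal, this sketch keeps the input order because Python's sort is stable. How symspellpy itself resolves such ties may differ.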

Mohamed Zeid · Answer 11 · Thu Jan 23 2020 06:36:57 GMT+0800 (China Standard Time)

Thanks @mammothb for your reply. I will need to rethink how I can address this issue. I may come back to it later.

Mohamed Zeid · Answer 12 · Sat Feb 08 2020 06:55:53 GMT+0800 (China Standard Time)

Hi @mammothb
I came across this issue now. Is there something wrong with the edit distance calculation somewhere?

I have this word:

الكترون
Symspell suggests to change it to
الكرتون (edit distance of 2)

While the nearest word with the closest edit distance is there already in the dictionary, which is:
إلكترون (edit distance of 1)

I used this online Levenshtein Distance tool to verify quickly.

Any idea?

Thanks

mmb L · Answer 13 · Sat Feb 08 2020 10:35:54 GMT+0800 (China Standard Time)

Right now, symspellpy uses DamerauOSA by default, which gives an edit distance of 1 for both. I have not implemented a way to change the edit distance algorithm when you create the SymSpell object, but you can override it like so:

from symspellpy import SymSpell
from symspellpy.editdistance import DistanceAlgorithm

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
# Override the default (DamerauOSA) by setting the private attribute directly
sym_spell._distance_algorithm = DistanceAlgorithm.LEVENSHTEIN
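To see why the two algorithms disagree on this word, here is a small standalone sketch of both distances in plain Python. The function edit_distance and its transpositions flag are hypothetical names for illustration and are not symspellpy's implementation. The OSA variant counts a swap of two adjacent characters as one edit, whereas Levenshtein counts it as two substitutions.

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance between a and b.

    With transpositions=True, swapping two adjacent characters also
    costs 1 (the optimal string alignment, or restricted Damerau, variant).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


typed = "الكترون"
swapped = "الكرتون"   # two adjacent letters swapped relative to typed
desired = "إلكترون"   # first letter substituted relative to typed

for candidate in (swapped, desired):
    print(
        edit_distance(typed, candidate),                       # Levenshtein
        edit_distance(typed, candidate, transpositions=True),  # OSA
    )
```

Under OSA both candidates are one edit away, so frequency decides between them. Under Levenshtein the swapped word is two edits away and the desired word is the closest match.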

Mohamed Zeid · Answer 14 · Sun Feb 09 2020 10:47:10 GMT+0800 (China Standard Time)

Thank you so much @mammothb for your support. I will test this code and let you know. I hope it will generate better results than DamerauOSA.

Mohamed Zeid · Answer 15 · Wed Apr 08 2020 12:31:59 GMT+0800 (China Standard Time)

Hi @mammothb
Just wanted to confirm that the piece of code above works fine. I am closing the issue.
Thanks