replaced_words is not correct

Question

xcTorres opened this issue 3 years ago

address_str = "Perum GPS Griya Permata Sejahtera gang Guyub No 17 Ngumpak Dalem Dander Bojonegoro"

suggestions = sym_spell.lookup_compound(address_str, max_edit_distance=1, ignore_non_words=True, transfer_casing=False)

for sug in suggestions:
    print(sug)

# "perum GPS griya permata sejahtera gang muyub no 17 ngumpakdalem dander bojonegoro, 11, 0"

We can see that "Ngumpak Dalem" is changed to "ngumpakdalem". But when I print replaced_words, the entry for "ngumpak" does not match:

for k, v in sym_spell.replaced_words.items():
    print(f"origin: {k}, modify: {v.term}, edit_distance: {v.distance}")

origin: guyub, modify: muyub, edit_distance: 1
origin: ngumpak, modify: n tumpak, edit_distance: 2

The entry "origin: ngumpak, modify: n tumpak, edit_distance: 2" is not what I expected: the suggestion uses "ngumpakdalem", not "n tumpak".

mmb L · Answer 1 · Tue Nov 30 2021 19:11:41 GMT+0800 (China Standard Time)

I believe this is because I missed updating replaced_words when a combination of 2 terms is the best match. I have pushed a fix to this branch. Could you please test and see if that fixes the problem for you?

I have tried it on my side with the following code:

import pkg_resources

from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

input_term = (
    "whereis th elove GPS hehad dated forImuch of thepast who "
    "couqdn'tread in sixtgrade and 16 microstru cture him"
)
suggestions = sym_spell.lookup_compound(
    input_term, max_edit_distance=1, ignore_non_words=True
)
for suggestion in suggestions:
    print(suggestion)

for k, v in sym_spell.replaced_words.items():
    print(f"origin: {k}, modify: {v.term}, edit_distance: {v.distance}")

and managed to get the following output:

where is the love GPS he had dated for much of the past who couldn't read in six grade and 16 microstructure him, 9, 0
<omitted>
origin: microstru, modify: microstructure, edit_distance: 1

and it seems to address the issue.

Xie Chong · Answer 2 · Wed Dec 01 2021 15:03:11 GMT+0800 (China Standard Time)

Thanks. It works. Could I ask one more question? Is there a way to get the start and end index of the original word in the input string?

mmb L · Answer 3 · Wed Dec 01 2021 19:10:47 GMT+0800 (China Standard Time)

Unfortunately, there's no way to do that in symspellpy right now. You'll have to implement some custom post-processing functions in your project for that.
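
For anyone who needs this, here is a minimal sketch of such a post-processing step. It is not part of symspellpy: `find_word_spans` is a hypothetical helper, and it assumes the keys of `replaced_words` are lowercased whole tokens of the input (as the output earlier in this thread suggests), so it searches for them case-insensitively in the original string. The `originals` list below is hard-coded from that output; in practice you would pass `sym_spell.replaced_words.keys()`.

```python
import re
from typing import Dict, Iterable, List, Tuple


def find_word_spans(text: str, words: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Map each word to the (start, end) spans where it occurs in text.

    Hypothetical helper, not a symspellpy API. Matching is case-insensitive
    and restricted to whole tokens; end is exclusive, so text[start:end]
    gives back the original spelling.
    """
    spans: Dict[str, List[Tuple[int, int]]] = {}
    for word in words:
        # Lookarounds keep "guyub" from matching inside a longer token.
        pattern = re.compile(rf"(?<!\w){re.escape(word)}(?!\w)", re.IGNORECASE)
        spans[word] = [(m.start(), m.end()) for m in pattern.finditer(text)]
    return spans


address_str = "Perum GPS Griya Permata Sejahtera gang Guyub No 17 Ngumpak Dalem Dander Bojonegoro"
# Assumed stand-in for sym_spell.replaced_words.keys()
originals = ["guyub", "ngumpak"]

spans = find_word_spans(address_str, originals)
for word, positions in spans.items():
    for start, end in positions:
        print(f"origin: {word}, start: {start}, end: {end}, text: {address_str[start:end]}")
```

A word that occurs several times in the input gets several spans, and this sketch cannot tell which occurrence was actually replaced, so you may need extra logic if that matters for your data.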

Xie Chong · Answer 4 · Thu Dec 02 2021 01:43:20 GMT+0800 (China Standard Time)

Thanks for your reply, and thanks for the package.