mammothb / symspellpy

Python port of SymSpell: 1 million times faster spelling correction & fuzzy search through Symmetric Delete spelling correction algorithm

replaced_words is not correct

xcTorres opened this issue

address_str = "Perum GPS Griya Permata Sejahtera gang Guyub No 17 Ngumpak Dalem Dander Bojonegoro"

suggestions = sym_spell.lookup_compound(
    address_str, max_edit_distance=1, ignore_non_words=True, transfer_casing=False
)

for sug in suggestions:
    print(sug)

# "perum GPS griya permata sejahtera gang muyub no 17 ngumpakdalem dander bojonegoro, 11, 0" 

We can see that "Ngumpak Dalem" is changed to "ngumpakdalem". But when I print replaced_words, I get a different result:

for k, v in sym_spell.replaced_words.items():
    print(f"origin: {k}, modify: {v.term}, edit_distance: {v.distance}")

origin: guyub, modify: muyub, edit_distance: 1
origin: ngumpak, modify: n tumpak, edit_distance: 2

The entry "origin: ngumpak, modify: n tumpak, edit_distance: 2" is not as expected: it does not match the suggestion above, which contains "ngumpakdalem".
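The mismatch can be checked mechanically. The snippet below is only a rough sketch: it hard-codes the suggestion string and the replaced_words pairs printed above as plain strings (stand-ins for the real SuggestItem objects), and the helper name `inconsistent_replacements` is made up for illustration. It flags every replacement whose term does not occur in the corrected string.

```python
def inconsistent_replacements(corrected, replaced):
    """Return the original words whose recorded replacement is absent from `corrected`.

    `replaced` maps original word -> replacement term (plain strings here,
    standing in for the `.term` attribute of the values in replaced_words).
    """
    return [orig for orig, term in replaced.items() if term not in corrected]


# Values copied from the output in this issue.
corrected = (
    "perum GPS griya permata sejahtera gang muyub no 17 "
    "ngumpakdalem dander bojonegoro"
)
replaced = {"guyub": "muyub", "ngumpak": "n tumpak"}

print(inconsistent_replacements(corrected, replaced))
```

A substring test is a crude check and could miss cases where the term appears elsewhere in the string by coincidence, but it is enough to show that the "ngumpak" entry disagrees with the suggestion.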

commented

I believe this is because I missed updating replaced_words when a combination of 2 terms is the best match. I have pushed a fix to this branch. Could you please test and see if that fixes the problem for you?

I have tried it on my side with the following code

import pkg_resources

from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

input_term = (
    "whereis th elove GPS hehad dated forImuch of thepast who "
    "couqdn'tread in sixtgrade and 16 microstru cture him"
)
suggestions = sym_spell.lookup_compound(
    input_term, max_edit_distance=1, ignore_non_words=True
)
for suggestion in suggestions:
    print(suggestion)

for k, v in sym_spell.replaced_words.items():
    print(f"origin: {k}, modify: {v.term}, edit_distance: {v.distance}")

and managed to get the following output

where is the love GPS he had dated for much of the past who couldn't read in six grade and 16 microstructure him, 9, 0
<omitted>
origin: microstru, modify: microstructure, edit_distance: 1

and it seems to address the issue.

Thanks, it works. Could I ask one more question? Is there a way to get the start and end index of the original word?

commented

Unfortunately, there's no way to do that in symspellpy right now. You'll have to implement some custom post-processing functions in your project for that.
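One possible shape for such post-processing is sketched below. It is not part of symspellpy: the helper `find_word_spans` is hypothetical, and it assumes that the keys of replaced_words are lowercased tokens of the input (as the output earlier in this thread suggests), so each key can be located with a case-insensitive whole-word search in the original string.

```python
import re


def find_word_spans(text, words):
    """Map each word to the (start, end) spans of its whole-word,
    case-insensitive matches in `text`. `end` is exclusive."""
    spans = {}
    for word in words:
        # Lookarounds instead of \b so words that start or end with
        # punctuation are still matched as whole tokens.
        pattern = re.compile(
            rf"(?<!\w){re.escape(word)}(?!\w)", re.IGNORECASE
        )
        spans[word] = [m.span() for m in pattern.finditer(text)]
    return spans


address_str = (
    "Perum GPS Griya Permata Sejahtera gang Guyub No 17 "
    "Ngumpak Dalem Dander Bojonegoro"
)
# Stand-in for sym_spell.replaced_words.keys() from the example above.
replaced_keys = ["guyub", "ngumpak"]

for word, found in find_word_spans(address_str, replaced_keys).items():
    for start, end in found:
        print(f"origin: {word}, start: {start}, end: {end}, "
              f"text: {address_str[start:end]}")
```

If the same word occurs more than once in the input, this returns every occurrence, and deciding which ones were actually corrected would need extra logic.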

Thanks for your reply, and thanks for the package.