replaced_words is not correct
xcTorres opened this issue
address_str = "Perum GPS Griya Permata Sejahtera gang Guyub No 17 Ngumpak Dalem Dander Bojonegoro"
suggestions = sym_spell.lookup_compound(
    address_str, max_edit_distance=1, ignore_non_words=True, transfer_casing=False
)
for sug in suggestions:
    print(sug)
# "perum GPS griya permata sejahtera gang muyub no 17 ngumpakdalem dander bojonegoro, 11, 0"
We can see that "Ngumpak Dalem" is changed to "ngumpakdalem". But when I print replaced_words, I get this:
for k, v in sym_spell.replaced_words.items():
    print(f"origin: {k}, modify: {v.term}, edit_distance: {v.distance}")
origin: guyub, modify: muyub, edit_distance: 1
origin: ngumpak, modify: n tumpak, edit_distance: 2
The entry "origin: ngumpak, modify: n tumpak, edit_distance: 2" is not what I expected, since the suggestion itself contains "ngumpakdalem".
I believe this is because I missed updating replaced_words when a combination of 2 terms is the best match. I have pushed a fix to this branch. Could you please test and see if that fixes the problem for you?
I have tried it on my side with the following code
import pkg_resources
from symspellpy import SymSpell
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
bigram_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_bigramdictionary_en_243_342.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)
input_term = (
    "whereis th elove GPS hehad dated forImuch of thepast who "
    "couqdn'tread in sixtgrade and 16 microstru cture him"
)
suggestions = sym_spell.lookup_compound(
    input_term, max_edit_distance=1, ignore_non_words=True
)
for suggestion in suggestions:
    print(suggestion)
for k, v in sym_spell.replaced_words.items():
    print(f"origin: {k}, modify: {v.term}, edit_distance: {v.distance}")
and managed to get the following output
where is the love GPS he had dated for much of the past who couldn't read in six grade and 16 microstructure him, 9, 0
<omitted>
origin: microstru, modify: microstructure, edit_distance: 1
and it seems to address the issue.
Thanks, it works. Could I ask one more question? Is there a way to get the start and end index of the original word in the input string?
Unfortunately, there's no way to do that in symspellpy right now. You'll have to implement custom post-processing in your project for that.
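One possible shape for that post-processing is sketched below. It is not part of symspellpy: `find_word_spans` is a hypothetical helper, and it assumes the keys of replaced_words are lowercased, whitespace-separated tokens of the input (as in the output shown earlier in this thread). Inputs where symspellpy's own tokenisation splits words differently would need a matching tokeniser.

```python
import re
from typing import Dict, Iterable, List, Tuple


def find_word_spans(
    text: str, words: Iterable[str]
) -> Dict[str, List[Tuple[int, int]]]:
    """Return the (start, end) character spans of each word in text.

    Hypothetical helper, not a symspellpy API. Matching is
    case-insensitive, and `end` is exclusive, so text[start:end]
    gives back the original token.
    """
    wanted = {w.lower() for w in words}
    spans: Dict[str, List[Tuple[int, int]]] = {w: [] for w in wanted}
    # Assumption: tokens are runs of non-whitespace characters.
    for match in re.finditer(r"\S+", text):
        token = match.group().lower()
        if token in wanted:
            spans[token].append((match.start(), match.end()))
    return spans


address_str = (
    "Perum GPS Griya Permata Sejahtera gang Guyub No 17 "
    "Ngumpak Dalem Dander Bojonegoro"
)
# In a real project the words would come from sym_spell.replaced_words.keys().
spans = find_word_spans(address_str, ["guyub", "ngumpak"])
for word, positions in spans.items():
    for start, end in positions:
        print(f"origin: {word}, start: {start}, end: {end}, text: {address_str[start:end]}")
```

A word that occurs more than once yields several spans, so the caller still has to decide which occurrence a given replacement refers to.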
Thanks for your reply, and thanks for the package.