ReceiptManager / receipt-parser-legacy

A supermarket receipt parser written in Python using tesseract OCR

Home Page: https://tech.trivago.com/2015/10/06/python_receipt_parser/

Fix failing unit test for market name

mre opened this issue

In #23 (comment), @kiwita88 found the likely reason why our unit test for the market name 'p e n ny' fails. We should fix that.

commented

difflib is what this buggy feature currently uses. Its matching algorithm is based on Ratcliff-Obershelp ("gestalt pattern matching") and is generic with regard to data type: it compares any sequences of hashable elements, not just strings. There are algorithms better suited to fuzzy string similarity, such as Levenshtein distance. I believe switching algorithms is the best solution here.
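To make the difflib behavior concrete, here is a minimal sketch of how `difflib.SequenceMatcher` scores the spaced-out OCR string against candidate market names. The `similarity` helper and the candidate list are assumptions for illustration only, not the parser's actual code or configuration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return difflib's similarity ratio (0.0 to 1.0) for two strings."""
    # The first argument is the "junk" predicate; None disables it.
    return SequenceMatcher(None, a, b).ratio()


ocr_text = "p e n ny"            # the OCR reading mentioned in this issue
candidates = ["penny", "rewe"]   # hypothetical market names

for name in candidates:
    print(f"{ocr_text!r} vs {name!r}: {similarity(ocr_text, name):.3f}")
```

The extra spaces lower the score for the intended match, so whether the test passes depends on the similarity cutoff the parser applies.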

What do you think, @mre? I could be convinced to write up a PR if no other contributor can. If you're comfortable adding a dependency, it might make sense to lean on https://github.com/seatgeek/fuzzywuzzy for this too.

commented

Update: the existing packages I mentioned are GPLv2 licensed, which may not be desired, so perhaps a direct implementation of the Levenshtein algorithm could be added for this feature instead. Plenty of inspiration is available.
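For reference, a dependency-free Levenshtein implementation could look roughly like the sketch below. It uses the standard two-row dynamic programming formulation. The function names and the normalization in `levenshtein_ratio` are assumptions for illustration, not an agreed design for the PR.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible

    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,          # delete ch_a
                current[j - 1] + 1,       # insert ch_b
                previous[j - 1] + cost,   # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Scale the distance to a 0.0 to 1.0 similarity (an assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("p e n ny", "penny"))
print(levenshtein_ratio("p e n ny", "penny"))
```

Dividing by the longer length is only one way to turn the distance into a score; the cutoff for accepting a market name would still need tuning against real receipts.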

Hey @rayrr,
thanks for your input. Yes, switching to Levenshtein would be worth a try. Whether we use a library or not doesn't matter to me. Also GPLv2 is fine in my book.
So if you like and you find the time, please go ahead and whip up a PR for this. 👍