ReceiptManager / receipt-parser-legacy

A supermarket receipt parser written in Python using tesseract OCR

Home Page: https://tech.trivago.com/2015/10/06/python_receipt_parser/

Explanation of sum_format and date_format?

sagarhukkire opened this issue

Hi
Thanks for your tutorial, it's a nice introduction. I was reading config.yml and couldn't understand how sum_format and date_format work. Could you explain them briefly? Based on that, I'd like to add some more fields to the parser.

Thanks in advance
Sagar

commented

date_format matches dates like 19.08.15 and 19. 08. 2015; it is used to parse each date in a receipt.
sum_format is analogous.
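
To make that concrete, here is a minimal sketch of how a pattern like date_format could be applied to the OCR'd lines of a receipt. It is not the parser's actual code: the function name `find_date` is hypothetical, and the pattern assumes that the named group is called `date` and that the dots are escaped, since the pattern as rendered in this thread seems to have lost both.

```python
import re

# Assumption: group name "date" and escaped dots (\.) are a best guess;
# they are not confirmed by the pattern as it is shown in this thread.
DATE_FORMAT = (
    r'.*?(?P<date>(\d{2,4}(\.\s?|[^a-zA-Z\d])\d{2}'
    r'(\.\s?|[^a-zA-Z\d])(20)?1[3-6]))\s+'
)


def find_date(lines):
    """Return the first date-like string found in the lines, or None.

    Hypothetical helper: tries the pattern against each line and
    stops at the first line that matches.
    """
    for line in lines:
        match = re.match(DATE_FORMAT, line)
        if match:
            return match.group('date')
    return None


receipt = [
    "SUPERMARKET XYZ",
    "Datum: 19. 08. 2015 12:30",
    "SUMME 12,99",
]
print(find_date(receipt))
```

Reading the pattern left to right: `.*?` skips any leading text, `\d{2,4}` and `\d{2}` are the first two numeric parts, each followed by a dot (optionally with a space) or any other non-alphanumeric separator, `(20)?1[3-6]` restricts the year to 13 through 16 (or 2013 through 2016), and `\s+` requires whitespace after the date.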

I meant to ask how this line works. If I want to add support for 19-Aug-2015 or 08/19/2015, how should I change the following line? I hope that's clearer now.

date_format: '.*?(?P(\d{2,4}(.\s?|[^a-zA-Z\d])\d{2}(.\s?|[^a-zA-Z\d])(20)?1[3-6]))\s+'

Working with regular expressions is always a very... delicate endeavor.
Usually I use interactive tools like the awesome regex101 to come up with somewhat working expressions.
So here's a start, which matches your formats (including the existing ones):

(\d+)(.|-|\/)\s?([A-Za-z0-9]+)(.|-|\/)\s?(20)?(\d+)

Demo: https://regex101.com/r/HKXAbS/1

You might still need the .*? and the \s+ at the beginning and the end to make it work.
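
As a rough sketch of what that could look like in Python: the pattern below wraps the expression above in a named group, with `.*?` in front and `\s+` behind. The group name `date`, the escaped dots, and the helper `extract_date` are assumptions made for illustration, not something taken from the parser.

```python
import re

# Assumption: dots are escaped here so that only ".", "-" or "/" count
# as separators; the group name "date" is also just a guess.
EXTENDED_DATE_FORMAT = (
    r'.*?(?P<date>(\d+)(\.|-|\/)\s?([A-Za-z0-9]+)(\.|-|\/)\s?(20)?(\d+))\s+'
)


def extract_date(line):
    """Return the date-like part of a line, or None if nothing matches."""
    match = re.match(EXTENDED_DATE_FORMAT, line)
    return match.group('date') if match else None


# Note the trailing whitespace: the pattern requires it after the date.
for sample in ["19.08.15 ", "19. 08. 2015 ", "19-Aug-2015 ", "08/19/2015 "]:
    print(repr(sample), "->", extract_date(sample))
```

Keep in mind that this version is looser than the original: it no longer limits the year to 2013 through 2016, so it may also pick up other number sequences separated by dots, dashes, or slashes.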

you are winner man...thanks @mre