ReceiptManager / receipt-parser-legacy

A supermarket receipt parser written in Python using tesseract OCR

Home Page: https://tech.trivago.com/2015/10/06/python_receipt_parser/

make docker-run

schancee opened this issue

Hi,

When I run "make docker-run", I get the following error:
"Removing tmp folder
pipenv run python -m parser
Traceback (most recent call last):
File "/usr/local/lib/python3.8/runpy.py", line 193, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.8/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/src/app/parser/main.py", line 11, in <module>
main()
File "/usr/src/app/parser/main.py", line 6, in main
stats = parser.ocr_receipts(config, receipt_files)
File "/usr/src/app/parser/parser.py", line 125, in ocr_receipts
receipt = Receipt(config, receipt.readlines())
File "/usr/src/app/parser/receipt.py", line 40, in __init__
self.parse()
File "/usr/src/app/parser/receipt.py", line 62, in parse
self.date = self.parse_date()
File "/usr/src/app/parser/receipt.py", line 94, in parse_date
dateutil.parser.parse(date_str)
File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1374, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/root/.local/share/virtualenvs/app-lp47FrbD/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 649, in parse
raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: 06.06. 2015
Text, Market, Date, Sum
rewe
1.0 REWE
data/txt/IMG0001.jpg.out.txt.txt REWE 04.12.2014 0.99
kaiser's tengelmanrı gmbh
0.8 Kaiser's
data/txt/IMG0006.jpg.out.txt.txt Kaiser's 31.08.2015 15.95
dm dm-drogerie markt
0.8 Drogerie
data/txt/IMG0008.jpg.out.txt.txt Drogerie 11.12.2014 5.85
penny h-milch
1.0 Penny
make: *** [Makefile:7: parse] Error 1
Makefile:22: recipe for target 'docker-run' failed
make: *** [docker-run] Error 2
"
Any ideas what the problem could be?

Look at what triggered the error:

raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: 06.06. 2015

This suggests that the date regexp in config.yml cannot handle what OCR produced for the date: "06.06. 2015".

From config.yml:
# Matches dates like 19.08.15 and 19. 08. 2015
date_format: '.*?(?P<date>(\d{2,4}(\.\s?|[^a-zA-Z\d])\d{2}(\.\s?|[^a-zA-Z\d])(20)?1[3-6]))\s+'

Including that case in the regexp above should solve it.
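A quick way to check this suggestion is to run the quoted pattern against the failing string on its own. The sketch below uses only the standard library. The helper name extract_date and the day.month.year assumption are mine, not part of the project. In isolation the pattern does appear to capture "06.06. 2015", space included, so stripping the whitespace before parsing may be one possible fix alongside adjusting the regexp.

```python
import re
from datetime import datetime

# Pattern copied from the config.yml excerpt quoted above.
DATE_FORMAT = (
    r".*?(?P<date>(\d{2,4}(\.\s?|[^a-zA-Z\d])\d{2}"
    r"(\.\s?|[^a-zA-Z\d])(20)?1[3-6]))\s+"
)


def extract_date(line):
    """Return the date found in one OCR line, or None if nothing usable is found.

    Hypothetical helper for illustration; this is not the project's parse_date().
    """
    match = re.match(DATE_FORMAT, line)
    if match is None:
        return None
    # Drop whitespace that OCR inserted inside the date, e.g. "06.06. 2015".
    compact = re.sub(r"\s+", "", match.group("date"))
    # Assumption: dotted day.month.year order, with a 4-digit or 2-digit year.
    for fmt in ("%d.%m.%Y", "%d.%m.%y"):
        try:
            return datetime.strptime(compact, fmt)
        except ValueError:
            continue
    return None


if __name__ == "__main__":
    # Lines returned by readlines() end with a newline, which satisfies the
    # trailing \s+ in the pattern.
    print(extract_date("06.06. 2015\n"))
```

Other separators that the pattern allows (anything matched by [^a-zA-Z\d]) are not handled by the two strptime formats here and would return None.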

I didn't look at the code in receipt.py and objectview.py, so there could also be a problem there.

Yeah, I think it's what @mostlyfabulous already mentioned. The regex in the config would have to be expanded to support the format (06.06. 2015). Closing this to keep the issue tracker clean.