Parsing date fails with unsanitized input
DDzwiedziu opened this issue · comments
Using the included images:
❯ LANG=C make run
poetry run python parser/importer.py
Found the following images in /home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img
['IMG0007.jpg', 'IMG0003.jpg', 'IMG0001.jpg', 'IMG0004.jpg', 'IMG0008.jpg', 'IMG0006.jpg']
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0007.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0007.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0007.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0007.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0007.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0007.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Detected 233 diacritics
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0003.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0003.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0003.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0003.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0003.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0003.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Detected 8 diacritics
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0001.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0001.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0001.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0001.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0001.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0001.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0004.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0004.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0004.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0004.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0004.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0004.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Detected 62 diacritics
Image too small to scale!! (2x36 vs min width of 3)
Line cannot be recognized!!
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0008.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0008.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0008.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0008.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0008.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0008.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0006.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0006.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
poetry run
Text, Market, Date, Sum
2 real
1.0 Real
data/txt/IMG0004.jpg.out.txt.txt Real None 9.31
rewe
1.0 REWE
data/txt/IMG0001.jpg.out.txt.txt REWE 04.12.2014 0.99
dm dm-drogerie markt
0.8 Drogerie
data/txt/IMG0008.jpg.out.txt.txt Drogerie 11.12.2014 5.85
penny h-milch
1.0 Penny
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/__init__.py", line 6, in main
stats = ocr_receipts(config, receipt_files)
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/parse.py", line 124, in ocr_receipts
receipt = Receipt(config, receipt.readlines())
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 40, in __init__
self.parse()
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 62, in parse
self.date = self.parse_date()
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 94, in parse_date
dateutil.parser.parse(date_str)
File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1374, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 649, in parse
raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: 06.06. 2015
make: *** [Makefile:7: parse] Error 1
Notice the space in the date: "06.06. 2015".
Thanks for the report @DDzwiedziu.
I updated the date in the config file a bit to cover this case. Here's the change.
Let me know how it goes.
Unfortunately you've dropped a ')' somewhere:
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0006.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
poetry run
Text, Market, Date, Sum
2 real
1.0 Real
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/__init__.py", line 6, in main
stats = ocr_receipts(config, receipt_files)
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/parse.py", line 124, in ocr_receipts
receipt = Receipt(config, receipt.readlines())
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 40, in __init__
self.parse()
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 62, in parse
self.date = self.parse_date()
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 90, in parse_date
m = re.match(self.config.date_format, line)
File "/usr/lib/python3.8/re.py", line 189, in match
return _compile(pattern, flags).match(string)
File "/usr/lib/python3.8/re.py", line 302, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "/usr/lib/python3.8/sre_parse.py", line 836, in _parse
raise source.error("missing ), unterminated subpattern",
re.error: missing ), unterminated subpattern at position 3
make: *** [Makefile:7: parse] Błąd 1
("Błąd" == "Error")
.*?(?P<date>(\d{2,4}(\.\s?|[^a-zA-Z\d])\d{2}(\.\s?|[^a-zA-Z\d])(19|20)?\d\d)\s+)
and .*?(?P<date>(\d{2,4}(\.\s?|[^a-zA-Z\d])\d{2}(\.\s?|[^a-zA-Z\d])(19|20)?\d\d))\s+
are ending up with this:
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0006.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
poetry run
Text, Market, Date, Sum
2 real
1.0 Real
Traceback (most recent call last):
File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 655, in parse
ret = self._build_naive(res, default)
File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1238, in _build_naive
if cday > monthrange(cyear, cmonth)[1]:
File "/usr/lib/python3.8/calendar.py", line 124, in monthrange
raise IllegalMonthError(month)
calendar.IllegalMonthError: bad month number 14; must be 1-12
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/__init__.py", line 6, in main
stats = ocr_receipts(config, receipt_files)
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/parse.py", line 124, in ocr_receipts
receipt = Receipt(config, receipt.readlines())
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 40, in __init__
self.parse()
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 62, in parse
self.date = self.parse_date()
File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 94, in parse_date
dateutil.parser.parse(date_str)
File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1374, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 657, in parse
six.raise_from(ParserError(e.args[0] + ": %s", timestr), e)
TypeError: unsupported operand type(s) for +: 'int' and 'str'
make: *** [Makefile:7: parse] Error 1
Nice! :smile: Can you play around a bit with the regex in the config to see if you can fix it? Maybe you can send me a PR with the fix. :hugs: