Parsing date fails with unsanitized input

DDzwiedziu opened this issue 4 years ago

Using the included images:

❯ LANG=C make run
poetry run python parser/importer.py
Found the following images in /home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img
['IMG0007.jpg', 'IMG0003.jpg', 'IMG0001.jpg', 'IMG0004.jpg', 'IMG0008.jpg', 'IMG0006.jpg']
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0007.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0007.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0007.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0007.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0007.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0007.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Detected 233 diacritics
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0003.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0003.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0003.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0003.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0003.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0003.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Detected 8 diacritics
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0001.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0001.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0001.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0001.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0001.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0001.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0004.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0004.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0004.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0004.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0004.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0004.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Detected 62 diacritics
Image too small to scale!! (2x36 vs min width of 3)
Line cannot be recognized!!
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0008.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0008.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0008.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0008.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0008.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0008.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Running convert -rotate ' 90' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/img/IMG0006.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg'
Running convert -auto-level -sharpen 0x4.0 -contrast '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg'
Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0006.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
poetry run
Text, Market, Date, Sum
2 real
 1.0 Real
data/txt/IMG0004.jpg.out.txt.txt Real None 9.31
rewe
 1.0 REWE
data/txt/IMG0001.jpg.out.txt.txt REWE 04.12.2014 0.99
dm dm-drogerie markt
 0.8 Drogerie
data/txt/IMG0008.jpg.out.txt.txt Drogerie 11.12.2014 5.85
penny h-milch
 1.0 Penny
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/__init__.py", line 6, in main
    stats = ocr_receipts(config, receipt_files)
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/parse.py", line 124, in ocr_receipts
    receipt = Receipt(config, receipt.readlines())
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 40, in __init__
    self.parse()
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 62, in parse
    self.date = self.parse_date()
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 94, in parse_date
    dateutil.parser.parse(date_str)
  File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1374, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 649, in parse
    raise ParserError("Unknown string format: %s", timestr)
dateutil.parser._parser.ParserError: Unknown string format: 06.06. 2015
make: *** [Makefile:7: parse] Error 1

Notice the space in the date: "06.06. 2015".
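
One way to sanitize such a string before parsing is to drop the whitespace that OCR leaves after a dot. The following is only a minimal sketch, not the project's code: the helper names normalize_date and parse_german_date are hypothetical, and it assumes receipts use the German DD.MM.YYYY layout. It uses the standard library's datetime.strptime instead of dateutil so that it runs without extra dependencies.

```python
import re
from datetime import datetime


def normalize_date(date_str):
    """Remove whitespace after a dot, e.g. '06.06. 2015' -> '06.06.2015'."""
    return re.sub(r"\.\s+", ".", date_str.strip())


def parse_german_date(date_str):
    """Parse a DD.MM.YYYY date after normalizing OCR whitespace.

    Assumption: four-digit year and day-first order.
    Raises ValueError if the cleaned string still does not match.
    """
    cleaned = normalize_date(date_str)
    return datetime.strptime(cleaned, "%d.%m.%Y").date()


if __name__ == "__main__":
    print(parse_german_date("06.06. 2015"))
```

The same normalize_date step could equally be applied before handing the string to dateutil.parser.parse.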

Matthias Endler · Answer 1 · Thu Sep 10 2020 19:30:20 GMT+0800 (China Standard Time)

Thanks for the report @DDzwiedziu.
I tweaked the date regex in the config file to cover this case. Here's the change.
Let me know how it goes.

Dźwiedziu · Answer 2 · Thu Sep 10 2020 21:43:47 GMT+0800 (China Standard Time)

Unfortunately you've dropped a ')' somewhere:

Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0006.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
poetry run
Text, Market, Date, Sum
2 real
 1.0 Real
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/__init__.py", line 6, in main
    stats = ocr_receipts(config, receipt_files)
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/parse.py", line 124, in ocr_receipts
    receipt = Receipt(config, receipt.readlines())
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 40, in __init__
    self.parse()
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 62, in parse
    self.date = self.parse_date()
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 90, in parse_date
    m = re.match(self.config.date_format, line)
  File "/usr/lib/python3.8/re.py", line 189, in match
    return _compile(pattern, flags).match(string)
  File "/usr/lib/python3.8/re.py", line 302, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.8/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.8/sre_parse.py", line 948, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.8/sre_parse.py", line 443, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "/usr/lib/python3.8/sre_parse.py", line 836, in _parse
    raise source.error("missing ), unterminated subpattern",
re.error: missing ), unterminated subpattern at position 3
make: *** [Makefile:7: parse] Błąd 1

("Błąd" == "Error")
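
An unbalanced pattern like this can be caught when the config is loaded, rather than in the middle of a run. Below is a small sketch of that idea; check_pattern is a hypothetical helper, and the two patterns are simplified stand-ins, not the ones from the config file.

```python
import re


def check_pattern(pattern):
    """Return None if the regex compiles, else the error message."""
    try:
        re.compile(pattern)
    except re.error as err:
        return str(err)
    return None


# Simplified stand-ins: the first one never closes the named group.
broken = r".*?(?P<date>(\d{2}\.\d{2}\.\d{4})\s+"
fixed = r".*?(?P<date>(\d{2}\.\d{2}\.\d{4}))\s+"

if __name__ == "__main__":
    print(check_pattern(broken))
    print(check_pattern(fixed))
```

The position in the message points at the opening parenthesis that was never closed, here the start of the named group.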

Both balanced variants, .*?(?P<date>(\d{2,4}(\.\s?|[^a-zA-Z\d])\d{2}(\.\s?|[^a-zA-Z\d])(19|20)?\d\d)\s+) and .*?(?P<date>(\d{2,4}(\.\s?|[^a-zA-Z\d])\d{2}(\.\s?|[^a-zA-Z\d])(19|20)?\d\d))\s+, end up with this:

Running tesseract -l deu '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/tmp/IMG0006.jpg' '/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/data/txt/IMG0006.jpg.out.txt'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
poetry run
Text, Market, Date, Sum
2 real
 1.0 Real
Traceback (most recent call last):
  File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 655, in parse
    ret = self._build_naive(res, default)
  File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1238, in _build_naive
    if cday > monthrange(cyear, cmonth)[1]:
  File "/usr/lib/python3.8/calendar.py", line 124, in monthrange
    raise IllegalMonthError(month)
calendar.IllegalMonthError: bad month number 14; must be 1-12

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/__init__.py", line 6, in main
    stats = ocr_receipts(config, receipt_files)
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/parse.py", line 124, in ocr_receipts
    receipt = Receipt(config, receipt.readlines())
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 40, in __init__
    self.parse()
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 62, in parse
    self.date = self.parse_date()
  File "/home/dzwiedziu/Softwarez/gitbuckets/receipt-parser/parser/receipt.py", line 94, in parse_date
    dateutil.parser.parse(date_str)
  File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 1374, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/dzwiedziu/.cache/pypoetry/virtualenvs/parser-dlSOXmLn-py3.8/lib/python3.8/site-packages/dateutil/parser/_parser.py", line 657, in parse
    six.raise_from(ParserError(e.args[0] + ": %s", timestr), e)
TypeError: unsupported operand type(s) for +: 'int' and 'str'
make: *** [Makefile:7: parse] Error 1
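
The traceback suggests that the pattern now matches a string whose middle field is not a valid month (14), and that a single bad match aborts the whole batch. A rough sketch of a more defensive approach follows. It is not the project's parse_date: DATE_RE and safe_parse_date are hypothetical names, and it assumes day-first dates and that two-digit years belong to the 2000s.

```python
import re
from datetime import date

# Assumption: day.month.year, with an optional space after each dot.
DATE_RE = re.compile(
    r"(?P<day>\d{2})\.\s?(?P<month>\d{2})\.\s?(?P<year>(?:19|20)?\d{2})"
)


def safe_parse_date(line):
    """Return a date found in the line, or None if nothing valid matches."""
    m = DATE_RE.search(line)
    if m is None:
        return None
    day, month, year = int(m["day"]), int(m["month"]), int(m["year"])
    if year < 100:
        year += 2000  # assumption: two-digit years are 20xx
    try:
        return date(year, month, day)
    except ValueError:
        # Out-of-range month or day, e.g. month 14
        return None


if __name__ == "__main__":
    print(safe_parse_date("06.06. 2015"))
    print(safe_parse_date("20.14.15"))
```

Returning None for an implausible match keeps the run going, in the same way the earlier output already shows a receipt with the date None.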

Matthias Endler · Answer 3 · Fri Sep 11 2020 03:14:18 GMT+0800 (China Standard Time)

Nice! :smile: Can you play around a bit with the regex in the config to see if you can fix it? Maybe you can send me a PR with the fix. :hugs: